# ask-metaflow
s
Hello, I am running a job with the following configuration to train a model:
"""A sample Metaflow pipeline using Axolotl for training a model.

This reads a local Axolotl training config file, forwards it to the trainer,
and then triggers Axolotl to do the end to end training.

To run, use a command like
  python scripts/axolotl_outerbounds.py --environment=fast-bakery run --config experiments/collinear-guard/base/pointwise_phi35_base.yaml

This assumes you are running from the base planar repo and
experiments/collinear-guard/base/pointwise_phi35_base.yaml contains a proper
complete axolotl config.
"""

import os
from metaflow import (
    FlowSpec,
    step,
    card,
    pypi,
    kubernetes,
    retry,
    IncludeFile,
    environment,
)

from gpu_profile import gpu_profile


class AxolotlTrainWorkflow(FlowSpec):
    config = IncludeFile(
        "config",
        is_text=True,
        help="The local Axolotl config file to use for training.",
        default="config.yaml",
    )
    @kubernetes(
        cpu=25,
        gpu=8,
        memory=1000_000,
        disk=500_000,
        node_selector="gpu.nvidia.com/class=A100_NVLINK_80GB",
        image="public.ecr.aws/l8v1m7k8/aditya/axolotl:latest",
    )
    @gpu_profile(interval=1)
    @card()
I launch it with this command:
python axolotl_outerbounds.py --environment=fast-bakery run --config /root/sky_workdir/outerbounds/llama_8b.yaml
The process is running fine, as shown in the attached image, but I want to kill it and start another one. I am unable to kill it. Can someone please help me?
a
@stale-vr-93035 - you should be able to use
python flow.py kubernetes kill --help
to kill runaway jobs
s
Instead of flow.py, do I need to run it on axolotl_outerbounds.py?
a
yep
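For reference, here is a minimal sketch of that invocation using the flow file from this thread. It only builds (and optionally runs) the `kubernetes kill --help` command mentioned above. The helper names `kill_help_command` and `show_kill_help` are hypothetical and not part of Metaflow, and the actual kill options are whatever the help output of your installed Metaflow version lists.

```python
import subprocess
import sys


def kill_help_command(flow_file):
    """Return the argv for `python <flow_file> kubernetes kill --help`.

    Hypothetical helper for illustration; not part of Metaflow.
    """
    return [sys.executable, flow_file, "kubernetes", "kill", "--help"]


def show_kill_help(flow_file, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = kill_help_command(flow_file)
    print(" ".join(cmd))
    if not dry_run:
        # Run from the directory that contains the flow file, in an
        # environment where Metaflow is installed.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    show_kill_help("axolotl_outerbounds.py")
```

By default this is a dry run that just prints the command. Pass `dry_run=False` to execute it and read the available kill options.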
btw, where are you running this flow: on outerbounds or on a self-hosted stack? if on outerbounds, you can also use the realtime gpu_profile card.
s
I am running it on outerbounds. Can you please elaborate on how to use the realtime gpu_profile card, with an example?
Also, does it take time for the outerbounds platform to show that the job has been killed? I see it is still running.
a
yeah the UI is eventually consistent so it may take a while to reflect there
re: realtime gpu profile - i can follow up on our private slack channel for outerbounds
s
also what's the best way to see how many GPUs I can use?
a
this should also be available in the compute status page in the outerbounds ui