Hello I am running a job with following configurations to tr Outerbounds #ask-metaflow

stale-vr-93035

10/14/2024, 7:23 PM

Hello, I am running a job with the following configuration to train a model:

"""A sample Metaflow pipeline using Axolotl for training a model.

This reads a local Axolotl training config file, forwards it to the trainer,
and then triggers Axolotl to do the end to end training.

To run, use a command like
  python scripts/axolotl_outerbounds.py --environment=fast-bakery run --config experiments/collinear-guard/base/pointwise_phi35_base.yaml

This assumes you are running from the base planar repo and
experiments/collinear-guard/base/pointwise_phi35_base.yaml contains a proper
complete axolotl config.
"""

import os
from metaflow import (
    FlowSpec,
    step,
    card,
    pypi,
    kubernetes,
    retry,
    IncludeFile,
    environment,
)

from gpu_profile import gpu_profile


class AxolotlTrainWorkflow(FlowSpec):
    config = IncludeFile(
        "config",
        is_text=True,
        help="The local Axolotl config file to use for training.",
        default="config.yaml",
    )
    @kubernetes(
        cpu=25,
        gpu=8,
        memory=1000_000,
        disk=500_000,
        node_selector="gpu.nvidia.com/class=A100_NVLINK_80GB",
        image="public.ecr.aws/l8v1m7k8/aditya/axolotl:latest",
    )
    @gpu_profile(interval=1)
    @card()

using the command

python axolotl_outerbounds.py --environment=fast-bakery run --config /root/sky_workdir/outerbounds/llama_8b.yaml

The process is running fine, as shown in the attached image. I want to kill it and start another one, but I am unable to kill it. Can someone please help?

ancient-application-36103

10/14/2024, 7:25 PM

@stale-vr-93035 - you should be able to use

python flow.py kubernetes kill --help

to kill runaway jobs
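
For illustration, the kill command above could be assembled from Python as in the sketch below. The helper name `build_kill_command` is hypothetical, and the `--run-id` and `--my-runs` flags are assumptions about what `kubernetes kill --help` lists for your Metaflow version; check the help output before relying on them. With no arguments the helper falls back to `--help`, which is the only flag mentioned in this thread.

```python
import sys
from typing import List, Optional


def build_kill_command(
    flow_file: str,
    run_id: Optional[str] = None,
    my_runs: bool = False,
) -> List[str]:
    """Build the argv for `python <flow_file> kubernetes kill`.

    NOTE: `--run-id` and `--my-runs` are assumed flag names; confirm them
    with `python <flow_file> kubernetes kill --help`.
    """
    cmd = [sys.executable, flow_file, "kubernetes", "kill"]
    if run_id is not None:
        # Assumed flag: restrict the kill to a single run.
        cmd += ["--run-id", str(run_id)]
    if my_runs:
        # Assumed flag: kill all unfinished tasks started by the current user.
        cmd.append("--my-runs")
    if run_id is None and not my_runs:
        # Nothing selected: only show the help text, never kill blindly.
        cmd.append("--help")
    return cmd


# To actually run it (this executes the Metaflow CLI):
#   import subprocess
#   subprocess.run(build_kill_command("axolotl_outerbounds.py"), check=True)
```

Returning the argument list instead of running it right away makes it easy to print and inspect the command before anything is killed.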

stale-vr-93035

10/14/2024, 7:26 PM

Instead of flow.py, do I need to run it on axolotl_outerbounds.py?

ancient-application-36103

10/14/2024, 7:26 PM

yep

ancient-application-36103

10/14/2024, 7:27 PM

btw, where are you running this flow? On outerbounds or on a self-hosted stack? If on outerbounds, you can also use the realtime gpu_profile card.

stale-vr-93035

10/14/2024, 7:30 PM

I am running it on outerbounds. Can you please elaborate on how to use the realtime gpu_profile card, with an example?

stale-vr-93035

10/14/2024, 7:31 PM

Also, does it take time for the outerbounds platform to show that it has been killed? I see it's still running.

ancient-application-36103

10/14/2024, 7:32 PM

yeah the UI is eventually consistent so it may take a while to reflect there
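
Since the status can lag behind the kill, one option is to poll a status check with a timeout instead of refreshing the page. The sketch below is generic: `wait_until` and the `check` callable are hypothetical names, and how `check` decides that the job is gone (for example by inspecting the cluster or the Metaflow CLI output) is left as an assumption for your setup.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` repeatedly until it returns True or the timeout expires.

    Returns True if `check` succeeded, False if the timeout was reached first.
    `check` is supplied by the caller; nothing here talks to Metaflow itself.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example with a stand-in check that succeeds on the third poll.
_states = iter([False, False, True])
job_gone = wait_until(lambda: next(_states), timeout_s=5, interval_s=0)
print(job_gone)
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.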

ancient-application-36103

10/14/2024, 7:33 PM

re: realtime gpu profile - i can follow up on our private slack channel for outerbounds

stale-vr-93035

10/14/2024, 7:33 PM

also what's the best way to see how many GPUs I can use?

ancient-application-36103

10/14/2024, 8:06 PM

this should also be available on the compute status page in outerbounds
