stale-vr-93035
10/14/2024, 7:23 PM
"""A sample Metaflow pipeline using Axolotl for training a model.

This reads a local Axolotl training config file, forwards it to the trainer,
and then triggers Axolotl to do the end-to-end training.

To run, use a command like:

    python scripts/axolotl_outerbounds.py --environment=fast-bakery run --config experiments/collinear-guard/base/pointwise_phi35_base.yaml

This assumes you are running from the base planar repo and that
experiments/collinear-guard/base/pointwise_phi35_base.yaml contains a
complete, valid Axolotl config.
"""
import os

from metaflow import (
    FlowSpec,
    step,
    card,
    pypi,
    kubernetes,
    retry,
    IncludeFile,
    environment,
)
from gpu_profile import gpu_profile


class AxolotlTrainWorkflow(FlowSpec):
    config = IncludeFile(
        "config",
        is_text=True,
        help="The local Axolotl config file to use for training.",
        default="config.yaml",
    )

    @kubernetes(
        cpu=25,
        gpu=8,
        memory=1_000_000,
        disk=500_000,
        node_selector="gpu.nvidia.com/class=A100_NVLINK_80GB",
        image="public.ecr.aws/l8v1m7k8/aditya/axolotl:latest",
    )
    @gpu_profile(interval=1)
    @card()
I launched it via the command python axolotl_outerbounds.py --environment=fast-bakery run --config /root/sky_workdir/outerbounds/llama_8b.yaml
The process is running fine, as shown in the attached image, but I want to kill it and start another one. So far I have been unable to kill it. Can someone please help?
ancient-application-36103
10/14/2024, 7:25 PM
python flow.py kubernetes kill --help
to kill runaway jobs
stale-vr-93035
10/14/2024, 7:26 PM
flow.py? Do I need to run it on axolotl_outerbounds.py?
ancient-application-36103
10/14/2024, 7:26 PM
ancient-application-36103
10/14/2024, 7:27 PM
Are you on Outerbounds or on a self-hosted stack? If on Outerbounds, you can also use the realtime gpu_profile card, btw.
stale-vr-93035
10/14/2024, 7:30 PM
Outerbounds. Can you please elaborate on how to use the realtime gpu_profile card, with an example?
stale-vr-93035
10/14/2024, 7:31 PM
ancient-application-36103
10/14/2024, 7:32 PM
ancient-application-36103
10/14/2024, 7:33 PM
outerbounds
stale-vr-93035
10/14/2024, 7:33 PM
ancient-application-36103
10/14/2024, 8:06 PM
outerbounds ui
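
For reference, the kubernetes kill suggestion earlier in the thread can be applied to the flow file from the original question rather than to a literal flow.py. Below is a minimal sketch that only builds the command line and does not run anything. The helper name build_kill_command is hypothetical, the run ID is a placeholder, and the --run-id option is an assumption about the Metaflow CLI that should be confirmed against the output of kubernetes kill --help.

```python
import shlex


def build_kill_command(flow_file, run_id=None, python="python"):
    """Build the argv for Metaflow's `kubernetes kill` subcommand.

    flow_file is the script that defines the flow, e.g. axolotl_outerbounds.py.
    run_id maps to a --run-id option (an assumption; confirm it with
    `python <flow_file> kubernetes kill --help`).
    """
    cmd = [python, flow_file, "kubernetes", "kill"]
    if run_id is not None:
        cmd += ["--run-id", str(run_id)]
    return cmd


# Placeholder run ID, for illustration only.
cmd = build_kill_command("axolotl_outerbounds.py", run_id="1234")
print(shlex.join(cmd))
```

The printed line is what would be typed in the shell from the directory that contains axolotl_outerbounds.py. Leaving out run_id produces the bare kill command, whose exact scope depends on the CLI defaults.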