# ask-metaflow
h
Hi, I'm attempting to run a simple Ray training job in Metaflow using the metaflow-ray plugin and I'm getting the following error. It seems like _control_mapper_tasks is defined as part of the ParallelDecorator and should be inherited by the RayDecorator class. Is this a bug or am I doing something wrong? The Ray job that I'm attempting is just this sample, where the training functionality is put into a single step decorated with metaflow_ray:
```
Traceback (most recent call last):
  File "/metaflow/metaflow/cli.py", line 1139, in main
    start(auto_envvar_prefix="METAFLOW", obj=state)
  File "/metaflow/metaflow/tracing/__init__.py", line 27, in wrapper_func
    return func(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/metaflow/metaflow/_vendor/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/metaflow/metaflow/_vendor/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/metaflow/metaflow/_vendor/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/metaflow/metaflow/cli.py", line 469, in step
    task.run_step(
  File "/metaflow/metaflow/task.py", line 702, in run_step
    self._finalize_control_task()
  File "/metaflow/metaflow/task.py", line 349, in _finalize_control_task
    mapper_tasks = self.flow._control_mapper_tasks
  File "/metaflow/metaflow/flowspec.py", line 254, in __getattr__
    raise AttributeError("Flow %s has no attribute '%s'" % (self.name, name))
AttributeError: Flow TorchTrainerGPU has no attribute '_control_mapper_tasks'
```
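For reference, the flow is shaped roughly like this. This is only a minimal sketch of the metaflow-ray sample I'm following: apart from the class name taken from the traceback above, the step names, decorator arguments, and the import path for metaflow_ray are illustrative.

```python
from metaflow import FlowSpec, step, kubernetes, metaflow_ray


class TorchTrainerGPU(FlowSpec):

    @step
    def start(self):
        # num_parallel=1 here is the setup that hits the
        # _control_mapper_tasks error
        self.next(self.train, num_parallel=1)

    @kubernetes(gpu=1)
    @metaflow_ray
    @step
    def train(self):
        import ray

        ray.init()
        # ... torch training logic lives here ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TorchTrainerGPU()
```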
h
Are you running on Batch or Kubernetes, or just locally?
h
This is running on kubernetes
h
What's the version of metaflow-ray? And the version of Metaflow?
h
I'm using the newest version, 2.12.31, and 0.1.0 for metaflow-ray, but I have tested this on multiple versions of Metaflow and got the same error.
The control pod is able to finish successfully if I set a value for _control_mapper_tasks, like
self._control_mapper_tasks = [node.get("NodeManagerHostname") for node in ray.nodes()]
but then the worker pod remains running and fails after a while because it is no longer connected to the head node.
h
Are you using jobsets in Kubernetes?
h
yep
h
Huh. Let me have a look in a bit. This is new. Never seen this error before.
h
Yeah, that's what I thought. Is there a specific version of jobsets that is supported? I'm using v0.7.0.
h
It should be fine. Would you be able to share your step code, if possible?
Or can you share the full logs?
h
Sure I can share a sample script. Just give me a few minutes
Here is the flow file, and the model.py file that the flow file imports.
h
So this fails after the loss is printed, right?
Can you increase num_parallel to > 1?
h
Yes after the loss is printed it just hangs
Yeah I can change that
h
Currently the requirement around jobsets is that we need num_parallel > 1.
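Concretely, that just means a fan-out greater than one on the transition into the Ray step; a sketch, assuming num_parallel is passed on self.next() as with other @parallel-style steps:

```python
@step
def start(self):
    # jobsets-backed runs currently need a control task plus at least one
    # worker task, i.e. num_parallel >= 2
    self.next(self.train, num_parallel=2)
```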
h
ah interesting
yeah let me try that
h
Additionally, you don't need to add metaflow-ray to the @pypi decorator. Just ensure that it's installed in the Python environment calling the flow.
h
Yeah that was added as an act of desperation to try and figure out the error
h
Because Metaflow manages the packaging of any extensions which are within its ecosystem
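In other words, the step decorators only need to list the step's own dependencies; a sketch, with a placeholder package pin:

```python
# metaflow-ray only needs to be pip-installed in the environment that invokes
# the flow (e.g. where `python flow.py run ...` is executed); Metaflow ships
# the extension to the remote tasks itself. The pin below is a placeholder.
@kubernetes(gpu=1)
@pypi(packages={"torch": "2.4.0"})
@metaflow_ray
@step
def train(self):
    ...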
h
Yep, setting num_parallel > 1 fixes the _control_mapper_tasks issue. Thanks for the help!
h
Thanks for reporting the bug. Let me ship a patch that will fail the task before it sends anything to k8s when num_parallel is set to < 2.
h
@little-apartment-49355 Another quick question: is there any way to get the control pod and the worker pods to run on different instance types? I would like to not waste a valuable GPU node on the control pod if it is only being used for bookkeeping.
h
Currently, whatever node selectors you set get applied to both the control and worker jobs. But the head node's resources are fully available for usage; it's not a bookkeeping node.

There is a note to make here about the execution semantics of metaflow-ray. The way metaflow_ray works is by launching a head node process (in a separate subprocess) on your control task. The separate-subprocess detail is important because the @step code can call ray.init and detect a cluster of which its own node (the control node) is also a part, essentially allowing you to utilize the resources of the head node too. The worker tasks join the cluster via the control node's IP. The control task is where your actual @step code runs. The worker tasks have no @step code running; they just block "Metaflow's step process" because Ray will be launching work on them. Once the control task is done, the workers get a signal to shut down.
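To make that concrete, a rough sketch of what the control task's @step code ends up doing (illustrative only, based on the description above):

```python
@metaflow_ray
@step
def train(self):
    import ray

    # metaflow_ray has already launched the Ray head node in a separate
    # subprocess on this (control) task, and the worker tasks join the
    # cluster via this node's IP.
    ray.init()

    # All tasks -- including this control/head node -- appear as schedulable
    # Ray nodes, so the control task's resources are usable for training too.
    for node in ray.nodes():
        print(node["NodeManagerHostname"], node["Resources"])

    # ... launch Ray work here; once this step body finishes, the worker
    # tasks receive a signal to shut down ...
    self.next(self.join)
```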
h
Ah great, that sounds good. Okay, another issue popped up when running the flow file I sent before: the flow fails at the end step with the following error
```
<flow TorchTrainerGPU step end> failed:
    Flow failed:
    Environment variable 'MF_MASTER_ADDR' is missing.
```
Not sure why this variable would be needed at that step, since it is not using any of the metaflow_ray stuff
h
What is print(self.result.path)?
Oh, I see there is a bug here. Let me ship a patch for this. Can you remove the self.merge_artifacts call for now?
h
Sure I'll try that
h
Some internal @parallel variable is getting piped through.
h
Yep, removing the self.merge_artifacts call solved the problem.
What is the variable that is getting piped in? Just thinking about what I should ignore if I do need to bring inputs in at the join step.
h
You can use the exclude parameter in merge_artifacts. I have an open PR to ship this fix in core, which will get released soon. For now you can do
self.merge_artifacts(exclude=["_parallel_ubf_iter"])
Once the PR is merged, that code block will not be required.
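For reference, the interim workaround looks roughly like this in the step that calls merge_artifacts (step name and shape are illustrative):

```python
@step
def join(self, inputs):
    # `_parallel_ubf_iter` is internal @parallel bookkeeping that currently
    # leaks into the artifact namespace; exclude it until the patch ships.
    self.merge_artifacts(inputs, exclude=["_parallel_ubf_iter"])
    self.next(self.end)
```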