# ask-metaflow
h
Hi, I'm attempting to run a simple Ray training job in Metaflow using the metaflow-ray plugin and I'm getting the following error. It seems like _control_mapper_tasks is defined as part of the ParallelDecorator and should be inherited by the RayDecorator class. Is this a bug or am I doing something wrong? The Ray job that I'm attempting is just this sample, where the training functionality is put into a single step decorated with metaflow_ray:
```
Traceback (most recent call last):
  File "/metaflow/metaflow/cli.py", line 1139, in main
    start(auto_envvar_prefix="METAFLOW", obj=state)
  File "/metaflow/metaflow/tracing/__init__.py", line 27, in wrapper_func
    return func(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/metaflow/metaflow/_vendor/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/metaflow/metaflow/_vendor/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/metaflow/metaflow/_vendor/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/metaflow/metaflow/cli.py", line 469, in step
    task.run_step(
  File "/metaflow/metaflow/task.py", line 702, in run_step
    self._finalize_control_task()
  File "/metaflow/metaflow/task.py", line 349, in _finalize_control_task
    mapper_tasks = self.flow._control_mapper_tasks
  File "/metaflow/metaflow/flowspec.py", line 254, in __getattr__
    raise AttributeError("Flow %s has no attribute '%s'" % (self.name, name))
AttributeError: Flow TorchTrainerGPU has no attribute '_control_mapper_tasks'
```
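For reference, the flow is shaped roughly like this. This is only a minimal sketch of the metaflow-ray sample I'm following: apart from the class name taken from the traceback above, the step names, decorator arguments, and the import path for metaflow_ray are illustrative.

```python
from metaflow import FlowSpec, step, kubernetes, metaflow_ray


class TorchTrainerGPU(FlowSpec):

    @step
    def start(self):
        # num_parallel=1 here is the setup that hits the
        # _control_mapper_tasks error
        self.next(self.train, num_parallel=1)

    @kubernetes(gpu=1)
    @metaflow_ray
    @step
    def train(self):
        import ray

        ray.init()
        # ... torch training logic lives here ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    TorchTrainerGPU()
```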
h
Are you running on Batch or Kubernetes, or just locally?
h
This is running on kubernetes
h
What's the version of metaflow-ray? And the version of Metaflow?
h
I'm using the newest version, 2.12.31, and 0.1.0 for metaflow-ray, but I have tested this on multiple versions of Metaflow and got the same error.
The control pod is able to finish successfully if I set a value for _control_mapper_tasks, like
self._control_mapper_tasks = [node.get("NodeManagerHostname") for node in ray.nodes()]
but then the worker pod remains running and fails after a while because it is no longer connected to the head node.
h
Are you using jobsets in Kubernetes?
h
yep
h
Huh. Let me have a look in a bit. This is new. Never seen this error before.
h
Yeah, that's what I thought. Is there a specific version of jobsets that is supported? I'm using v0.7.0.
h
It should be fine. Would you be able to share your step code, if possible?
Or can you share the full logs?
h
Sure I can share a sample script. Just give me a few minutes
Here is the flow file, and the model.py file that the flow file imports.
h
So this fails after the loss is printed, right?
Can you increase num_parallel to > 1?
h
Yes after the loss is printed it just hangs
Yeah I can change that
h
Currently the requirement around jobsets is that we need num_parallel > 1.
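Concretely, that just means a fan-out greater than one on the transition into the Ray step; a sketch, assuming num_parallel is passed on self.next() as with other @parallel-style steps:

```python
@step
def start(self):
    # jobsets-backed runs currently need a control task plus at least one
    # worker task, i.e. num_parallel >= 2
    self.next(self.train, num_parallel=2)
```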
h
ah interesting
yeah let me try that
h
Additionally, you don't need to add metaflow-ray to the @pypi decorator. Just ensure that it's installed in the Python environment calling the flow.
h
Yeah that was added as an act of desperation to try and figure out the error
h
Because Metaflow manages the packaging of any extensions which are within its ecosystem
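In other words, the step decorators only need to list the step's own dependencies; a sketch, with a placeholder package pin:

```python
# metaflow-ray only needs to be pip-installed in the environment that invokes
# the flow (e.g. where `python flow.py run ...` is executed); Metaflow ships
# the extension to the remote tasks itself. The pin below is a placeholder.
@kubernetes(gpu=1)
@pypi(packages={"torch": "2.4.0"})
@metaflow_ray
@step
def train(self):
    ...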
h
Yep, setting num_parallel > 1 fixes the _control_mapper_tasks issue. Thanks for the help!
h
Thanks for reporting the bug. Let me ship a patch that will fail the task before it sends anything to k8s when num_parallel is set to < 2.
h
@little-apartment-49355 Another quick question: is there any way to get the control pod and the worker pods to run on different instance types? I would like to not waste a valuable GPU node on the control pod if it is only being used for bookkeeping.
h
Currently, whatever node selectors you set get applied to both the control and worker jobs. But the head node's resources are fully available for usage; it's not a bookkeeping node.

There is a note to make here about the execution semantics of metaflow-ray. The way metaflow_ray works is by launching a head node process (in a separate subprocess) on your control task. The separate-subprocess detail is important because the @step code can call ray.init and detect a cluster of which its own node (the control node) is also a part, essentially allowing you to utilize the resources of the head node too. The worker tasks join the cluster via the control node's IP. The control task is where your actual @step code runs. The worker tasks have no @step code running; they just block "Metaflow's step process" because Ray will be launching work on them. Once the control task is done, the workers get a signal to shut down.
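To make that concrete, a rough sketch of what the control task's @step code ends up doing (illustrative only, based on the description above):

```python
@metaflow_ray
@step
def train(self):
    import ray

    # metaflow_ray has already launched the Ray head node in a separate
    # subprocess on this (control) task, and the worker tasks join the
    # cluster via this node's IP.
    ray.init()

    # All tasks -- including this control/head node -- appear as schedulable
    # Ray nodes, so the control task's resources are usable for training too.
    for node in ray.nodes():
        print(node["NodeManagerHostname"], node["Resources"])

    # ... launch Ray work here; once this step body finishes, the worker
    # tasks receive a signal to shut down ...
    self.next(self.join)
```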
h
Ah great, that sounds good. Okay, another issue popped up when running the flow file I sent before: the flow fails at the end step with the following error
```
<flow TorchTrainerGPU step end> failed:
    Flow failed:
    Environment variable 'MF_MASTER_ADDR' is missing.
```
Not sure why this variable would be needed at that step, since it is not using any of the metaflow_ray stuff
h
What is print(self.result.path)?
Oh, I see there is a bug here. Let me ship a patch for this. Can you remove the self.merge_artifacts call for now?
h
Sure I'll try that
h
Some internal @parallel variable is getting piped through.
h
Yep, removing the self.merge_artifacts call solved the problem.
What is the variable that is getting piped in? Just thinking about what I should ignore if I do need to bring inputs in at the join step.
h
You can use the exclude parameter in merge_artifacts. I have an open PR to ship this fix in core, which will get released soon. For now you can do
self.merge_artifacts(exclude=["_parallel_ubf_iter"])
Once the PR is merged, that code block will not be required.
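For reference, the interim workaround looks roughly like this in the step that calls merge_artifacts (step name and shape are illustrative):

```python
@step
def join(self, inputs):
    # `_parallel_ubf_iter` is internal @parallel bookkeeping that currently
    # leaks into the artifact namespace; exclude it until the patch ships.
    self.merge_artifacts(inputs, exclude=["_parallel_ubf_iter"])
    self.next(self.end)
```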