hallowed-soccer-94479
11/26/2024, 3:57 PM
_control_mapper_tasks is defined as part of the ParallelDecorator, which should be inherited by the RayDecorator class. Is this a bug or am I doing something wrong? The Ray job that I'm attempting is just this sample, where the training functionality is put into a single step decorated with metaflow_ray:
Traceback (most recent call last):
  File "/metaflow/metaflow/cli.py", line 1139, in main
    start(auto_envvar_prefix="METAFLOW", obj=state)
  File "/metaflow/metaflow/tracing/__init__.py", line 27, in wrapper_func
    return func(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/metaflow/metaflow/_vendor/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/metaflow/metaflow/_vendor/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/metaflow/metaflow/_vendor/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/metaflow/metaflow/_vendor/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/metaflow/metaflow/cli.py", line 469, in step
    task.run_step(
  File "/metaflow/metaflow/task.py", line 702, in run_step
    self._finalize_control_task()
  File "/metaflow/metaflow/task.py", line 349, in _finalize_control_task
    mapper_tasks = self.flow._control_mapper_tasks
  File "/metaflow/metaflow/flowspec.py", line 254, in __getattr__
    raise AttributeError("Flow %s has no attribute '%s'" % (self.name, name))
AttributeError: Flow TorchTrainerGPU has no attribute '_control_mapper_tasks'
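For context, the shape of flow being described is roughly the following. This is only a sketch assuming the @metaflow_ray decorator from the metaflow-ray extension and Metaflow's num_parallel fan-out; the class name, step names, and resources are illustrative, not the poster's actual code.

from metaflow import FlowSpec, step, kubernetes, metaflow_ray

class TorchTrainerGPU(FlowSpec):

    @step
    def start(self):
        # num_parallel launches one control task plus worker tasks that
        # together form the Ray cluster for the decorated step
        self.next(self.train, num_parallel=2)

    @kubernetes(gpu=1)
    @metaflow_ray
    @step
    def train(self):
        import ray
        ray.init()
        # ... Ray-based training code goes here ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.merge_artifacts(inputs)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TorchTrainerGPU()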
hallowed-soccer-94479
11/26/2024, 5:04 PM
Metaflow 2.12.31, and using 0.1.0 for metaflow-ray, but I have tested this on multiple versions of Metaflow and got the same error.
hallowed-soccer-94479
11/26/2024, 5:06 PM
Setting _control_mapper_tasks manually, like self._control_mapper_tasks = [node.get("NodeManagerHostname") for node in ray.nodes()], gets past the error, but then the worker pod remains running and fails after a while because it is no longer connected to the head node.
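For reference, the workaround described above amounts to setting the attribute by hand inside the parallel step, something like this sketch; it is shown only to make the report concrete, not as a recommended fix.

@metaflow_ray
@step
def train(self):
    import ray
    ray.init()
    # ... training code ...
    # Manually populate the attribute that _finalize_control_task() expects,
    # using the hostnames of the nodes currently in the Ray cluster.
    self._control_mapper_tasks = [
        node.get("NodeManagerHostname") for node in ray.nodes()
    ]
    self.next(self.join)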
hallowed-soccer-94479
11/26/2024, 5:09 PM
v0.7.0
hallowed-soccer-94479
11/26/2024, 5:35 PM
the _control_mapper_tasks issue
hallowed-glass-14538
11/26/2024, 6:58 PM
metaflow-ray runs the Ray head node process (in a separate subprocess) on your control task. The separate-subprocess detail is important here because the @step code can call ray.init and detect a cluster where its own node (the control node) is also part of the cluster, essentially allowing you to utilize the resources of the head node too.
The worker tasks join the cluster via the control node's IP. The control task is where your actual @step code runs. The worker tasks have no @step code running; they just block Metaflow's step process because Ray will be launching stuff on it. Once the control task is done, the workers get a signal to shut down.
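Concretely, inside the @metaflow_ray step on the control task, the step code simply attaches to the cluster that is already running and can see the head node plus all workers. A small illustrative sketch, assuming the flow structure above:

@metaflow_ray
@step
def train(self):
    import ray
    # The head node process is already running in a subprocess of this
    # control task, so ray.init() detects and joins the existing cluster
    # instead of starting a fresh local one.
    ray.init()
    # One entry per node: the control/head node and each worker task.
    for node in ray.nodes():
        print(node.get("NodeManagerHostname"), node.get("Alive"))
    self.next(self.join)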
hallowed-soccer-94479
11/26/2024, 7:01 PM
The flow now fails at the end step with the following error:
<flow TorchTrainerGPU step end> failed:
    Flow failed:
    Environment variable 'MF_MASTER_ADDR' is missing.
Not sure why this variable would be needed at that step, since it is not using any of the metaflow_ray stuff.
hallowed-glass-14538
11/26/2024, 7:05 PM
print(self.result.path)?
hallowed-glass-14538
11/26/2024, 7:07 PM
self.merge_artifacts
hallowed-glass-14538
11/26/2024, 7:07 PM
The @parallel variable is getting piped through.
hallowed-soccer-94479
11/26/2024, 7:23 PM
Removing self.merge_artifacts solved the problem.
hallowed-glass-14538
11/26/2024, 7:25 PMexclude
parameter in merge_artifacts
.
I have a open PR to ship this in the core that will get released soon. You can do self.merge_artifacts(exclude=["_parallel_ubf_iter"])
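In flow terms, that makes the join step look roughly like this; the step names are illustrative and follow the sketch above.

@step
def join(self, inputs):
    # Exclude the @parallel bookkeeping artifact so it isn't carried into
    # downstream, non-parallel steps (which is what appeared to trigger the
    # MF_MASTER_ADDR error at the end step above).
    self.merge_artifacts(inputs, exclude=["_parallel_ubf_iter"])
    self.next(self.end)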