kind-horse-40048 (04/03/2025, 10:09 PM):
With the hi-torchrun.py script copied exactly from the repo (except for commenting out the kubernetes import and decorator), I get the following traceback with Python 3.10, metaflow==2.15.7, metaflow-torchrun==0.1.1, and an otherwise fresh virtual environment.
$ python hello_parallel.py run
Metaflow 2.15.7+<unk>(<unk>) executing HelloTorchrun for user:peterw
Validating your flow...
The graph looks good!
Running pylint...
Pylint not found, so extra checks are disabled.
Creating local datastore in current directory (/scratch/test_parallel/.metaflow)
2025-04-03 15:01:34.703 Workflow starting (run-id 12188):
2025-04-03 15:01:35.308 [12188/start/636332 (pid 17942)] Task is starting.
2025-04-03 15:01:36.921 [12188/start/636332 (pid 17942)] Task finished successfully.
2025-04-03 15:01:37.135 [12188/torch_multinode/636333 (pid 17967)] Task is starting.
2025-04-03 15:01:37.895 [12188/torch_multinode/636333 (pid 17967)] <flow HelloTorchrun step torch_multinode[0] (input: 0)> failed:
2025-04-03 15:01:38.248 [12188/torch_multinode/636333 (pid 17967)] Internal error
2025-04-03 15:01:38.249 [12188/torch_multinode/636333 (pid 17967)] Traceback (most recent call last):
2025-04-03 15:01:38.249 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/cli.py", line 611, in main
2025-04-03 15:01:38.249 [12188/torch_multinode/636333 (pid 17967)] start(auto_envvar_prefix="METAFLOW", obj=state)
2025-04-03 15:01:38.250 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
2025-04-03 15:01:38.250 [12188/torch_multinode/636333 (pid 17967)] return self.main(*args, **kwargs)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 782, in main
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] rv = self.invoke(ctx)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/cli_components/utils.py", line 69, in invoke
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return ctx.invoke(self.callback, **ctx.params)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return callback(*args, **kwargs)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/tracing/__init__.py", line 27, in wrapper_func
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return func(*args, **kwargs)
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/decorators.py", line 21, in new_func
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] return f(get_current_context(), *args, **kwargs)
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/cli_components/step_cmd.py", line 167, in step
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] task.run_step(
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/task.py", line 660, in run_step
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] step_func = deco.task_decorate(
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/plugins/parallel_decorator.py", line 156, in task_decorate
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] ",".join(self.input_paths),
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] AttributeError: 'TorchrunDecoratorParallel' object has no attribute 'input_paths'
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)]
2025-04-03 15:01:38.364 [12188/torch_multinode/636333 (pid 17967)] Task failed.
2025-04-03 15:01:38.417 Workflow failed.
2025-04-03 15:01:38.417 Terminating 0 active tasks...
2025-04-03 15:01:38.417 Flushing logs...
Step failure:
Step torch_multinode (task-id 636333) failed.
Does the step need to be decorated with @kubernetes or @batch for the example to work? Is there some other issue? I've tried downgrading metaflow and metaflow-torchrun to various versions without any luck. Thanks for any help!

hallowed-glass-14538 (04/03/2025, 10:18 PM):

hallowed-glass-14538 (04/03/2025, 10:18 PM):
@batch

hallowed-glass-14538 (04/03/2025, 10:20 PM):

kind-horse-40048 (04/03/2025, 10:28 PM):

hallowed-glass-14538 (04/03/2025, 10:38 PM):
All nodes call current.torch.run in their code. This ensures that all nodes end up calling the torchrun command, and the training process is fully managed by torchrun.
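To illustrate the point above, here is a hedged sketch of the kind of command each node ends up executing when current.torch.run is called. The helper name and argument handling below are illustrative, not metaflow-torchrun's actual internals; the flags themselves are standard torchrun flags, and the address/port values are made up.

```python
# Illustrative sketch (not metaflow-torchrun internals): every node builds a
# torchrun command that points at the same rendezvous master, so torchrun
# itself coordinates the distributed process group across nodes.
def build_torchrun_cmd(entrypoint, nnodes, node_rank, master_addr,
                       master_port, nproc_per_node=1):
    """Assemble a torchrun command line for one node (hypothetical helper)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        f"--nproc_per_node={nproc_per_node}",
        entrypoint,
    ]

# Node 0 of a two-node job; address and port are placeholder examples.
cmd = build_torchrun_cmd("hi-torchrun.py", nnodes=2, node_rank=0,
                         master_addr="10.0.0.1", master_port=3339)
print(" ".join(cmd))
```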
kind-horse-40048 (04/04/2025, 12:37 AM):

kind-horse-40048 (04/04/2025, 2:32 AM):
Same result even with nproc_per_node. Here is my flow:
from metaflow import (
    FlowSpec,
    Parameter,
    batch,
    current,
    step,
    torchrun,
)


class HelloTorchrun(FlowSpec):
    num_nodes = Parameter(
        name="num_nodes", default=2, type=int, required=True, help="num_nodes"
    )

    @step
    def start(self):
        self.next(self.torch_multinode, num_parallel=self.num_nodes)

    @batch
    @torchrun
    @step
    def torch_multinode(self):
        current.torch.run(entrypoint="hi-torchrun.py", nproc_per_node=1)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloTorchrun()
and hi-torchrun.py is copied directly from the repo. I'm running this on Batch with an image that has the latest metaflow and metaflow-torchrun. I added nproc_per_node=1 to the example from the repo, just to check.
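For reference, here is a minimal stand-in in the spirit of hi-torchrun.py (the real script lives in the metaflow-torchrun repo; this hypothetical version only reads the environment variables torchrun sets for each worker process, so it needs no GPU and no torch import):

```python
# Hypothetical stand-in for hi-torchrun.py: report the per-process environment
# variables that torchrun injects into every worker it launches.
import os

def report_rank() -> dict:
    """Collect the standard torchrun-provided environment variables."""
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    return {k: os.environ.get(k, "<unset>") for k in keys}

if __name__ == "__main__":
    info = report_rank()
    print(f"hello from rank {info['RANK']} of {info['WORLD_SIZE']}")
```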
Here is the error I'm getting:
Code package downloaded.
Task is starting.
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 676, in determine_local_world_size
return int(nproc_per_node)
^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'None'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 891, in run
config, cmd, cmd_args = config_from_args(args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 776, in config_from_args
nproc_per_node = determine_local_world_size(args.nproc_per_node)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 703, in determine_local_world_size
raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}") from e
ValueError: Unsupported nproc_per_node value: None
<flow HelloTorchrun step torch_multinode[0] (input: 0)> failed:
    The torchrun command failed to complete. This could be a transient error. Use @retry to retry.
    The `torchrun` command running on node 0 has crashed.
    [stderr]: Process exited with errors (see above for details)
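The ValueError above happens because torchrun receives the literal string "None" for nproc_per_node. A simplified replica of torch's parsing logic shows why (the "cpu"/"gpu"/"auto" keywords are real torchrun behavior; the num_cpus default here is a placeholder for the detected core count):

```python
# Simplified replica of torch.distributed.run.determine_local_world_size, to
# show why the string "None" is rejected. Not the actual torch source; the
# keyword handling mirrors documented torchrun behavior.
def parse_nproc_per_node(value: str, num_cpus: int = 8) -> int:
    try:
        return int(value)  # a plain integer string like "1" is accepted
    except ValueError:
        if value == "cpu":
            return num_cpus
        if value in ("gpu", "auto"):
            return num_cpus  # real torchrun prefers the GPU count here
        # anything else, including the literal string "None", is rejected
        raise ValueError(f"Unsupported nproc_per_node value: {value}")

print(parse_nproc_per_node("2"))
print(parse_nproc_per_node("auto"))
```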
Looking through the metaflow-torchrun code, I don't see any place nproc_per_node might be specified as None. Any suggestions?

hallowed-glass-14538 (04/04/2025, 2:36 AM):

kind-horse-40048 (04/04/2025, 2:36 AM):

kind-horse-40048 (04/04/2025, 2:49 AM):

hallowed-glass-14538 (04/04/2025, 2:50 AM):
hallowed-glass-14538 (04/04/2025, 3:11 AM):
Changed the @batch decorator to this:

@batch(
    image="registry.hub.docker.com/pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime",
    cpu=2,
)

Just wanted to validate reproducibility on your side since I wasn't able to reproduce.

kind-horse-40048 (04/04/2025, 1:42 PM):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| timestamp | message |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1743773792903 | Setting up task environment. |
| 1743773793225 | WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773793736 | WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773794748 | WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773796760 | WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773800773 | WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773800790 | ERROR: Could not find a version that satisfies the requirement boto3 (from versions: none) |
| 1743773800790 | ERROR: No matching distribution found for boto3 |
| 1743773800839 | bash: line 1: [: -le: unary operator expected |
| 1743773800839 | bash: line 1: [: -gt: unary operator expected |
| 1743773800840 | tar: job.tar: Cannot open: No such file or directory |
| 1743773800840 | tar: Error is not recoverable: exiting now |
| 1743773800855 | /opt/conda/bin/python: Error while finding module specification for 'metaflow.mflog.save_logs' (ModuleNotFoundError: No module named 'metaflow') |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Does the image not have to have metaflow installed? I forget whether metaflow is packaged up in the code package.

kind-horse-40048 (04/04/2025, 3:56 PM):
An image with metaflow and metaflow-torchrun installed got the Batch job to run, but left me with
[W404 15:54:20.325384868 socket.cpp:697] [c10d] The client socket has failed to connect to [<redacted hostname>]:3339 (errno: 110 - Connection timed out).
I suspect this is a networking issue I'll need to work out with my team. If you have any suggestions on where to look first, I'd appreciate it, but either way thank you for your help!
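For a connection timeout like the c10d error above, a quick first check is whether the worker can open a plain TCP connection to the master's rendezvous port at all. MASTER_ADDR/MASTER_PORT are the environment variables torchrun uses; the fallback values below are placeholders:

```python
# Hedged debugging sketch: test raw TCP reachability of the rendezvous port.
# If this fails between Batch nodes, the security group / network config is
# the likely culprit rather than torch itself.
import os
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    master = os.environ.get("MASTER_ADDR", "127.0.0.1")  # placeholder default
    port = int(os.environ.get("MASTER_PORT", "3339"))    # port from the log above
    print(f"{master}:{port} reachable: {can_reach(master, port)}")
```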
First thing I'll try: updating torch, per here.

hallowed-glass-14538 (04/04/2025, 4:03 PM):

kind-horse-40048 (04/04/2025, 4:05 PM):

kind-horse-40048 (04/04/2025, 7:48 PM):