# ask-metaflow
k
Hi Outerbounds, I'm having trouble running the hello-world example from metaflow-torchrun on my local machine. With the flow and `hi-torchrun.py` script copied exactly from the repo (except for commenting out the `kubernetes` import and decoration), I get the following traceback with Python 3.10, `metaflow==2.15.7`, `metaflow-torchrun==0.1.1`, and an otherwise fresh virtual environment.
```
$ python hello_parallel.py run
Metaflow 2.15.7+<unk>(<unk>) executing HelloTorchrun for user:peterw
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
Creating local datastore in current directory (/scratch/test_parallel/.metaflow)
2025-04-03 15:01:34.703 Workflow starting (run-id 12188):
2025-04-03 15:01:35.308 [12188/start/636332 (pid 17942)] Task is starting.
2025-04-03 15:01:36.921 [12188/start/636332 (pid 17942)] Task finished successfully.
2025-04-03 15:01:37.135 [12188/torch_multinode/636333 (pid 17967)] Task is starting.
2025-04-03 15:01:37.895 [12188/torch_multinode/636333 (pid 17967)] <flow HelloTorchrun step torch_multinode[0] (input: 0)> failed:
2025-04-03 15:01:38.248 [12188/torch_multinode/636333 (pid 17967)] Internal error
2025-04-03 15:01:38.249 [12188/torch_multinode/636333 (pid 17967)] Traceback (most recent call last):
2025-04-03 15:01:38.249 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/cli.py", line 611, in main
2025-04-03 15:01:38.249 [12188/torch_multinode/636333 (pid 17967)] start(auto_envvar_prefix="METAFLOW", obj=state)
2025-04-03 15:01:38.250 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
2025-04-03 15:01:38.250 [12188/torch_multinode/636333 (pid 17967)] return self.main(*args, **kwargs)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 782, in main
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] rv = self.invoke(ctx)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/cli_components/utils.py", line 69, in invoke
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return ctx.invoke(self.callback, **ctx.params)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return callback(*args, **kwargs)
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/tracing/__init__.py", line 27, in wrapper_func
2025-04-03 15:01:38.334 [12188/torch_multinode/636333 (pid 17967)] return func(*args, **kwargs)
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/_vendor/click/decorators.py", line 21, in new_func
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] return f(get_current_context(), *args, **kwargs)
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/cli_components/step_cmd.py", line 167, in step
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] task.run_step(
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/task.py", line 660, in run_step
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] step_func = deco.task_decorate(
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] File "/scratch/test_parallel/venv/lib/python3.10/site-packages/metaflow/plugins/parallel_decorator.py", line 156, in task_decorate
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] ",".join(self.input_paths),
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)] AttributeError: 'TorchrunDecoratorParallel' object has no attribute 'input_paths'
2025-04-03 15:01:38.335 [12188/torch_multinode/636333 (pid 17967)]
2025-04-03 15:01:38.364 [12188/torch_multinode/636333 (pid 17967)] Task failed.
2025-04-03 15:01:38.417 Workflow failed.
2025-04-03 15:01:38.417 Terminating 0 active tasks...
2025-04-03 15:01:38.417 Flushing logs...
    Step failure:
    Step torch_multinode (task-id 636333) failed.
```
Does the step need to be decorated with `@kubernetes` or `@batch` for the example to work? Is there some other issue? I've tried downgrading `metaflow` and `metaflow-torchrun` to various versions without any luck. Thanks for any help!
👀 1
h
hey! So if you are running torchrun on a single machine locally (with multiple GPUs), then you don't need the @torchrun decorator. You can check out this example of launching torchrun scripts on a single machine with multiple GPUs: https://github.com/outerbounds/metaflow-torchrun/tree/main/examples/single-node-multigpu
The @torchrun decorator is generally for when you want to spawn multiple nodes with `@kubernetes` or `@batch`. But for single-node torchrun executions we have a simple abstraction that provides syntactic sugar for torchrun calls.
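For illustration, here's a minimal sketch of that single-node pattern (this isn't the repo example verbatim; the script name and worker count are placeholders). It's just an ordinary step shelling out to torchrun and letting torchrun manage its own worker processes:
```python
import subprocess

from metaflow import FlowSpec, step


class SingleNodeTorchrun(FlowSpec):
    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        # No @torchrun/@parallel needed on a single machine: torchrun
        # itself spawns one worker process per GPU and manages them.
        # "hi-torchrun.py" and nproc_per_node=2 are placeholders.
        subprocess.run(
            ["torchrun", "--standalone", "--nproc_per_node=2", "hi-torchrun.py"],
            check=True,
        )
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    SingleNodeTorchrun()
```
Since torchrun owns the worker processes here, the step remains a single Metaflow task and there is nothing for the @parallel subprocess machinery to conflict with.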
k
Ah, I guess I assumed that this example would work locally just like a step wrapped in the @resources decorator might: that it would run whether or not a multi-node setup was actually available. I'll try it with the intended resources. Thanks for your response.
h
@torchrun is built on top of the @parallel decorator. When we run anything using @parallel locally, Metaflow will try to create multiple copies of your step function in different subprocesses; the remote behavior is replicated in the same way by creating multiple jobs at the same time. This style of creating multiple subprocesses locally can even influence the runtime behavior of libraries like torchrun, since torchrun is itself trying to spawn subprocesses and wants to manage them itself. So we encourage users to just spawn a single task for multi-GPU, single-node training and let torchrun manage the subprocesses on its own. For the remote case, users might need multiple nodes of specific sizes with network/connectivity set up between the nodes, and then want to call torchrun on all of the nodes. Here Metaflow can take over most of the complexities of job setup/connectivity, and users can just call `current.torch.run` in their code. This ensures that all nodes end up calling the torchrun command and the training process is fully managed by torchrun.
k
Okay, thanks for the context. My intention is to have a single flow that I can run locally, on Batch with a single node (and sometimes multiple GPUs), or with torchrun + Batch when I want a distributed setup. So it sounds like I'll need to condition various things on environment variables at deploy time. I'm using Lightning to manage the torch training logic, and it supposedly handles torch.distributed in stride, so I'm hopeful that I can just execute my training script with subprocess rather than `current.torch.run` when I don't need distribution across nodes.
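As a rough sketch of that kind of deploy-time switch (the env var name and script path are placeholders, not anything from the metaflow-torchrun API; in the distributed case the step would still need @torchrun/@batch applied and num_parallel set):
```python
import os
import subprocess

from metaflow import current

# Hypothetical deploy-time switch; the variable name is a placeholder.
DISTRIBUTED = os.environ.get("TRAIN_DISTRIBUTED", "0") == "1"


def launch_training():
    """Call this from inside the training step."""
    if DISTRIBUTED:
        # Multi-node: let Metaflow's torchrun integration invoke the
        # script on every node (requires @torchrun on the step).
        current.torch.run(entrypoint="train.py", nproc_per_node=1)
    else:
        # Single node: run the script directly; Lightning (or torchrun
        # itself) manages any local worker processes.
        subprocess.run(["python", "train.py"], check=True)
```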
I'm still trying to get the hello-world example to work and I'm getting an error about `nproc_per_node`. Here is my flow:
```
from metaflow import (
    FlowSpec,
    Parameter,
    batch,
    current,
    step,
    torchrun,
)


class HelloTorchrun(FlowSpec):
    num_nodes = Parameter(
        name="num_nodes", default=2, type=int, required=True, help="num_nodes"
    )

    @step
    def start(self):
        self.next(self.torch_multinode, num_parallel=self.num_nodes)

    @batch
    @torchrun
    @step
    def torch_multinode(self):
        current.torch.run(entrypoint="hi-torchrun.py", nproc_per_node=1)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloTorchrun()
```
and `hi-torchrun.py` is copied directly from the repo. I'm running this on Batch with an image that has the latest `metaflow` and `metaflow-torchrun`. I added `nproc_per_node=1` to the example from the repo, just to check. Here is the error I'm getting:
```
Code package downloaded.
 Task is starting.
 Traceback (most recent call last):
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 676, in determine_local_world_size
     return int(nproc_per_node)
            ^^^^^^^^^^^^^^^^^^^
    ValueError: invalid literal for int() with base 10: 'None'

 The above exception was the direct cause of the following exception:
 Traceback (most recent call last):
   File "<frozen runpy>", line 198, in _run_module_as_main
   File "<frozen runpy>", line 88, in _run_code
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 905, in <module>
     main()
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
     return f(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
     run(args)
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 891, in run
     config, cmd, cmd_args = config_from_args(args)
                             ^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 776, in config_from_args
     nproc_per_node = determine_local_world_size(args.nproc_per_node)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 703, in determine_local_world_size
     raise ValueError(f"Unsupported nproc_per_node value: {nproc_per_node}") from e
 ValueError: Unsupported nproc_per_node value: None
 <flow HelloTorchrun step torch_multinode[0] (input: 0)> failed:
     The torchrun command failed to complete.
     The `torchrun` command running on node 0 has crashed.
     [stderr]: Process exited with errors (see above for details)
     This could be a transient error. Use @retry to retry.
```
Looking through the `metaflow-torchrun` code I don't see any place `nproc_per_node` might be specified as `None`. Any suggestions?
h
what's the version of torch?
👀 1
k
2.4.1
Somewhere it's overriding a default of "1", specified here.
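For reference, a simplified paraphrase of what torch's `determine_local_world_size` in `torch/distributed/run.py` does with that flag (not the exact code; the real function also logs its decision). It shows why a literal string `"None"` falls through `int()` and triggers exactly the error above:
```python
import os

import torch


def determine_local_world_size(nproc_per_node: str) -> int:
    try:
        # int("None") raises ValueError, producing the traceback above.
        return int(nproc_per_node)
    except ValueError as e:
        if nproc_per_node == "cpu":
            return os.cpu_count() or 1
        if nproc_per_node == "gpu" and torch.cuda.is_available():
            return torch.cuda.device_count()
        if nproc_per_node == "auto":
            if torch.cuda.is_available():
                return torch.cuda.device_count()
            return os.cpu_count() or 1
        raise ValueError(
            f"Unsupported nproc_per_node value: {nproc_per_node}"
        ) from e
```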
h
ya I noticed. I'm testing it out right now; let me see if we have a bug in the package, and if so I'll ship a fix.
🙏 1
Can you try setting the `@batch` decorator to this:
```
@batch(
    image="registry.hub.docker.com/pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime",
    cpu=2,
)
```
Just wanted to check whether it still reproduces on your side, since I wasn't able to reproduce it.
k
Were you able to run it with that image? I get this handful of errors when running with it:
```
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                                                         message                                                                                                                          |
|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1743773792903 | Setting up task environment.                                                                                                                                                                                                                             |
| 1743773793225 | WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773793736 | WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773794748 | WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773796760 | WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773800773 | WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)'))': /simple/boto3/ |
| 1743773800790 | ERROR: Could not find a version that satisfies the requirement boto3 (from versions: none)                                                                                                                                                               |
| 1743773800790 | ERROR: No matching distribution found for boto3                                                                                                                                                                                                          |
| 1743773800839 | bash: line 1: [: -le: unary operator expected                                                                                                                                                                                                            |
| 1743773800839 | bash: line 1: [: -gt: unary operator expected                                                                                                                                                                                                            |
| 1743773800840 | tar: job.tar: Cannot open: No such file or directory                                                                                                                                                                                                     |
| 1743773800840 | tar: Error is not recoverable: exiting now                                                                                                                                                                                                               |
| 1743773800855 | /opt/conda/bin/python: Error while finding module specification for 'metaflow.mflog.save_logs' (ModuleNotFoundError: No module named 'metaflow')                                                                                                         |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
Does the image not have to have metaflow installed? I forget whether metaflow is packaged up in the code package.
Running the example on a modified version of that image with `metaflow` and `metaflow-torchrun` installed got the Batch job to run but left me with
```
[W404 15:54:20.325384868 socket.cpp:697] [c10d] The client socket has failed to connect to [<redacted hostname>]:3339 (errno: 110 - Connection timed out).
```
I suspect this is a networking issue I'll need to work out with my team. If you have any suggestions on where to look first, I'd appreciate it, but either way, thank you for your help! First thing I'll try: updating torch, per here.
h
Some of these issues may happen if the Batch array nodes are unable to communicate with each other. One more thing: you may not need metaflow and metaflow-torchrun in the image, but you might need boto3 installed in it. From the earlier error it seems like Metaflow was unable to install boto3 because of a network issue or because the requirement could not be found.
k
I see, that makes sense. I am running this in an environment without access to the standard PyPI servers, so if it was trying to look there, it makes sense that it failed. Maybe the fix I made with my custom image was just pointing pip to a different PyPI server.
I've had success after some changes to the security group associated with my Batch instances. Thanks for your help!