Has anybody experienced this metaflow-ray error du...
# ask-metaflow
i
Has anybody experienced this metaflow-ray error during training? It's entirely possible I am running out of memory and this is a side-effect but I'm not sure
File "/metaflow/metaflow_extensions/ray/plugins/status_notifier.py", line 106, in wait_for_task_completion
    if time.time() - status.timestamp > heartbeat_timeout:
       ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
Running:
• metaflow==2.13
• metaflow-ray==0.1.3
s
@hallowed-glass-14538 any thoughts here?
i
Adding `@retry` did eventually work on the last retry.
h
Hey! Can you share more context on when this starts happening?
s
Reading through the code, `status.timestamp` can be `None`.
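To make that failure mode concrete, here is a minimal sketch of a heartbeat check that tolerates a missing timestamp. It is not the metaflow-ray source or the actual patch: `TaskStatus` and `heartbeat_expired` are hypothetical names, and only the comparison itself is taken from the traceback above.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskStatus:
    # Hypothetical stand-in for the status object seen in the traceback.
    state: str
    timestamp: Optional[float]  # None when no heartbeat could be read


def heartbeat_expired(
    status: TaskStatus, heartbeat_timeout: float, now: Optional[float] = None
) -> bool:
    """Return True only when a known heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    if status.timestamp is None:
        # Without this guard, `now - None` raises the TypeError from the report.
        # The heartbeat age is unknown here, so it is not treated as expired.
        return False
    return now - status.timestamp > heartbeat_timeout
```

A caller would still need to decide how long a task may stay without a timestamp before giving up; this sketch only avoids the crash.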
i
It is happening sporadically while training. Sometimes it happens when my data is downloading, sometimes it happens in the middle of a training epoch.
h
A few more questions for context:
• Are you running this on K8s / aws-batch?
• What is the Metaflow datastore you are running this on?
• What is the size of `num_parallel` you are running?
i
Running on k8s, on GCP so using the “gs” datastore, and num_parallel==3
h
Lemme take a look and reach out! Thanks for the bug report.
Hey! So we have cut a patch release (0.1.4) that handles the failure case for this condition gracefully. The patch also adds an environment variable for debugging what's happening behind the scenes. You can set `@environment(vars={"METAFLOW_RAY_DEBUG_MODE":"true"})` on the `@step` that has `@metaflow_ray`, and this will make it publish debug logs to stderr for the worker tasks. Can you try it out and share the logs if you face this issue again?
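For anyone landing on this thread, a rough sketch of how those decorators might be stacked on one step. It assumes metaflow and metaflow-ray 0.1.4 are installed; the flow name, step names, and retry count are made up for illustration, and `num_parallel=3` just mirrors the setup reported above.

```python
from metaflow import FlowSpec, step, environment, retry, metaflow_ray


class RayTrainFlow(FlowSpec):  # hypothetical flow name
    @step
    def start(self):
        # Fan out into three parallel tasks, as in the reported setup.
        self.next(self.train, num_parallel=3)

    @retry(times=3)  # illustrative retry count, not a recommendation
    @environment(vars={"METAFLOW_RAY_DEBUG_MODE": "true"})
    @metaflow_ray
    @step
    def train(self):
        # Ray training code would go here.
        self.next(self.join)

    @step
    def join(self, inputs):
        # Parallel steps need a join step before the flow can continue.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RayTrainFlow()
```

The debug variable is set only on the step that carries `@metaflow_ray`, so the extra stderr logging should stay limited to that step's tasks.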
i
Upgraded to `metaflow-ray==0.1.4` and that fixed my issue! I'm seeing the fix in the logs too.
[@metaflow_ray] Task 0 status: running with timestamp 1739281146.9187553
08:39:12 [@metaflow_ray] Task 0 status: unreachable with timestamp None
08:39:12 [@metaflow_ray] Task 0 unreachable
08:39:12 [@metaflow_ray] Task 0 still unreachable after 0.0 seconds
08:39:13 [@metaflow_ray] Task 0 status: running with timestamp 1739281152.0030744
08:39:14 [@metaflow_ray] Task 0 status: running with timestamp 1739281152.0030744
Thanks for the quick support!