melodic-train-1526
01/18/2022, 5:46 PM
square-wire-39606
01/18/2022, 5:47 PM
square-wire-39606
01/18/2022, 5:51 PM
melodic-train-1526
01/18/2022, 5:57 PM
ancient-application-36103
01/18/2022, 5:57 PM
ancient-application-36103
01/18/2022, 5:58 PM
2022-01-18 09:55:35.269 Workflow starting (run-id 5866):
2022-01-18 09:55:36.290 [5866/start/129824 (pid 29590)] Task is starting.
2022-01-18 09:55:37.816 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status SUBMITTED)...
2022-01-18 09:55:43.460 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:56:13.792 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:56:43.806 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:57:14.121 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:57:20.872 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status STARTING)...
2022-01-18 09:57:43.245 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status FAILED)...
2022-01-18 09:57:43.607 [5866/start/129824 (pid 29590)] AWS Batch error:
2022-01-18 09:57:43.607 [5866/start/129824 (pid 29590)] CannotPullContainerError: Error response from daemon: pull access denied for foo, repository does not exist or may require 'docker login': denied: requested access to the resource is denied This could be a transient error. Use @retry to retry.
2022-01-18 09:57:43.685 [5866/start/129824 (pid 29590)]
2022-01-18 09:57:43.988 [5866/start/129824 (pid 29590)] Task failed.
2022-01-18 09:57:44.331 [5866/start/129824 (pid 29616)] Task fallback is starting to handle the failure.
2022-01-18 09:57:46.347 [5866/start/129824 (pid 29616)] @catch caught an exception from <flow _LinearFlow step start>
2022-01-18 09:57:46.348 [5866/start/129824 (pid 29616)] > Traceback (most recent call last):
2022-01-18 09:57:46.349 [5866/start/129824 (pid 29616)] > File "/Users/savin/Code/metaflow/metaflow/task.py", line 547, in run_step
2022-01-18 09:57:46.349 [5866/start/129824 (pid 29616)] > self._exec_step_function(step_func)
2022-01-18 09:57:46.349 [5866/start/129824 (pid 29616)] > File "/Users/savin/Code/metaflow/metaflow/task.py", line 53, in _exec_step_function
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] > step_function()
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] > File "/Users/savin/Code/metaflow/metaflow/plugins/catch_decorator.py", line 118, in fallback_step
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] > raise FailureHandledByCatch(retry_count)
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] > metaflow.plugins.catch_decorator.FailureHandledByCatch: Task execution kept failing over 1 attempts. Your code did not raise an exception. Something in the execution environment caused the failure.
2022-01-18 09:57:48.762 [5866/start/129824 (pid 29616)] Task finished successfully.
2022-01-18 09:57:49.804 [5866/a/129825 (pid 29624)] Task is starting.
2022-01-18 09:57:51.141 [5866/a/129825 (pid 29624)] [12107cab-d786-44ef-b3de-541fd924c23a] Task is starting (status SUBMITTED)...
2022-01-18 09:57:55.614 [5866/a/129825 (pid 29624)] [12107cab-d786-44ef-b3de-541fd924c23a] Task is starting (status RUNNABLE)...
2022-01-18 09:57:58.977 [5866/a/129825 (pid 29624)] [12107cab-d786-44ef-b3de-541fd924c23a] Task is starting (status STARTING)...
ancient-application-36103
01/18/2022, 6:13 PM
ancient-application-36103
01/18/2022, 6:13 PM
DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s
- from DMs
bulky-gpu-70936
01/18/2022, 9:08 PM
Internal Error: end step failed to run
(even though the end step is not the next step in the flow). When we run the exact same flow and no such failures occur, i.e. any failure that @catch handles at least gets as far as entering the code on Batch, the flow runs without issue.
tl;dr The @catch decorator does appear to work for us too. But catching errors where containers fail to launch seems to correlate with an error launching the next step of the flow.
@melodic-train-1526 could you share the error seen when the flow fails to move to the join step?