# dev-metaflow
m
Hi, another question from me! Having an issue with the @catch decorator: it works nicely when a Batch job fails inside the code, but if an error occurs that stops the Batch job from ever reaching that code, @catch seems to know the failure happened yet fails the job anyway. This means that after all the Batch jobs are done, the run crashes with no error message and the join step is never reached. Internally we think we know why the error occurs (i.e. some jobs never actually launch the code), but this doesn't feel like how Metaflow should behave. Does anyone have experience with this, or any thoughts on how I can alter the @catch decorator to handle it?
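For context, the shape of flow being discussed looks roughly like this. It is a minimal sketch with hypothetical flow, step, and artifact names, not the actual code from the thread; @catch stores any caught exception in an artifact and the join filters on it, following the pattern from the Metaflow docs:
```python
from metaflow import FlowSpec, step, catch, batch

class CatchOnBatchFlow(FlowSpec):

    @step
    def start(self):
        self.items = [1, 2, 3]
        self.next(self.process, foreach="items")

    # @catch stores the caught exception in self.failure; the artifact is
    # falsy when the step succeeds, so the join can filter on it.
    @catch(var="failure")
    @batch(cpu=1, memory=4000)
    @step
    def process(self):
        self.result = self.input * 2
        self.next(self.join)

    @step
    def join(self, inputs):
        # Tasks whose container never launched ran only the @catch fallback,
        # so they have `failure` set and no `result` artifact.
        self.results = [inp.result for inp in inputs if not inp.failure]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    CatchOnBatchFlow()
```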
s
let me check
Can you share the error that AWS Batch is running into? I am trying to replicate this issue.
m
So you probably can't easily replicate it; the error is to do with our Docker repository. The main point, though, is that if an error occurs at that stage in the process, @catch can't handle it.
a
Got it. Let me try to spoof the error.
I tried with a non-existent image -
2022-01-18 09:55:35.269 Workflow starting (run-id 5866):
2022-01-18 09:55:36.290 [5866/start/129824 (pid 29590)] Task is starting.
2022-01-18 09:55:37.816 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status SUBMITTED)...
2022-01-18 09:55:43.460 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:56:13.792 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:56:43.806 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:57:14.121 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status RUNNABLE)...
2022-01-18 09:57:20.872 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status STARTING)...
2022-01-18 09:57:43.245 [5866/start/129824 (pid 29590)] [029f6e95-7974-4471-88eb-10743e651ef4] Task is starting (status FAILED)...
2022-01-18 09:57:43.607 [5866/start/129824 (pid 29590)] AWS Batch error:
2022-01-18 09:57:43.607 [5866/start/129824 (pid 29590)] CannotPullContainerError: Error response from daemon: pull access denied for foo, repository does not exist or may require 'docker login': denied: requested access to the resource is denied This could be a transient error. Use @retry to retry.
2022-01-18 09:57:43.685 [5866/start/129824 (pid 29590)]
2022-01-18 09:57:43.988 [5866/start/129824 (pid 29590)] Task failed.
2022-01-18 09:57:44.331 [5866/start/129824 (pid 29616)] Task fallback is starting to handle the failure.
2022-01-18 09:57:46.347 [5866/start/129824 (pid 29616)] @catch caught an exception from <flow _LinearFlow step start>
2022-01-18 09:57:46.348 [5866/start/129824 (pid 29616)] >  Traceback (most recent call last):
2022-01-18 09:57:46.349 [5866/start/129824 (pid 29616)] >    File "/Users/savin/Code/metaflow/metaflow/task.py", line 547, in run_step
2022-01-18 09:57:46.349 [5866/start/129824 (pid 29616)] >      self._exec_step_function(step_func)
2022-01-18 09:57:46.349 [5866/start/129824 (pid 29616)] >    File "/Users/savin/Code/metaflow/metaflow/task.py", line 53, in _exec_step_function
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] >      step_function()
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] >    File "/Users/savin/Code/metaflow/metaflow/plugins/catch_decorator.py", line 118, in fallback_step
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] >      raise FailureHandledByCatch(retry_count)
2022-01-18 09:57:48.092 [5866/start/129824 (pid 29616)] >  metaflow.plugins.catch_decorator.FailureHandledByCatch: Task execution kept failing over 1 attempts. Your code did not raise an exception. Something in the execution environment caused the failure.
2022-01-18 09:57:48.762 [5866/start/129824 (pid 29616)] Task finished successfully.
2022-01-18 09:57:49.804 [5866/a/129825 (pid 29624)] Task is starting.
2022-01-18 09:57:51.141 [5866/a/129825 (pid 29624)] [12107cab-d786-44ef-b3de-541fd924c23a] Task is starting (status SUBMITTED)...
2022-01-18 09:57:55.614 [5866/a/129825 (pid 29624)] [12107cab-d786-44ef-b3de-541fd924c23a] Task is starting (status RUNNABLE)...
2022-01-18 09:57:58.977 [5866/a/129825 (pid 29624)] [12107cab-d786-44ef-b3de-541fd924c23a] Task is starting (status STARTING)...
If you can help me with the AWS Batch job response code (from the AWS Batch console) for the offending job, that will be great.
DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s
- from DMs
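As an aside, the "Use @retry to retry" hint in the Batch error above refers to stacking @retry under @catch, so transient platform errors such as this DockerTimeoutError get retried a few times before @catch gives up and records the failure. A rough sketch (step names, resources, and retry count are illustrative):
```python
from metaflow import FlowSpec, step, catch, retry, batch

class RetryThenCatchFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.compute)

    # Decorator order matters: @catch sits above @retry so the exception is
    # only caught once all retry attempts have been exhausted.
    @catch(var="compute_failed")
    @retry(times=3)
    @batch(memory=4000)
    @step
    def compute(self):
        self.x = 42
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    RetryThenCatchFlow()
```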
b
Just to add some detail to this. The @catch at that step doesn't appear to cause an error directly; in fact it behaves the same as the output above. However, when one of the Batch tasks fails with that error, the flow fails to run the following join step and simply returns something like
Internal Error: end step failed to run
(even though the end step is not the next step in the flow). If we run the exact same flow and there are no such failures, i.e. every @catch that does fire at least enters our code on Batch, the flow runs without issue. tl;dr: the @catch decorator does appear to work for us too, but catching errors where containers fail to launch seems to correlate with an error launching the next step of the flow. @melodic-train-1526 could you share the error seen when the flow fails to move to the join step?
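If it helps, one way to pull out that error is to walk the failed run with the Metaflow client API and print the exception and stderr tail of any unsuccessful task. The flow name and run id below are placeholders, not taken from the thread:
```python
from metaflow import Run

# Placeholder pathspec; substitute your own flow name and run id.
run = Run("MyFlow/5866")

for step in run:
    for task in step:
        if not task.successful:
            print("FAILED:", task.pathspec)
            print("exception:", task.exception)   # exception recorded by Metaflow, if any
            print((task.stderr or "")[-2000:])    # tail of the task's stderr log
```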