# dev-metaflow
h
Investigating a recent bug and it looks like the way metaflow+argo handles retries caused some issues:
1. trigger flow on argo with retries
2. all 4 attempts fail
3. manually retry using argo UI / CLI
4. task attempt 5 is successful
   a. after manual retry, the argo workflow resolves `{{attempt}}` to be `0`
   b. `{{attempt}}` is passed into command as `MF_ATTEMPT=0`
   c. metaflow overwrites files in s3 for attempt `0` instead of creating a new attempt `4` (rough sketch below)
5. next task in DAG fails because it cannot fetch the artifacts from latest attempt, and instead fetches artifacts from failed attempt=3
Is this something you are aware of already? Can we figure out an option to be able to do manual retries with argo?
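To make 4a-c concrete, here is a rough sketch of what seems to be happening - the path layout and helper names are made up for illustration, not Metaflow's actual code:

```python
# Sketch of the failure mode: the task trusts whatever attempt number Argo
# templated into its environment, so a manual retry starts back at attempt 0.
import os

def attempt_from_env() -> int:
    # After a manual Argo retry, {{attempt}} resolves to 0 again,
    # so MF_ATTEMPT=0 even though attempts 0-3 already exist in s3.
    return int(os.environ.get("MF_ATTEMPT", "0"))

def artifact_key(run_id: str, step: str, task_id: str, attempt: int) -> str:
    # Hypothetical key layout, just to show the collision.
    return f"s3://my-bucket/{run_id}/{step}/{task_id}/{attempt}/artifacts"

attempt = attempt_from_env()                         # -> 0 on the manual retry
key = artifact_key("run-1", "train", "42", attempt)
# Writing to `key` overwrites the attempt-0 files instead of creating attempt 4,
# so the next step still resolves "latest" to the failed attempt 3.
```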
a
Hi Cole - retrying a failed flow from Argo is not yet supported. It’s an open issue - happy to take a contribution 🙂
Retrying using resume would work though
h
Ah ok, good to know. I think it works when you don't use `@retry` in some cases since all attempts go to 0. Will take a look at what options we have here then
Small update, @ancient-guitar-13766 started looking into options here. It seems reasonable to fetch the latest attempt from an external source of truth instead of relying on the argo workflow template to pass the correct `--retry-count` argument. Would you have any concerns with that approach?
Follow-up question: It seems like metaflow iterates over s3 files to find the latest attempt number. Couldn't we also fetch this from the metaflow backend metadata? Is there any preference for whether s3 or the metaflow api should be the source of truth?
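If s3 ends up being the source of truth, this is roughly the kind of lookup we have in mind - bucket and prefix layout here are placeholders, not Metaflow's real code:

```python
import boto3

def latest_attempt(bucket: str, task_prefix: str) -> int:
    """Return the highest attempt number that already has objects under the task prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    attempts = []
    for page in paginator.paginate(Bucket=bucket, Prefix=task_prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            # e.g. "<task_prefix>3/" -> attempt 3
            part = cp["Prefix"][len(task_prefix):].strip("/")
            if part.isdigit():
                attempts.append(int(part))
    return max(attempts) if attempts else -1

# A manually retried task would then run as latest_attempt(...) + 1,
# regardless of what the workflow template resolved {{attempt}} to.
```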
a
nice! re: s3 - we actively avoid reading from the metadata service since it can very easily become a bottleneck - reading from s3 helps us easily scale to concurrency of millions of tasks
are you open to a quick chat re: implementation for this feature - we would love to get this in metaflow
h
Yeah would be great to get some tips from the experts 🙂
a
how is monday at 10a PT?
h
@ancient-guitar-13766 is out on Monday, how about tuesday/thursday 9am or 10am PT?
a
yeah - thursday 9a works best!
👍 1
a
Hey everyone, thanks for the chat last week! Here is the improved version of the proposal we discussed, to make Argo retries work with Metaflow retries: 🔗 Draft - GitHub PR
I've added a description of the issue along with steps to reproduce it. It would be great if you could review the solution and share your thoughts on whether we can proceed with this approach. Our long-term objective is to get this merged upstream. If the direction looks good, our next step will be to remove `retry_count` from the step command-line interface. We tested the solution for an error in a linear flow and plan to spend more time covering additional use cases.
Regarding `current.retry_count` with our solution applied: for an Argo-retried workflow, the `retry_count` appears to update correctly. In this case, the flow was retried by Argo after three failed attempts.
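For reference, the test was roughly a flow like this (a sketch with a hypothetical flow name, not the PR code itself) - the step fails for every `@retry` attempt so only an Argo-level retry can finish it:

```python
from metaflow import FlowSpec, step, retry, current

class RetryCountFlow(FlowSpec):

    @retry(times=3)
    @step
    def start(self):
        # current.retry_count should keep increasing across Metaflow and Argo retries.
        print("attempt:", current.retry_count)
        if current.retry_count < 4:
            # Fail attempts 0-3 so the task exhausts @retry; with the fix,
            # a manual Argo retry should run as attempt 4 and succeed.
            raise RuntimeError("failing on purpose to exhaust @retry")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    RetryCountFlow()
```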
a
hi! thanks for the PR! re: max-attempts - it will be hard for us to increase the number of max-attempts allowed since it will increase the latency of every single task start as well as slow down the UI. we could support `retry` till the current `max-attempts` are reached - wdyt?
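i.e. something along these lines - names and the limit are just illustrative, not the actual implementation:

```python
# Illustrative only: keep the existing ceiling and refuse attempts beyond it.
MAX_ATTEMPTS = 6  # whatever the deployment's current limit is

def next_attempt(latest_existing_attempt: int) -> int:
    attempt = latest_existing_attempt + 1
    if attempt >= MAX_ATTEMPTS:
        raise RuntimeError(
            f"attempt {attempt} would exceed max-attempts={MAX_ATTEMPTS}; "
            "manual Argo retries stop here instead of growing the attempt range"
        )
    return attempt
```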
h
I agree that splitting up those concerns into 2 separate tasks makes sense.
> it will increase the latency of every single task start as well as slow down the UI
Question though - do you think it is worth looking into how to improve the performance here? I'd be willing to look into the code and run some benchmarks to see if we can help improve this
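Roughly the kind of benchmark I'm imagining - the bucket, key layout, and probing strategy below are all assumptions just to frame the measurement, not how Metaflow actually does the lookup:

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def probe_attempts(bucket: str, task_prefix: str, max_attempts: int):
    """Check every possible attempt slot for a done-marker (the pessimistic lookup)."""
    found = []
    for attempt in range(max_attempts):
        try:
            s3.head_object(Bucket=bucket, Key=f"{task_prefix}{attempt}/attempt_done")
            found.append(attempt)
        except ClientError:
            pass  # marker missing for this attempt slot
    return found

# Time how the per-task lookup scales as the attempt ceiling grows.
for max_attempts in (4, 6, 12, 24):
    start = time.perf_counter()
    probe_attempts("my-bucket", "runs/run-1/train/42/", max_attempts)
    print(max_attempts, f"{time.perf_counter() - start:.3f}s")
```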