# ask-metaflow
k
Hi Outerbounds, I have a question about the behavior of `retry` when used with Step Functions. I am interested in using `retry(..., minutes_between_retries=...)` to avoid periods of likely Spot interruptions. I understand that `retry` is performed at the Batch level, not within the state machine definition. But the `time.sleep` here: is that happening within the Batch job? Stepping through the logic there, it looks like `time.sleep` happens before a Batch job submission, so I'm confused about what process is doing the sleeping, and specifically whether it's happening within a Batch job or by something orchestrating the Batch job. (It's possible I just have a poor understanding of how SFN works, or that I'm looking in the wrong place 🙂) Anyway, I'd be very grateful for a little help in understanding how `retry` works with SFN and AWS Batch. Thanks!
b
To flesh out just a bit the motivation for the above question... We've noticed less than a minute between a Batch job's Attempt i stop time and its Attempt (i + 1) start time, despite having specified an inter-retry interval of 30 minutes thusly:
python my_flow.py \
    --with retry:minutes_between_retries=30 \
    step-functions create \
    ...
This has led us to question our understanding of how `@retry`'s `minutes_between_retries` parameter is intended to work.
s
Sorry for the delay. If you are looking for the retry control for Step Functions, this is the code snippet. When we integrated Step Functions with AWS Batch, there wasn't any support for `minutes_between_retries` in Step Functions, hence that field is ignored.
Unfortunately, that is still the case with Step Functions today.
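To make the distinction concrete, here is a minimal sketch of the two behaviors discussed in this thread. It is not Metaflow's code: `run_with_local_retries` and `batch_retry_strategy` are hypothetical names made up for illustration, and the `times + 1` attempt count is an assumption. The first function shows an orchestrating process sleeping between attempts. The second shows why the interval gets dropped when retries are handed to Batch: as far as I know, a Batch retry strategy carries an attempt count but no delay between attempts.

```python
import time
from typing import Callable, Dict, List


def run_with_local_retries(
    task: Callable[[int], None],
    times: int,
    minutes_between_retries: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Hypothetical orchestrator-side loop: the process that launches each
    attempt is the one that sleeps, before submitting the next attempt."""
    for attempt in range(times + 1):
        if attempt > 0:
            # The wait happens outside the job, ahead of the resubmission.
            sleep(minutes_between_retries * 60)
        try:
            task(attempt)
            return attempt  # index of the attempt that succeeded
        except Exception:
            if attempt == times:
                raise  # retries exhausted, surface the last failure
    raise AssertionError("unreachable")


def batch_retry_strategy(times: int, minutes_between_retries: float) -> Dict[str, int]:
    """Hypothetical mapping of retry settings onto a Batch-style retry
    strategy. Only an attempt count is expressed (assumed here to be
    times + 1), so minutes_between_retries is accepted but unused."""
    return {"attempts": times + 1}


# Simulate a task interrupted twice, recording sleeps instead of waiting.
sleeps: List[float] = []


def flaky(attempt: int) -> None:
    if attempt < 2:
        raise RuntimeError("simulated Spot interruption")


succeeded_on = run_with_local_retries(
    flaky, times=3, minutes_between_retries=30, sleep=sleeps.append
)
print(succeeded_on, sleeps)
print(batch_retry_strategy(times=3, minutes_between_retries=30))
```

In the first case the 30-minute waits show up between attempts. In the second there is nothing for the interval to map onto, which would be consistent with the sub-minute gap between attempts reported above.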
b
I see. Thanks for the explanation!
k
Yes, thanks!
b
Feel free to disregard, but I threw up a PR adding a line to the docstring noting this fact.
s
thanks! we usually try to keep the decorators that are common across multiple stacks free of any references to those stacks. let me think of a way to modify the docs differently.
b
OK, sounds good. Thanks!
Perhaps, rather than indicating where `minutes_between_retries` doesn't apply, as in my closed PR, it would be appropriate to indicate where it does apply, as in this example I just remembered. Obviously, whatever you think is best. Just figured I'd link to a potential precedent.