# dev-metaflow
c
Hey - I am trying to write a bit of resume logic into my flow to help with debugging long flows in prod (SFN). I am getting the user to input a resume config as a parameter, something like:
```python
# e.g. {'run_id': 'sfn-76a57bef-5f83-70b0-4284-e9807e49998f',
#       'last_successful_step': 'create_jockey_features'}
resume_config = Parameter("resume_run_id", type=JSONType, default=None)
```
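(For illustration, a JSON parameter like this would be passed on the command line roughly as follows; the exact flag name is an assumption based on the parameter name above:)
```bash
python flow.py run --resume_run_id \
    '{"run_id": "sfn-76a57bef-5f83-70b0-4284-e9807e49998f", "last_successful_step": "create_jockey_features"}'
```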
I also set a variable `has_resumed` in case the flow has already resumed. Then I have a check at the start of each step:
```python
if self.resume_config and not self.has_resumed:
    if current.step_name == self.resume_config["last_successful_step"]:
        successful_steps = get_flow_run_successful_steps(
            run_id=self.resume_config["run_id"], flow_name=current.flow_name
        )
        step = [s for s in successful_steps if s.id == current.step_name][0]
        data = step.task.data  # how to set properly
        self.has_resumed = True

    self.next(self.process_data)
```
I was just wondering, is there a way I can actually set the entire `step.task.data` object for that step, instead of having to manually get each data object and set it with
```python
self.df = some_code_to_get_previous_flow_run_data.step.data
```
(if you follow my drift)
maybe through the context object or something like that?
a
You can consider iterating over the `MetaflowData` object exposed by `task.data` and making the assignments?
c
yeah - then I just have to be specific, which is what I was hoping to avoid - hence the question 😉
Hoping I can avoid
```python
self.df = step.task.data.df
self.df2 = step.task.data.df2
```
Is there any way I can do something like the following?
```python
self.data = step.task.data
```
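(For reference, a minimal sketch of what the iteration suggested above could look like inside a step, without naming each artifact. The single-task, non-foreach assumption and the pathspec construction are mine; in practice you may also want to filter out artifacts you don't intend to overwrite:)
```python
from metaflow import Step, current

# Copy every artifact of the prior run's step onto self. Iterating a client
# Task yields DataArtifact objects with .id (the name) and .data (the value).
cfg = self.resume_config
task = Step(f"{current.flow_name}/{cfg['run_id']}/{cfg['last_successful_step']}").task
for artifact in task:
    setattr(self, artifact.id, artifact.data)
```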
f
Could you clarify a bit more what you're trying to achieve? If you'd just like to check if a flow is being resumed, you can look at the `current` object for the `origin_run_id`, e.g.
```python
# Check if there's an origin run id for the case of resumes instead of retries.
run_id = current.origin_run_id
if not run_id:
    run_id = current.run_id
```
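(In context, a minimal illustrative flow using that check might look like this; the flow and step names are hypothetical:)
```python
from metaflow import FlowSpec, step, current

class ExampleFlow(FlowSpec):
    @step
    def start(self):
        # current.origin_run_id is set only when this run was started via
        # `resume`; otherwise fall back to the current run id.
        run_id = current.origin_run_id or current.run_id
        print(f"working against run {run_id}")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ExampleFlow()
```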
c
I have previously been told that there is no way to resume a flow in an AWS environment (a Step Function), so I am trying to change my flow so that this is possible.
so when I resume a flow, I need the flow to skip the steps, and then when it gets to the step that was successful before failing in the previous run, it needs to load that data into the step so that the next steps can run OK
so when I load that data, I am looking for a way to do it that is nicer than listing out all the data artifacts one by one - instead assigning all of them, which I access through `Step("something").task.data`
Does that make sense?
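(Sketching the pattern being described as a method you could put on the flow class; all names are illustrative, and the artifact-copying loop is the one shown earlier:)
```python
from metaflow import Step, current

def maybe_restore(self):
    """Illustrative guard: returns True if this step's work can be skipped."""
    if not self.resume_config or self.has_resumed:
        return False  # normal run, or already past the restore point
    if current.step_name != self.resume_config["last_successful_step"]:
        return True   # a step before the restore point: skip its work
    # At the last successful step of the failed run: pull its artifacts in.
    task = Step(
        f"{current.flow_name}/{self.resume_config['run_id']}/{current.step_name}"
    ).task
    for artifact in task:
        setattr(self, artifact.id, artifact.data)
    self.has_resumed = True
    return True

# Usage inside a step (illustrative):
#     @step
#     def create_features(self):
#         if not self.maybe_restore():
#             ...  # the step's real work
#         self.next(self.train_model)
```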
f
hmm gotcha, for what it's worth you can certainly resume a failed SFN execution - e.g.
```bash
python flow.py resume --origin-run-id sfn-248a324-23ec23h49-42fea39
```
and also include a `--production` tag if you have particular behavior for that. the caveat is that the machine you run the resume from will act as the job scheduler, rather than Step Functions, so it's like running a flow with `@batch` decorators. if I'm understanding correctly, you need the resumed jobs to also be using Step Functions as the job scheduler instead of another machine?
c
we just need the jobs to resume and use the namespace that they were instantiated from
will that behaviour occur, or will it default to my local machine's namespace?
f
yup it'll be resumed within the namespace of the local machine. that said, I think there's some work in flight that'll allow for tags to be modified; not sure if that will also include the option of modifying the user/namespace upon a resume
c
ok - so now that we have that sorted, does the context of my initial question make a bit more sense?
Hi @ancient-application-36103, any more comments on this thread or tips?
a
@careful-dress-39510 you can do
```bash
python flow.py resume --origin-run-id <sfn-id> --tag <tag>
```
where the tag is the namespace you are interested in.
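(As a concrete but hypothetical example, using the run id from earlier in the thread; the tag value is a placeholder for whatever namespace tag the original run carries:)
```bash
python flow.py resume --origin-run-id sfn-76a57bef-5f83-70b0-4284-e9807e49998f \
    --tag 'production:<your-token>'
```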
c
Game changer! Is this new @ancient-application-36103?
And it will resume with any code changes I make locally, but in the AWS env?
If I have the batch decorators etc. on the local code?
a
correct
if you use the batch/kubernetes decorator, it will run that step on batch/kubernetes
this functionality has existed since day 1 🙂
c
Ah ok, very strange - I think I've asked this question before and received a different answer
I tagged you in the previous discussion @ancient-application-36103
q
@brief-kite-90012 @ancient-application-36103 correct me if I'm wrong, but adding in the tag argument with a value of production will rerun a failed SFN on AWS instead of locally, right?