# ask-metaflow
s
Hi team, we are deploying Metaflow on Step Functions and AWS Batch (with a Fargate-based compute environment). We are running multiple executions of the same flow as a load test, and we noticed that the job run times are abnormal. Some jobs complete in around 2 to 3 minutes, while others show a runtime of 2 hours. When we checked the logs, we could see that the step completed in 2 minutes (based on the logs we added to the step). After the step completed, the job kept running for 2 hours for no apparent reason. Has anyone else faced this, or does anyone have an idea why? Let me know if you need more details on this.
a
Can you share more details on what these jobs were doing?
And are you able to replicate it reliably?
s
It's just a simple job with only start and end steps. The start step invokes an API and logs the response. We added multiple log lines to the start step so there is enough information to confirm that the step completed. In CloudWatch we can see that the API is invoked, the response is logged, and the last log line of the step, stating that the step completed, is also present. The same flow is invoked about 1000 times, so I can compare the job run times for the same start step. We noticed that the job run times are inconsistent and vary abnormally. For example: job started at Sep 09, 2025, 16:30:07; stopped at Sep 09, 2025, 16:58:53; total runtime 28 minutes, 46 seconds. When we checked the CloudWatch logs for this job, the "job completed" log line was at Sep 09, 2025, 16:32:25, which is around 2 minutes after the start. This log timing is consistent across all the jobs, but the total runtime is inconsistent and went up to 2.5 hours.
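To make the gap in that example concrete, here is a minimal sketch of the arithmetic. It assumes the timestamps are read as HH:MM:SS (so the job started at 16:30:07, stopped at 16:58:53, and logged completion at 16:32:25); the variable names are only illustrative.

```python
from datetime import datetime

# Timestamp layout used in the example above, read as HH:MM:SS (assumption).
FMT = "%b %d, %Y, %H:%M:%S"


def parse(ts: str) -> datetime:
    """Parse a timestamp such as 'Sep 09, 2025, 16:30:07'."""
    return datetime.strptime(ts, FMT)


job_started = parse("Sep 09, 2025, 16:30:07")
job_stopped = parse("Sep 09, 2025, 16:58:53")
step_done_log = parse("Sep 09, 2025, 16:32:25")  # last "step completed" log line

total_runtime = job_stopped - job_started      # what the Batch console reports
step_runtime = step_done_log - job_started     # time until the step logged completion
idle_after_step = job_stopped - step_done_log  # time the job lingered afterwards

print("total runtime:  ", total_runtime)
print("step runtime:   ", step_runtime)
print("idle after step:", idle_after_step)
```

Under that reading, almost all of the reported runtime falls after the step's final log line, which is the part that needs explaining.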
i
We can reproduce it reliably, in the sense that every time we run the load test we see this behaviour to varying degrees.
f
I think this is due to the load test setup. If you're running 1000 parallel jobs, there can be network issues / throttling which could lead to sfn timing out, and then batch is sitting there waiting for sfn to call back
you can poke around the step functions console and look at the execution history and timeouts
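If clicking through the console for 1000 executions is impractical, the same history can be pulled with boto3. This is only a rough sketch: `fetch_history` and `suspect_events` are hypothetical helper names, the set of event types to flag is my own guess at what is worth looking at, and it assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Dict, Iterable, List

# Step Functions history event types that point at a timeout or failure.
# Which ones matter for this problem is an assumption, adjust as needed.
SUSPECT_TYPES = {
    "TaskTimedOut",
    "ExecutionTimedOut",
    "TaskFailed",
    "ExecutionFailed",
    "ExecutionAborted",
}


def suspect_events(events: Iterable[Dict]) -> List[Dict]:
    """Keep only the history events whose type is in SUSPECT_TYPES."""
    return [e for e in events if e.get("type") in SUSPECT_TYPES]


def fetch_history(execution_arn: str) -> List[Dict]:
    """Fetch the full execution history for one Step Functions execution."""
    import boto3  # imported lazily so the filter above works without boto3

    client = boto3.client("stepfunctions")
    paginator = client.get_paginator("get_execution_history")
    events: List[Dict] = []
    for page in paginator.paginate(executionArn=execution_arn):
        events.extend(page["events"])
    return events
```

Usage would be something like `suspect_events(fetch_history(arn))` for each execution ARN from the load test, then comparing the event timestamps with the Batch job stop times.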
I ran metaflow on batch for years and years and this happened a few times - if it occurs during normal load that's another thing, but the load test I think is the main clue
you can check how your networking is set up for step functions - see if you have a VPC interface endpoint set up or just NAT. if it's a NAT gateway, you can probably solve this issue by using the sfn endpoint
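One way to check for that endpoint from a script is sketched below. It assumes boto3 is installed with credentials configured; the function names and the `vpc_id` / `region` arguments are illustrative, and pagination of the response is ignored for brevity.

```python
from typing import Dict, List


def sfn_endpoint_service_name(region: str) -> str:
    """Service name of the Step Functions interface endpoint in a region."""
    return f"com.amazonaws.{region}.states"


def find_sfn_interface_endpoints(vpc_id: str, region: str) -> List[Dict]:
    """List interface endpoints for Step Functions in the given VPC."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_vpc_endpoints(
        Filters=[
            {"Name": "vpc-id", "Values": [vpc_id]},
            {"Name": "service-name", "Values": [sfn_endpoint_service_name(region)]},
        ]
    )
    return [
        ep for ep in resp["VpcEndpoints"] if ep["VpcEndpointType"] == "Interface"
    ]
```

An empty result for the VPC that hosts the Batch compute environment would suggest that calls to Step Functions are going out through the NAT gateway instead.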
s
Cool, thanks for the explanation. We will check this by running the load test again.