# dev-metaflow
s
👋 hi folks….big fan (and user) of Metaflow with AWS Step Functions, but really feeling the limitations of the 40 max concurrent AWS Batch jobs. I know there is an open issue to look into “DistributedMap” (https://github.com/Netflix/metaflow/issues/1216) and I’d love to help contribute to that effort. If anyone has any pointers on how to go about changing the current Step Functions “Map” to use “DistributedMap”, that would be really appreciated. Thanks!
👀 2
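For readers following along, here is a rough sketch of what switching a Step Functions "Map" state to distributed mode looks like at the state-definition level. It is not Metaflow's code and not the approach from the linked issue; the helper name `to_distributed_map` and the sample state are made up for illustration. The only facts it relies on are from the Amazon States Language: a distributed Map state carries an `ItemProcessor` whose `ProcessorConfig` has `Mode` set to `DISTRIBUTED`. Other fields (for example how inputs are selected and how results are collected) may need changes too and are not handled here.

```python
import copy
import json


def to_distributed_map(map_state, max_concurrency=1000):
    """Return a copy of an inline Map state rewritten for distributed mode.

    Hypothetical helper for illustration only; a real migration likely
    needs more changes (input selection, result handling, IAM, ...).
    """
    state = copy.deepcopy(map_state)
    # Older definitions keep the sub-workflow under "Iterator";
    # newer ones already use "ItemProcessor".
    if "Iterator" in state:
        processor = state.pop("Iterator")
    elif "ItemProcessor" in state:
        processor = state.pop("ItemProcessor")
    else:
        raise ValueError("Map state has no Iterator or ItemProcessor")
    # Distributed mode runs each iteration as a child workflow execution.
    processor["ProcessorConfig"] = {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "STANDARD",
    }
    state["ItemProcessor"] = processor
    state["MaxConcurrency"] = max_concurrency
    return state


# Made-up inline Map state, standing in for a generated foreach step.
inline = {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 40,
    "Iterator": {
        "StartAt": "Work",
        "States": {"Work": {"Type": "Pass", "End": True}},
    },
    "End": True,
}

distributed = to_distributed_map(inline)
print(json.dumps(distributed, indent=2))
```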
a
@salmon-exabyte-11054 would you like to join one of our weekly OSS kick-off meetings on Monday mornings where I can walk you through the relevant codebase?
s
Definitely! I’m free this Monday (10/9) if that works. I got some stuff working so it’d be great to dig into where I’m having trouble
a
invite sent! talk to you then!
🙌 1
s
@square-wire-39606 @bulky-afternoon-92433 👋 hey folks….I got a working (but probably not ideal) version of distributed map woohoo …when you folks have a chance could you look at the PR: https://github.com/Netflix/metaflow/pull/1576 Thank you!
🙌 3
@square-wire-39606 @bulky-afternoon-92433 Is there anything I can do to help get this PR reviewed?
a
is the PR ready for review?
@quick-lighter-52296 is this something you would like to help review?
s
The PR is more of a proof of concept to confirm this approach works for you folks
a
I can go through the PR this friday?
s
And I say proof of concept since it introduces some new functions that can be cleaned up. But didn’t want to pour more time into this if this is a bad implementation.
@square-wire-39606 amazing! Looking forward to your thoughts
a
makes sense!
s
@square-wire-39606 any thoughts on the PR? Anything I can do to make the review easier to digest?
@square-wire-39606 @bulky-afternoon-92433 bump 🙏 — happy if this isn’t a priority, but just want to figure out how to move forward (even if that means waiting until January, or something)
s
Hi! Sorry - let me look into this
🙌 1
s
@square-wire-39606 no apologies needed, you folks are busy, so I totally get it. Thank you for looking into it!
Running this myself, I was able to run a job with 100 max workers, but I never pushed it higher. Recently someone on my team did and ran into AWSBatch.TooManyRequestsException, so I’m going to look into that.
I worked through the issue by adding a retry and was able to kick off a job with --max-workers=1000 😄
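A minimal sketch of the kind of retry mentioned above, using exponential backoff with jitter. This is not the code from the PR: `TooManyRequestsError` is a hypothetical stand-in for the throttling error, and `call_with_backoff` and its parameters are invented for illustration. With a real AWS client the throttling error would need to be detected from whatever exception that client raises.

```python
import random
import time


class TooManyRequestsError(Exception):
    """Hypothetical stand-in for the throttling error from AWS Batch."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on throttling with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TooManyRequestsError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Double the delay ceiling each attempt, capped at max_delay,
            # then pick a random wait so many workers do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Usage: a fake submit call that is throttled twice before succeeding.
calls = {"n": 0}


def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequestsError()
    return "submitted"


waits = []
result = call_with_backoff(flaky_submit, sleep=waits.append)
print(result, "after", calls["n"], "attempts")
```

Passing `sleep` as a parameter is only there so the example can run without actually waiting.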
@square-wire-39606 I know this is a change to an important portion of the code, let me know if you’d prefer I work on something smaller before tackling this….like maybe using S3 instead of Dynamo — I believe you floated that idea when we spoke
FYI I signed up for an office hours talk to go over using the distributed map
a
Hi @ancient-application-36103, @salmon-exabyte-11054 We are feeling this limitation as well! One of our users is trying to run a flow that processes 16M files orchestrated by Step Functions, and the max concurrency of 50 enforced by Step Functions is becoming very problematic in terms of cost-effectiveness and efficiency of the flow. Wondering what the status is on this PR, which looks very promising.
a
hello! sorry i was ooo for a bit - this is a fairly involved change. realistically, it will be another week or so before i am able to shift focus to this work
a
Gotcha - thanks Savin - in the meantime I've advised our user to use larger instance types, increase their batch sizes, and use parallel_map, but it would be most helpful to have this enabled
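To make the "increase their batch sizes" workaround concrete, here is a small sketch that groups a file list into batches so that each branch of a fan-out handles many files instead of one. The helper `chunked`, the file names, and the batch size of 10,000 are assumptions for illustration, not details from the flow discussed above. Inside each branch, something like Metaflow's parallel_map could then spread a batch across local cores; that call is left out so the block runs without Metaflow installed.

```python
import math


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Tiny made-up example: 10 files grouped into batches of 4.
files = [f"file-{i}.bin" for i in range(10)]
batches = list(chunked(files, 4))
print([len(b) for b in batches])

# Back-of-the-envelope branch count for 16M files at an assumed
# batch size of 10,000 files per branch.
branches = math.ceil(16_000_000 / 10_000)
print(branches)
```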