# dev-metaflow
t
Currently, running only 12,000 jobs within a week results in an above 50% chance of error due to job name collisions. This could e.g. be caused by a foreach branch of 1,000 being run 12 times in a week. Not the kind of scale I hoped for. Remember that the TTL (time to live) is set to 7 days by default. (I am aware that we can change that with a Metaflow extension, btw.) I am quite surprised that other people are not having issues with this? Would you guys be up for doing something about this? Would be happy to help!

Solution? Adding more characters to the job name? With e.g. 8 characters instead of 5 it would take approx. 3,000,000 jobs to reach a 50% chance of errors. Personally I would go for 10 characters.

By the way, the computation is done using the Birthday Problem formula "(1)" from Wikipedia like so, hope it is correct:
```python
import numpy as np

# Exact birthday-problem probability of at least one collision among
# 12,000 jobs drawing 5-character names from a 40-character alphabet.
n_names = 40**5
n_jobs = 12000
1 - np.prod([(n_names - j) / n_names for j in range(n_jobs)])
```

Out[]: 0.5049486924884223
```python
# Same computation with 8-character names and 3,000,000 jobs.
n_names = 40**8
n_jobs = 3000000
1 - np.prod([(n_names - j) / n_names for j in range(n_jobs)])
```

Out[]: 0.4967385102900259
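As a sanity check (not part of the original computation), the same numbers can also be approximated with the closed-form birthday bound P ≈ 1 − exp(−n(n−1)/2N). The helper below is just an illustrative sketch, reusing the 40-character alphabet from the examples above:

```python
import math

def collision_probability(n_jobs, name_chars, alphabet_size=40):
    """Approximate chance that at least two of n_jobs share a name,
    using the birthday-problem approximation 1 - exp(-n*(n-1)/(2*N))."""
    n_names = alphabet_size ** name_chars
    return 1 - math.exp(-n_jobs * (n_jobs - 1) / (2 * n_names))

print(collision_probability(12_000, 5))       # ~0.505, matches the exact product above
print(collision_probability(3_000_000, 8))    # ~0.497
print(collision_probability(3_000_000, 10))   # ~0.00043 with 10 characters
```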
b
Noticed that you found the PR from Tyler already addressing this issue (https://github.com/Netflix/metaflow/pull/1588). Have you tried it out yet? This should be straightforward and can go in the next release (after 2.10.3, as that flew out already).
t
I did not try it out, but I would expect it to work. However, I have the feeling that the core developers of Metaflow would probably not accept adding 36 characters to the job name. After all, there must be a reason why the current job names are so short. This is why I instead wanted to get a conversation going, so that we could reach agreement on the solution first, and maybe also pull a few more people into the conversation.
👌 1
b
That's actually my only slight concern with the PR as well, as a full-blown uuid + apiserver-generated part is a bit overkill considering the entropy, and it makes the terminal output quite lengthy. I'd sweep this under cosmetics though, as it's not that far off compared to running `--with batch`. Will see what others think of it before proceeding, but presumably you're not opposed to even the lengthy IDs?
t
I don't have enough experience with Kubernetes to fully understand the downsides of long names. I would imagine that long names might get truncated in different places, e.g. in terminal output and in tools such as K9s, and since the pod names are "{job name}-{5 characters}", this would probably mean that the pod names would be truncated in such a way that two pods from the same job would be indistinguishable. Two pods originating from the same job would occur e.g. on retries, I believe. A toy sketch of that concern is below.
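To make the truncation worry concrete (a toy illustration only; the 5-character pod suffix format is from my message above, while the job name, suffixes, and column width are made up, not Kubernetes constants):

```python
# Hypothetical long job name, e.g. after appending a 36-character uuid.
job_name = "demo-flow-step-abc123-" + "x" * 36

# Pods from the same job (e.g. a retry) differ only in their 5-character suffix.
pod_a = f"{job_name}-k7f2q"
pod_b = f"{job_name}-m9r4t"

# If a UI or terminal column truncates names to a fixed width, the suffixes are
# cut off and the two pods become indistinguishable.
COLUMN_WIDTH = 50
print(pod_a[:COLUMN_WIDTH])
print(pod_b[:COLUMN_WIDTH])  # prints the same string as pod_a
```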
b
@elegant-beach-10818 there's slight pushback on the full-length uuid in favor of something shorter or more descriptive. I commented on the PR with an alternative, but also poking here for visibility 😉 Sidenote: it seems like there are considerations on the Kubernetes side on this issue as well: https://github.com/kubernetes/kubernetes/issues/115489 but the underlying issue should be easily tackled by us with a more honed prefix.
πŸ‘ 1
👀 1
t
Let's move the conversation to the PR 👍
b
https://github.com/Netflix/metaflow/releases/tag/2.10.4 is out now and includes this fix, cheers all for the contributions 🙂
t
Amazing! Thank you @brave-flag-76472 🙌