# dev-metaflow
m
Hi, I have some questions about orchestration of metaflow in production, in particular regarding deployment to kubernetes. We are currently considering different orchestration engines, among them Argo Workflows and Prefect, and I want to clarify a few things in my head.
1) With the Argo Workflows integration you are due to roll out, will it be possible to get the argo workflow manifests somehow? If we went with Argo Workflows, then we would probably be looking to take a GitOps-type approach. In particular, we would look to have the manifests in a Git repository and then deploy them to production through Argo CD.
2) Given the upcoming kubernetes integration, would it be possible to deploy metaflow flows to production through other orchestration engines, such as Prefect? Naively, I was thinking of something like having an initial task in Prefect that would spin up a kubernetes pod and then run the metaflow command from there. It might not be a very elegant solution, but I am just wondering if it is technically feasible?
I really like metaflow, so I would not like to block ourselves from using it in the future if we pick a workflow orchestrator that is not compatible with it.
a
@mammoth-rainbow-82717 re: 1 - yes, you would be able to get argo manifests easily. This would be quite similar to
python flow.py step-functions create --only-json
re: 2 - Yes, you can always embed the entire flow as a single task in Prefect (or any other scheduler). I believe that's the pattern that @worried-mechanic-36312 at Coveo uses currently.
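A minimal sketch of that embedding pattern, assuming Prefect 1.x; the flow/task names are hypothetical and flow.py stands in for your Metaflow file:

```python
import subprocess

from prefect import Flow, task

@task
def run_metaflow():
    # Shell out to Metaflow; --with kubernetes runs each step in its own
    # k8s pod (requires the upcoming kubernetes integration). check=True
    # fails the Prefect task if the Metaflow run fails.
    subprocess.run(
        ["python", "flow.py", "run", "--with", "kubernetes"],
        check=True,
    )

with Flow("metaflow-wrapper") as prefect_flow:
    run_metaflow()
```

From Prefect's point of view the whole Metaflow run is one opaque task; if you want the wrapper task itself to execute in a pod, Prefect's Kubernetes run configs should cover that.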
m
ok, cool
I thought that would be the case.
Thanks for clarifying! 🙂
f
@mammoth-rainbow-82717
If we went with Argo Workflows, then we would probably be looking to take on a GitOps type approach.
we’re using Argo Workflows with Metaflow in a quite similar setup. I’d recommend trying it out first before fully committing to the Argo CD approach. create, trigger and list-runs provide a very convenient interface.
One of the pain points for us with Argo CD is how to propagate template errors back to a user. It becomes an async process: there is a delay between a git commit and a sync, and the user has to go to Argo CD to find out the sync didn’t happen because of a problem in a template. With
python flow.py argo-workflows create
the user immediately gets an error message if something is wrong. And it’s a single command vs. something like
git add && git commit && git push
Second, the generated yaml/json templates are barely readable and diffs are mostly unusable. I personally treat them like generated code, almost a binary blob. As soon as I have a flow.py in a git repo, I can always re-generate a template from it. Hope this info is helpful.
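To make that concrete, the day-to-day loop is roughly the following (flow.py is a placeholder; these are the argo-workflows subcommands named above):

```
python flow.py argo-workflows create      # compile the flow and deploy the workflow template
python flow.py argo-workflows trigger     # start a run of the deployed template
python flow.py argo-workflows list-runs   # check the status of recent runs
```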
m
Hi Roman
Thanks for the really helpful feedback.
A couple of follow up questions.
Did you use Argo CD for the development process as well as staging and production? I am expecting people will use metaflow outside of a GitOps framework in a development setting.
Let me explain my intended workflow.
An ML team has two git repos, a model repo and a gitops repo. The model repo contains the plain model code, including the metaflow flows etc. The Gitops repo contains the actual deployment code, which in this case would be the manifests.
• An ML user is working on a model in development, during which they can use metaflow directly on the development kubernetes cluster without having to work through Argo CD.
• The user is happy with the model and wants to test it in staging. They commit their flow to the model repo and merge it into staging. At this point the manifest is created and a PR containing it is auto-generated for the gitops repo (see the sketch after this list). Once this is merged, the flow is deployed to staging.
• Similarly, once they are happy that it is working in staging, they deploy it to production through a similar process as the above.
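A rough sketch of that auto-generation step, assuming a CI job on the model repo's staging branch; the repo names and paths are made up, and --only-json on argo-workflows create is an assumption carried over from the step-functions analogue mentioned earlier:

```
# Hypothetical CI job triggered on merge to staging in the model repo
python flow.py argo-workflows create --only-json > flow-manifest.json  # assumed flag

# Copy the manifest into the gitops repo and open a PR for review
git clone git@example.com:ml-team/gitops-repo.git
cp flow-manifest.json gitops-repo/staging/flows/
cd gitops-repo
git checkout -b update-flow-manifest
git add staging/flows/flow-manifest.json
git commit -m "Update staging manifest"
git push origin update-flow-manifest
# ...then open the PR with your Git host's tooling
```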
Would be great to see if it is similar to what you had on your side. Would also be great to know which parts of your feedback would still be applicable to this type of approach.
I suppose the generated yaml/json will be close to unreadable no matter what (I was actually a bit worried about this, but am not too surprised that it is the case).
I'm hoping that if we keep the development process pretty open people will still have a smooth development experience though.
f
Did you use Argo CD for the development process as well as staging and production?
Originally, the idea was to use it for production setups only. The expectation was that users would use their own k8s clusters (even with KF) for development and, as soon as they were satisfied with a template, commit it to the staging/prod/etc. repo, like you described. What happened in reality is that it was significant overhead for teams to admin their own k8s clusters. They preferred to use the production clusters for development (in a separate account), and there they didn't have proper access to the internals of k8s. We got tons of tickets asking why a template didn't sync, didn't start, etc.
An ML user is working on a model in development, during which they can use metaflow directly on the development kubernetes cluster without having to work through Argo CD.
It could work, as long as your team maintains such a dev cluster for them.
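In that setup the dev loop stays a single command, with no Argo CD in the way (a sketch; --with kubernetes is the documented way to push steps onto the cluster):

```
python flow.py run --with kubernetes   # iterate directly against the dev cluster
```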
I suppose the generated yaml/json will be close to unreadable no matter what
yeah, but it's not a problem. Even manually written yaml templates, apart from some trivial ones, are hard to read. And metaflow's flows really shine here.
m
I see, thanks.
Yes, I would see our team maintaining the dev cluster for the other teams, so I think this would be ok in our case.
f
yup, that's where we ended so far too
m
Yeah, I can see that using the Argo CD approach in development would lead to lots of issues and generally wouldn't be fun for the users.
In terms of production, you feel it works ok though?
f
I don't have too much experience with the production setup; I'm pretty confident maintaining my local k8s cluster 🙂 We have some sporadic failures with sync, but overall it seems to be ok.
m
Naively, I feel Argo CD brings a lot to the table in general, but I am not super experienced in using it yet, to be honest.
ok, cool
It's really useful to get feedback from someone who has gone with this type of approach! 🙂
f
you're welcome. Feel free to ask more, would be glad to help.
w
@mammoth-rainbow-82717 , @square-wire-39606 is correct. We have a lot of dataOps before the ML stuff, which is especially true at a B2B company I think, so a general, not-ML-specific orchestrator is useful ... if you want a comparison between MF and a prefect+metaflow pipeline, this repo is a full open-source back-end:
https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat
m
oh wow, you have a whole repo of stuff.
Thanks, that's really useful
w
https://outerbounds-community.slack.com/archives/C020U025QJK/p1638527574109700?thread_ts=1638443025.102700&cid=C020U025QJK 😜 yes, and it's all open, 30M data points included if you want to test the scale, plus a SOTA transformer model for recs 😉 share it if you like it 😉