# dev-metaflow
m
Hi, I have some questions about orchestration of metaflow in production, in particular regarding deployment to kubernetes. We are currently considering different orchestration engines, among them Argo Workflows and Prefect, and I want to clarify a few things in my head.
1) With the Argo Workflows integration you are due to roll out, will it be possible to get the argo workflow manifests somehow? If we went with Argo Workflows, then we would probably be looking to take a GitOps-type approach. In particular, we would look to have the manifests in a Git repository and then deploy them to production through Argo CD.
2) Given the upcoming kubernetes integration, would it be possible to deploy metaflow flows to production through other orchestration engines, such as Prefect? Naively, I was thinking of something like having an initial task in Prefect that would spin up a kubernetes pod and then run the metaflow command from there. It might not be a very elegant solution, but I am just wondering if it is technically feasible?
I really like metaflow, so I would not like to block ourselves from using it in the future if we pick a workflow orchestrator that is not compatible with it.
a
@mammoth-rainbow-82717 re: 1 - yes, you would be able to get argo manifests easily. This would be quite similar to
python flow.py step-functions create --only-json
re: 2 - Yes, you can always embed the entire flow as a single task in Prefect (or any other scheduler). I believe that's the pattern that @worried-mechanic-36312 at Coveo uses currently.
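A minimal sketch of that embedding pattern, assuming Prefect 1.x; the flow/task names are hypothetical and flow.py stands in for your Metaflow file:

```python
import subprocess

from prefect import Flow, task

@task
def run_metaflow():
    # Shell out to Metaflow; --with kubernetes runs each step in its own
    # k8s pod (requires the upcoming kubernetes integration). check=True
    # fails the Prefect task if the Metaflow run fails.
    subprocess.run(
        ["python", "flow.py", "run", "--with", "kubernetes"],
        check=True,
    )

with Flow("metaflow-wrapper") as prefect_flow:
    run_metaflow()
```

From Prefect's point of view the whole Metaflow run is one opaque task; if you want the wrapper task itself to execute in a pod, Prefect's Kubernetes run configs should cover that.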
m
ok, cool
I thought that would be the case.
Thanks for clarifying! 🙂
f
@mammoth-rainbow-82717
If we went with Argo Workflows, then we would probably be looking to take on a GitOps type approach.
we’re using Argo Workflows with Metaflow in a quite similar setup. I’d recommend trying it out first before fully committing to the Argo CD approach. create, trigger and list-runs provide a very convenient interface.
One of the pain points for us with Argo CD is how to propagate template errors back to a user. It becomes an async process: there is a delay between a git commit and a sync, and the user has to go to Argo CD to find out the sync didn’t happen because of a problem in a template. With
python flow.py argo-workflows create
the user immediately gets an error message if something is wrong. And it’s a single command vs. something like
git add && git commit && git push
Second, the generated yaml/json templates are barely readable and diffs are mostly unusable. I personally treat them like generated code, almost a binary blob. As soon as I have a flow.py in a git repo, I can always re-generate a template from it. Hope this info is helpful.
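To make that concrete, the day-to-day loop is roughly the following (flow.py is a placeholder; these are the argo-workflows subcommands named above):

```
python flow.py argo-workflows create      # compile the flow and deploy the workflow template
python flow.py argo-workflows trigger     # start a run of the deployed template
python flow.py argo-workflows list-runs   # check the status of recent runs
```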
m
Hi Roman
Thanks for the really helpful feedback.
A couple of follow up questions.
Did you use Argo CD for the development process as well as staging and production? I am expecting people will use metaflow outside of a GitOps framework in a development setting.
Let me explain my intended workflow.
An ML team has two git repos, a model repo and a gitops repo. The model repo contains the plain model code, including the metaflow flows etc. The Gitops repo contains the actual deployment code, which in this case would be the manifests.
• An ML user is working on a model in development, during which they can use metaflow directly on the development kubernetes cluster without having to work through Argo CD.
• The user is happy with the model and wants to test it in staging. They commit their flow to the model repo and merge it into staging. At this point the manifest is created and a PR containing it is auto-generated for the gitops repo (see the sketch after this list). Once this is merged, the flow is deployed to staging.
• Similarly, once they are happy that it is working in staging, they deploy it to production through a similar process as the above.
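A rough sketch of that auto-generation step, assuming a CI job on the model repo's staging branch; the repo names and paths are made up, and --only-json on argo-workflows create is an assumption carried over from the step-functions analogue mentioned earlier:

```
# Hypothetical CI job triggered on merge to staging in the model repo
python flow.py argo-workflows create --only-json > flow-manifest.json  # assumed flag

# Copy the manifest into the gitops repo and open a PR for review
git clone git@example.com:ml-team/gitops-repo.git
cp flow-manifest.json gitops-repo/staging/flows/
cd gitops-repo
git checkout -b update-flow-manifest
git add staging/flows/flow-manifest.json
git commit -m "Update staging manifest"
git push origin update-flow-manifest
# ...then open the PR with your Git host's tooling
```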
Would be great to see if it is similar to what you had on your side. Would also be great to know which parts of your feedback would still be applicable to this type of approach.
I suppose the generated yaml/json will be close to unreadable no matter what (I was actually a bit worried about this, but am not too surprised that it is the case).
I'm hoping that if we keep the development process pretty open people will still have a smooth development experience though.
f
Did you use Argo CD for the development process as well as staging and production?
Originally, the idea was to use it for production setups only. The expectation was that users would use their own k8s clusters (even with KF) for development and, as soon as they were satisfied with a template, commit it to the staging/prod/etc. repo, like you described. What happened in reality is that it was significant overhead for teams to admin their own k8s clusters. They preferred to use the production clusters for development (in a separate account), and there they didn't have proper access to the internals of k8s. We got tons of tickets asking why a template didn't sync, didn't start, etc.
An ML user is working on a model in development, during which they can use metaflow directly on the development kubernetes cluster without having to work through Argo CD.
It could work, as long as your team maintains such a dev cluster for them.
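In that setup the dev loop stays a single command, with no Argo CD in the way (a sketch; --with kubernetes is the documented way to push steps onto the cluster):

```
python flow.py run --with kubernetes   # iterate directly against the dev cluster
```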
I suppose the generated yaml/json will be close to unreadable no matter what
yeah, but it's not a problem. Even manually written yaml templates, apart from some trivial ones, are hard to read. And metaflow's flows really shine here.
m
I see, thanks.
Yes, I would see our team maintaining the dev cluster for the other teams, so I think this would be ok in our case.
f
yup, that's where we ended so far too
m
Yeah, I can see that using the Argo CD approach in development would lead to lots of issues and generally wouldn't be fun for the users.
In terms of production, you feel it works ok though?
f
I don't have too much experience with the production setup; I'm pretty confident maintaining my local k8s cluster 🙂 We have some sporadic failures with sync, but overall it seems to be ok.
m
Naively, I feel Argo CD brings a lot to the table in general, but I am not super experienced in using it yet, to be honest.
ok, cool
It's really useful to get feedback from someone who has gone with this type of approach! 🙂
f
you're welcome. Feel free to ask more, would be glad to help.
w
@mammoth-rainbow-82717 , @square-wire-39606 is correct. We have a lot of dataOps before the ML stuff, which is especially true at a B2B company I think, so a general, not-ML-specific orchestrator is useful ... if you want a comparison between MF and a prefect+metaflow pipeline, this repo is a full open-source back-end:
https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat
m
oh wow, you have a whole repo of stuff.
Thanks, that's really useful
w
https://outerbounds-community.slack.com/archives/C020U025QJK/p1638527574109700?thread_ts=1638443025.102700&cid=C020U025QJK 😜 yes, and it's all open, 30M data points included if you want to test the scale, plus a SOTA transformer model for recs 😉 share it if you like it 😉