@flaky-plumber-70709 I know you wrote this project last year. Would you be open to PRs to update it to AWS CDK v2? We could do a restart. I want to do some customizations to our Metaflow deployment at work. We're using CDK v2, so I've been writing my own constructs. Here are some thoughts.
Some of these may come across as over-opinionated fighting words. They're just my opinions, and I respect anyone who disagrees.
- Hypothesis: I think the project would have more contributors if it were Python-native. Personally I'm decent at TypeScript but much stronger at Python, and I assume that's true of most developers in this community. I know you sacrifice the ability to export the IaC to all the CDK-supported languages when you don't write it in TypeScript. I'm open to doing TypeScript if it would get more support; we could survey potential contributors and ask their preference.
- Claim: if you're deploying to AWS (not Kubernetes):
  - (a) CDK is the most flexible, user-friendly way to write AWS infrastructure these days. There seems to be an army of developers at AWS constantly adding features, since it's AWS's officially supported tool. In commit frequency it outpaces Pulumi, and it's focused exclusively on AWS. Having infrastructure as actual code is... really nice. And if you need to customize your Metaflow setup to integrate it with your organization, my team has found CDK to be the quickest way to do so. (Aside: much respect to Terraform for its module system.)
  - (b) Second only to raw CloudFormation, CDK makes the fewest assumptions about your environment. With Terraform, you may need to figure out how to manage your state and set up deploy locking. With Pulumi, you pretty much have to use the vendor offering. CDK uses CloudFormation as its deployment mechanism, which is free and requires no prior setup.
- Design ideas: I think the components of a Metaflow IaC library should be as loosely coupled as possible, so that individual pieces can be entirely replaced in different ways.
  - You may want to run flows on your own on-premise GPU machines but use other components (like RDS) in AWS. I think you can achieve that with ECS Anywhere. This could be great for companies/hobbyists/research labs with their own hardware.
  - You may want to protect the UI in different ways: use AWS Cognito to put a login page in front of the Metaflow UI, use a different auth provider like Active Directory or Auth0, or skip auth altogether and just put everything into an existing VPC set up with a VPN.
  - Save lots of money by hosting the SQL database on something that isn't RDS, and the containers on something that isn't ECS. For example, if you were motivated, you could put all of those things on a single AWS Lightsail instance for $10/mo. I could see a research lab doing this... and me, just for fun.
  - Straightforwardly modify the AWS Batch settings to use a private PyPI server. My company definitely has private Python packages with utilities used during training and batch serving. Having that flexibility would make this easier to set up.
  - Mess with networking and IAM permissions as needed to give Metaflow runs access to protected services like a feature store, a tracking server (thinking of MLflow and maybe Optuna), a data warehouse, etc.
In short, I think there are a lot of good reasons that hobbyists, research labs, and businesses might want to customize their Metaflow deployment using AWS CDK. If folks agree, I'd love to collaborate on this. Maybe we could even get it to the point that it moves under the official Metaflow umbrella (I appreciate that it's hard to maintain IaC modules when there are a zillion IaC frameworks out there, so that may not work out).