# dev-metaflow
f
Been reading up on k8s/EKS and looking over the Metaflow `eks_argo` terraform example in preparation for all the upcoming Argo Events goodies on the horizon :excited: My hope is to deploy the EKS stack side-by-side with an existing AWS-native Metaflow stack, allowing people to gradually shift workflows from `@batch` to `@kubernetes`, and likewise from SFN to Argo (if all goes well) – with features like event triggering being available to workflows using/migrated onto the EKS stack of tools, while continuing to have a shared lineage / metadata backend. I took that `eks_argo` terraform example for a spin first, as a quick isolated deployment to get a bit of exposure as someone who isn't super well versed in the ways of terraform and kube. Sharing some notes/observations along the journey in case others find it useful:
• ❤️ all the thorough readmes across the infra repos!
• 🐛 `terraform plan` spit out an error about a missing `with_public_ip` variable related to the metadata service. Defined the bool in a `variables.tf` with a value of `false` to continue on.
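For reference, roughly what that stopgap looked like (the variable name comes straight from the plan error; the description is just my guess at its intent):
```
# variables.tf – stopgap definition for the variable terraform plan complained about
variable "with_public_ip" {
  description = "Whether the metadata service is exposed with a public IP (assumed intent)"
  type        = bool
  default     = false
}
```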
• 🐛 The EKS cluster that's deployed uses k8s version `1.21`, which is past its end of life (ended on Feb 15, 2023).
• 🚀 Smooth sailing through the readme instructions, and after swapping in the new `~/.metaflowconfig` and whipping up a test pyenv locally, was able to run a sample flow using `@kubernetes` + view in Argo UI via port forwarding
• Added in the metaflow k8s config params for `METAFLOW_KUBERNETES_CONTAINER_IMAGE` and `METAFLOW_KUBERNETES_CONTAINER_REGISTRY` to point to the same docker image and ECR we use with Batch – was able to run a slightly more involved flow via `@kubernetes`
• 🤔 Attempted to tweak the `~/.metaflowconfig` we use on our development EC2 instances, appending in the k8s params to supplement our main metaflow deployment as a quick test – unsurprisingly failed due to a missing `~/.kube/config`, followed by another failure due to insufficient permissions (`Unable to launch Kubernetes job.  Unauthorized`)
  ◦ Seems to be due to running on an EC2 instance with an IAM Role that is different from the one used to create the cluster (an admin IAM role that I specified via the terraform aws provider `profile` param)
  ◦ To help debug where the permissions fell off the rails, installed `kubectl` on the dev EC2 instance and gave `kubectl get serviceaccounts` a go, which failed with `error: You must be logged in to the server` – that helped pin down that there was no ability to talk to the EKS cluster at all.
• 😨 Went down a rabbit hole trying to understand k8s auth, which seems to be handled by the EKS `aws-auth` configmap
  ◦ Exported the current configmap via `kubectl get configmap aws-auth -n kube-system -o yaml`
  ◦ Updated the yaml with a `rolearn` for the developer IAM role, and applied it to the cluster via `kubectl apply -f aws-auth-cm.yaml`
  ◦ Lil bit of progress... new error after attempting to run a flow, with slightly more info to suggest that cluster connectivity was good but other IAM permissions were now lacking
    ▪︎ `Unable to launch Kubernetes job.  jobs.batch is forbidden: User "..." cannot create resource "jobs" in API group "batch" in the namespace "default"`
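In case it helps anyone else, the same `aws-auth` tweak can also be expressed in terraform rather than hand-editing yaml. Just a sketch: the account ID and role name are made up, `system:masters` is far broader than you'd want long-term, and it assumes a `kubernetes` provider already wired up to the cluster:
```
# Sketch: manage the aws-auth configmap data from terraform (kubernetes provider >= 2.10)
resource "kubernetes_config_map_v1_data" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Note: this replaces the whole mapRoles key, so the existing node-group role
    # mappings would need to be included here too, not just the new dev role.
    mapRoles = yamlencode([
      {
        rolearn  = "arn:aws:iam::111122223333:role/dev-ec2-instance-role" # hypothetical dev role
        username = "dev-ec2"
        groups   = ["system:masters"] # wide open; a scoped RBAC group would be better
      }
    ])
  }

  force = true # take over field ownership from whatever created aws-auth originally
}
```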
• 🤔 Thought surely terraform has a common way to handle adding the IAM role as part of the EKS module, which it does seem to have in newer versions via `manage_aws_auth_configmap = true` alongside `aws_auth_roles`
  ◦ Tried to upgrade the terraform module versions and the EKS cluster major version, finding that there are many, many breaking changes across terraform `eks/aws` when going from version `17.xx` to the latest `19.xx`. Some are quite small, like tweaks to variable names; others are fairly substantial: new ways to declare node groups, changes to default values, new EKS cluster addon support, better auth mechanisms, etc.
  ◦ EKS cluster versions can only be upgraded one increment at a time, so going from `1.21` to the latest `1.24` would be a bit tedious for this quick-n-dirty POC.
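For completeness, here's what the newer-module route looks like: a sketch against `terraform-aws-modules/eks/aws` `~> 19.x`, with a hypothetical role ARN and the rest of the module config omitted:
```
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  # ...cluster name, k8s version, VPC/subnets, node groups, etc. omitted...

  # Let the module manage aws-auth instead of kubectl-editing it by hand
  manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::111122223333:role/dev-ec2-instance-role" # hypothetical dev role
      username = "dev-ec2"
      groups   = ["system:masters"] # illustrative; a tighter RBAC group is preferable
    }
  ]
}
```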
• All in all, a successful quick exploration into the Metaflow k8s/Argo stack and plenty of learnings along the way.

Next steps and open questions:
• Given the existing infra in our AWS-native metaflow stack, it seems fairly straightforward to create and wire up the eks/argo terraform components needed to supplement it.
  ◦ Does it make sense to go ahead and use all the latest-and-greatest terraform/kube versions, or are there reasons to restrict versions for compatibility?
  ◦ Hopefully I can share a minimal terraform EKS/Argo supplementary example that others can reference if they have existing AWS-native metaflow stacks.
• Need to better understand EKS Node Groups and how the myriad of scheduling/placement mechanisms work - node selectors, affinity, taints, tolerations :mind-blown: (rough sketch after this list)
  ◦ How those relate to GPU instances - seems like the EKS autoscaler will auto-label those nodes as having nvidia accelerators, and then node selectors deal with most of it?
  ◦ How those relate to spot instances - similarly, looks like they'll be auto-labeled with `eks.amazonaws.com/capacityType: SPOT`, which can work alongside those scheduling mechanisms.
  ◦ Feels like it's heading in the right direction to create multiple EKS Managed Node Groups that are equivalent to existing AWS Batch Compute Environments to support different types of workloads? :kube: 🚀
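To make that last idea a bit more concrete, a rough sketch of a couple of Batch-compute-environment-style node groups in the v19 `eks/aws` module. Group names, instance types, sizes, labels, and taint values are all placeholders:
```
# Inside the terraform-aws-modules/eks/aws (~> 19.x) module block; values are illustrative
eks_managed_node_groups = {
  general = {
    instance_types = ["m5.2xlarge"]
    capacity_type  = "ON_DEMAND"
    min_size       = 0
    max_size       = 10
    desired_size   = 1
  }

  gpu_spot = {
    instance_types = ["g4dn.xlarge"]
    capacity_type  = "SPOT"            # nodes get the eks.amazonaws.com/capacityType=SPOT label
    ami_type       = "AL2_x86_64_GPU"  # GPU-enabled EKS AMI
    min_size       = 0
    max_size       = 5
    desired_size   = 0

    labels = {
      workload = "gpu" # placeholder label to target via node selectors
    }

    # Keep non-GPU pods off these nodes unless they tolerate the taint
    taints = {
      gpu = {
        key    = "nvidia.com/gpu"
        value  = "true"
        effect = "NO_SCHEDULE"
      }
    }
  }
}
```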
🔥 7
a
very cool, thanks for the awesome write up! yes, we should fix the `with_public_ip` thing and update the cluster version. They release new ones quite frequently
🙏 1
❤️ 1
and we should put some pointers in the doc about the aws config map stuff, it's a common gotcha with EKS. And yeah, that EKS tf module made things more confusing. They actually decided at some point that it should not manage that config map at all (I think it was in module version 18.0) and removed the option. But then they rowed back on that and added an option back..
🙌 1
c
@fresh-laptop-72652 What is the perceived benefit you were looking to gain with a shift from SFN/Batch to Argo/k8s? I am just curious. Some of my guesses are:
• Cost - I don't really have a good guess which method is most costly
• Portability between clouds
f
Neither of those primarily 😛 The main motivation is simply having access to the upcoming Metaflow features for event triggering via Argo Events - https://docs.google.com/document/d/1liTvpACWKioCSQTUv5iO3g2AKuLu4x3EYFwEl43WAZU/edit#heading=h.lzd4f7btdmb6
🙌🏽 1
🙌 3
s
thanks for sharing - this is super useful 🙌 🤗
m
thank you for the write-up, Russel. We're looking at going from Batch/SFN to EKS/Argo in the future as well. One way to look at K8s adoption in the context of Metaflow: be prepared for the amount of effort/experience it takes to manage EKS/GKE intricacies, independent of any actual ML. IMO if your org has no other need for K8s, it can be a tough sell.