fresh-laptop-72652 · 02/23/2023, 6:48 AM
…from @batch to @kubernetes, and likewise from SFN to Argo (if all goes well) – with features like event triggering becoming available to workflows using/migrated onto the EKS stack of tools, while continuing to have a shared lineage/metadata backend.
I took that eks_argo terraform example for a spin first, as a quick isolated deployment to get a bit of exposure as someone who isn't super well versed in the ways of terraform and kube. Sharing some notes/observations from along the journey in case others find them useful:
• ❤️ all the thorough readmes across the infra repos!
• 🐛 terraform plan spit out an error about a missing with_public_ip variable related to the metadata service. Defined the bool in a variables.tf with a value of false to continue on (variables.tf sketch after the list).
• 🐛 The EKS cluster that's deployed uses k8s version 1.21, which is past its end of life (ended on Feb 15, 2023).
• 🚀 Smooth sailing through the readme instructions, and after swapping in the new ~/.metaflowconfig and whipping up a test pyenv locally, was able to run a sample flow using @kubernetes + view it in the Argo UI via port forwarding.
• ✨ Added in the metaflow k8s config params METAFLOW_KUBERNETES_CONTAINER_IMAGE and METAFLOW_KUBERNETES_CONTAINER_REGISTRY to point to the same docker image and ECR we use with Batch – was able to run a slightly more involved flow via @kubernetes.
• 🤔 Attempted to tweak the ~/.metaflowconfig we use on our development EC2 instances, appending the k8s params to supplement our main metaflow deployment as a quick test – unsurprisingly failed due to a missing ~/.kube/config, followed by another failure due to insufficient permissions (Unable to launch Kubernetes job. Unauthorized).
◦ Seems to be due to running on an EC2 instance with an IAM role different from the one used to create the cluster (an admin IAM role that I specified via the terraform aws provider profile param).
◦ To help debug where the permissions fell off the rails, installed kubectl on the dev EC2 instance and gave kubectl get serviceaccounts a go, which failed with error: You must be logged in to the server – helping pin down that the instance had no ability to talk to the EKS cluster at all.
• 😨 Went down a rabbit hole trying to understand k8s auth, which for EKS is handled by the aws-auth configmap
◦ Exported the current configmap via kubectl get configmap aws-auth -n kube-system -o yaml
◦ Updated the yaml with a rolearn entry for the developer IAM role, and applied it to the cluster via kubectl apply -f aws-auth-cm.yaml
◦ Lil bit of progress... new error after attempting to run a flow, with slightly more info to suggest that cluster connectivity was now good but the mapped user still lacked Kubernetes RBAC permissions
▪︎ Unable to launch Kubernetes job. jobs.batch is forbidden: User "..." cannot create resource "jobs" in API group "batch" in the namespace "default"
• 🤔 Thought surely terraform has a common way to handle adding the IAM role as part of the EKS module, which it does in newer versions via manage_aws_auth_configmap = true alongside aws_auth_roles (module sketch after the list).
◦ Tried to upgrade the terraform module versions and EKS cluster major version, finding that there are many, many breaking changes in the terraform eks/aws module when going from 17.x to the latest 19.x. Some are quite small, like tweaks to variable names; others are fairly substantial: new ways to declare node groups, changes to default values, new EKS cluster addon support, better auth mechanisms, etc.
◦ EKS cluster versions can only be upgraded one minor version at a time, so going from 1.21 to the latest 1.24 would be a bit tedious for this quick n dirty POC.
• ✅ All in all, a successful quick exploration into the Metaflow k8s/Argo stack and plenty of learnings along the way.
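For anyone hitting the same with_public_ip error, this is roughly the stopgap declaration I dropped into variables.tf – the variable name comes straight from the plan error, while the description is just my guess at its intent:

```
# Stopgap for the missing variable `terraform plan` complained about.
# false keeps the metadata service off a public IP (assumed intent).
variable "with_public_ip" {
  description = "Whether to expose the metadata service via a public IP (assumed)"
  type        = bool
  default     = false
}
```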
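And on the aws-auth front: rather than hand-editing the configmap with kubectl, the newer terraform-aws-modules/eks versions can manage it for you. A rough sketch of what I believe that looks like on 19.x – untested, with placeholder names and ARNs:

```
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "metaflow-eks"  # placeholder
  cluster_version = "1.24"

  # Let the module own the aws-auth configmap instead of kubectl edits
  # (requires the kubernetes provider to be configured for this cluster).
  manage_aws_auth_configmap = true

  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::111122223333:role/dev-ec2-role"  # placeholder
      username = "dev-ec2-role"
      groups   = ["system:masters"]  # broad for a POC; scope down for real use
    },
  ]
}
```

Mapping the role into a group with sufficient RBAC (or binding a Role that allows creating jobs.batch) is what should clear that forbidden error from earlier.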
Next steps and open questions:
• Given the existing infra in our AWS-native metaflow stack, seems fairly straightforward to create and wire up the eks/argo terraform components needed to supplement it.
◦ Does it make sense to go ahead and use all the latest-and-greatest terraform/kube versions, or are there reasons to pin versions for compatibility?
◦ Hopefully I can share a minimal terraform EKS/Argo supplementary example that others can reference if they have existing AWS-native metaflow stacks.
• Need to better understand EKS Node Groups and how the myriad of scheduling/placement mechanisms work – node selectors, affinity, taints, tolerations mind blown
◦ How those relate to GPU instances – seems like the EKS autoscaler will auto-label those nodes as having nvidia accelerators and then node selectors deal with most of it?
◦ How those relate to spot instances – similarly, looks like they'll be auto-labeled with eks.amazonaws.com/capacityType: SPOT, which can work alongside those scheduling mechanisms.
◦ Feels like it's heading in the right direction to create multiple EKS Managed Node Groups that are equivalent to existing AWS Batch Compute Environments to support different types of workloads? (rough sketch below)
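To make that last point concrete, here's roughly what I'm picturing – an untested sketch (all names, instance types, and sizes are placeholders) of eks_managed_node_groups on the 19.x module, mirroring a couple of Batch compute environments:

```
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "metaflow-eks"  # placeholder
  cluster_version = "1.24"

  eks_managed_node_groups = {
    # Rough stand-in for an on-demand GPU Batch compute environment.
    gpu = {
      instance_types = ["g4dn.xlarge"]  # placeholder
      ami_type       = "AL2_x86_64_GPU"
      min_size       = 0
      max_size       = 4
      desired_size   = 0

      # Taint so only pods that explicitly tolerate GPUs land here.
      taints = {
        gpu = {
          key    = "nvidia.com/gpu"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }

    # Rough stand-in for a spot CPU compute environment; EKS labels these
    # nodes with eks.amazonaws.com/capacityType: SPOT automatically.
    cpu_spot = {
      instance_types = ["m5.xlarge", "m5a.xlarge"]  # placeholders
      capacity_type  = "SPOT"
      min_size       = 0
      max_size       = 10
      desired_size   = 2
    }
  }
}
```

Pods would then opt in via node selectors/tolerations, which seems to map pretty naturally onto the Batch job-queue → compute-environment pattern.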
kube 🚀

average-beach-28850 · 02/23/2023, 7:53 PM
…with_public_ip thing and update cluster version. They release new ones quite frequently