# ask-metaflow
s
👋 Hey Folks! Need some advice on the best way to get set up for training models on multiple GPUs (not multi-node yet). Here are the use cases we are currently targeting:
• Run several experiments and track metrics on WandB
• We’d like to monitor the progress of the jobs on the Metaflow UI
• Develop using notebooks and submit longer-running training jobs to the multi-GPU node
• [Nice to have] Ideally, we’d want to seamlessly switch between AWS and GCP
We looked at this: https://outerbounds.com/engineering/deployment/aws-managed/cloudformation/
Curious how much ops overhead we would have for deploying all these services, and if there is a simpler stack that we can deploy to achieve these use cases.
:meow_wave: 2
a
@sparse-dress-4861 ops overhead for the CloudFormation stack at a somewhat reasonable scale should be very minimal.
if you are looking for the minimum possible overhead, at the expense of losing some Metaflow functionality, you can skip deploying the metadata service and the UI - just the S3 bucket and an AWS Batch compute environment (that scales down to 0) should have as close to zero overhead as is feasible in the cloud. Metaflow will store all the tracking information in your local filesystem - so your results wouldn't be shareable across users. cards would still work and can stand in as a very lightweight UI.
the next frontier would be to deploy the metadata service to get tracking capabilities across users. an enhancement on top of that would be to deploy the UI service.
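To make the minimal setup concrete, here is a rough sketch of a flow that would run on just the S3 datastore plus a Batch compute environment, with no metadata service or UI deployed. The flow name and resource numbers are illustrative placeholders, not from the thread.
```python
from metaflow import FlowSpec, step, batch, card


class MinimalTrainFlow(FlowSpec):

    # @card renders an HTML report per task -- a lightweight stand-in for
    # the UI service, which isn't deployed in this minimal setup.
    @card
    # Illustrative resource request; memory is in MB.
    @batch(cpu=2, memory=8000)
    @step
    def start(self):
        self.accuracy = 0.0  # placeholder: real training code would go here
        self.next(self.end)

    @step
    def end(self):
        print("accuracy:", self.accuracy)


if __name__ == "__main__":
    MinimalTrainFlow()
```
With local metadata, `python minimal_train_flow.py run` executes the `start` step on Batch (assuming the Metaflow config points at your S3 bucket and job queue), and `python minimal_train_flow.py card view start` opens the card afterwards - the results just aren't shareable across users until the metadata service is deployed.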
s
hey @ancient-application-36103! Thanks!
> so your results wouldn’t be shareable across users.
ah, we definitely would like to make it shareable across the team. I am assuming it is not too much of an overhead to run the metadata service and the UI? The compute footprint for these services is also fairly low (IIRC)?
a
correct!
s
awesome! We'll give this a try and get back here if we run into any issues
when creating the stack, since we want to provision V100 and A100 GPU instance types, do we need to change anything in the ComputeEnvInstanceTypes field?
a
Yep! You would want to list the instance types that you would like to use (e.g. p3 instances for V100s, p4d.24xlarge for A100s) instead of the defaults that the template sets up
s
perfect!
are these spot instances or dedicated hosts? we need this for AWS service quotas
a
It’s totally up to you - you can configure your batch compute environment as you wish
Getting GPUs in the spot market might be tough
s
I see. would you recommend getting dedicated instances in that case? Also, is GCP better for such use cases?
a
Depending on the instance type, on-demand could work well
GPUs very likely won’t be available on spot anywhere :)
s
we'll try with on-demand. Spot would have been awesome for our wallet though 😃
a
You could create multiple compute environments prioritising spot instances ahead of on-demand when you attach them to the job queue
s
I think on-demand is okay for now. But if we were to add in this policy of prioritizing spot over on-demand, how can we do that in this CloudFormation template?
a
The template is a starter template, it doesn’t have all the bells and whistles - it might be easier to set up the compute environment manually on AWS Batch’s end
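For reference, a rough sketch of what that manual Batch setup could look like with boto3, assuming you have already created two managed compute environments (the names below are hypothetical) - one backed by spot instances and one by on-demand:
```python
import boto3

batch = boto3.client("batch")

# Hypothetical compute environment names; both must already exist.
# Batch fills environments in the listed order, so jobs land on the
# spot-backed environment first and spill over to on-demand once the
# spot environment is at its max vCPU capacity.
batch.create_job_queue(
    jobQueueName="metaflow-gpu-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "metaflow-gpu-spot-ce"},
        {"order": 2, "computeEnvironment": "metaflow-gpu-ondemand-ce"},
    ],
)
```
Pointing METAFLOW_BATCH_JOB_QUEUE (or `@batch(queue=...)`) at this queue should then give your tasks the spot-first behaviour.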
s
got it. we'll use this template to set up the stack and then customize Batch as needed 👍
alright! I have Metaflow set up with AWS Batch. Also ran my first flow on this stack! 🎉
in `@batch`, how do I specify the type of GPU (p3 vs p4, etc.)?
1. is this parameter exposed in `@batch`?
2. also does the “shape” of the resource parameter matter to place the job on the GPU node? Here I specify 2 vCPUs for placement on a GPU node that has 8 vCPUs. I remember previously (at Nflx) we had this problem with GPU nodes on a different compute layer, so wanted to double check.
3. I tried using `metaflow-torchrun` but no luck in getting the `hello` job to succeed. I see warnings on the AWS Batch node. Seems like downloading from PyPI is not working on that node.
```
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f61751a7160>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/awscli/
```
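For context on questions 1 and 2, the resource request under discussion looks roughly like the sketch below. The numbers and queue name are placeholders; `gpu`, `cpu`, `memory`, `queue`, and `image` are the knobs `@batch` exposes - as far as I can tell there is no GPU-type argument, so the usual way to pin a GPU type is to point `queue` at a job queue whose compute environment only contains the desired instance family.
```python
from metaflow import FlowSpec, step, batch


class HelloGPUFlow(FlowSpec):

    # Placeholder values. Requesting gpu=1 / cpu=2 on an 8-vCPU GPU node is
    # the "shape" question from point 2; the queue name is hypothetical and
    # would be backed by, e.g., a p4d-only compute environment for A100s.
    @batch(gpu=1, cpu=2, memory=16000, queue="gpu-a100-queue")
    @step
    def start(self):
        import torch  # assumes the container image ships with torch
        print("cuda available:", torch.cuda.is_available())
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloGPUFlow()
```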
Hey @stale-eve-11739, here is the thread we were chatting about yesterday 🙂 did you run into similar issues while trying this plugin?
s
hey @sparse-dress-4861! I only ran metaflow-torchrun locally, but we plan on kicking off some trial runs on Argo + k8s in the next two weeks (if things go according to plan). I'll let you know how that goes.
from the looks of the error though, it seems like your Batch nodes might not have an internet connection - probably a security setting from the template. IIRC when Metaflow was open-sourced, the free sandbox environment also had outside connections disabled.
one possible workaround is to just bake a Docker image with all your required dependencies, push the image to your internal ECR, and use that as the base/default image for your Metaflow tasks
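A minimal sketch of the pre-baked image suggestion - the ECR URI is a placeholder, and the image is assumed to already contain torch, wandb, and everything else the step needs, so nothing has to be pulled from PyPI at runtime:
```python
from metaflow import FlowSpec, step, batch

# Hypothetical ECR URI for an image pre-built with all training dependencies.
TRAINING_IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/metaflow-train:latest"


class BakedImageFlow(FlowSpec):

    @batch(gpu=1, memory=16000, image=TRAINING_IMAGE)
    @step
    def start(self):
        # real training code would go here; all imports resolve from the image
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    BakedImageFlow()
```
If memory serves, the same image can also be made the default for every Batch task via the METAFLOW_BATCH_CONTAINER_IMAGE config value instead of repeating `image=` on each step.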
s
thanks @stale-eve-11739! let me know how your trials go. Makes sense on the lack of an internet connection for security reasons. We'll look into baking everything into a Docker image, but that seems a bit inconvenient - we want some flexibility in terms of being able to install packages not in the image for different experiments.
s
Yeah that would only be a temporary solution just to see if that resolves your problem. You should update the permissions attached to whatever role you are using
s
Yes been struggling with the permissions bit for the role that we use for this stack. AWS doesn’t make this easy at all. 😕
c
Hey @sparse-dress-4861 happy to help you debug this on a call. We should be able to use the base CUDA or PyTorch images for Metaflow tasks while allowing you to flexibly layer packages on top. Would you like to schedule a time for tomorrow or Thursday to pair program?
s
@crooked-jordan-29960 that would be awesome! Could you share your calendar? I also DM’ed you my calendar for convenience
👍 1