# ask-metaflow
s
👋 Hey Folks! Need some advice on the best way to get set up for training models on multiple GPUs (not multi-node yet). Here are the use cases we are currently targeting:
• Run several experiments and track metrics on WandB
• We’d like to monitor the progress of the jobs on the Metaflow UI
• Develop using notebooks and submit longer-running training jobs to the multi-GPU node
• [Nice to have] Ideally, we’d want to seamlessly switch between AWS and GCP
We looked at this: https://outerbounds.com/engineering/deployment/aws-managed/cloudformation/
Curious how much ops overhead we would have for deploying all these services, and if there is a simpler stack that we can deploy to achieve these use cases.
:meow_wave: 2
a
@sparse-dress-4861 ops overhead for the CloudFormation stack at a somewhat reasonable scale should be very minimal.
if you are looking for the minimum possible overhead, at the expense of losing some Metaflow functionality, you can skip deploying the metadata service and the UI - just the S3 bucket and an AWS Batch compute environment (that scales down to 0) should have as close to zero overhead as is feasible in the cloud. Metaflow will store all the tracking information in your local filesystem - so your results wouldn't be shareable across users. cards would still work and can stand in as a very lightweight UI.
the next frontier would be to deploy the metadata service to get tracking capabilities across users. an enhancement on top of that would be to deploy the UI service.
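To make the minimal setup concrete, here is a rough sketch of a flow that would run on just the S3 datastore plus a Batch compute environment, with no metadata service or UI deployed. The flow name and resource numbers are illustrative placeholders, not from the thread.
```python
from metaflow import FlowSpec, step, batch, card


class MinimalTrainFlow(FlowSpec):

    # @card renders an HTML report per task -- a lightweight stand-in for
    # the UI service, which isn't deployed in this minimal setup.
    @card
    # Illustrative resource request; memory is in MB.
    @batch(cpu=2, memory=8000)
    @step
    def start(self):
        self.accuracy = 0.0  # placeholder: real training code would go here
        self.next(self.end)

    @step
    def end(self):
        print("accuracy:", self.accuracy)


if __name__ == "__main__":
    MinimalTrainFlow()
```
With local metadata, `python minimal_train_flow.py run` executes the `start` step on Batch (assuming the Metaflow config points at your S3 bucket and job queue), and `python minimal_train_flow.py card view start` opens the card afterwards - the results just aren't shareable across users until the metadata service is deployed.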
s
hey @ancient-application-36103! Thanks!
> so your results wouldn’t be shareable across users.
ah, we definitely would like to make it shareable across the team. I am assuming it is not too much of an overhead to run the metadata service and the UI? The compute footprint for these services is also fairly low (IIRC)?
a
correct!
s
awesome! We'll give this a try and get back here if we run into any issues
when creating the stack, since we want to provision V100 and A100 GPU instance types, do we need to change anything in the ComputeEnvInstanceTypes field?
a
Yep! You would want to list the instance types that you would like to use (e.g. p3 instances for V100s, p4d.24xlarge for A100s) instead of the defaults that the template sets up
s
perfect!
are these spot instances or dedicated hosts? we need this for AWS service quotas
a
It’s totally up to you - you can configure your batch compute environment as you wish
Getting GPUs in the spot market might be tough
s
I see. would you recommend getting dedicated instances in that case? Also, is GCP better for such use cases?
a
Depending on the instance type, on-demand could work well
GPUs very likely won’t be available on spot anywhere :)
s
we'll try with on-demand. Spot would have been awesome for our wallet though 😃
a
You could create multiple compute environments prioritising spot instances ahead of on-demand when you attach them to the job queue
s
I think on-demand is okay for now. But if we were to add in this policy of prioritizing spot over on-demand, how can we do that in this CloudFormation template?
a
The template is a starter template, it doesn’t have all the bells and whistles - it might be easier to set up the compute environment manually on AWS Batch’s end
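For reference, a rough sketch of what that manual Batch setup could look like with boto3, assuming you have already created two managed compute environments (the names below are hypothetical) - one backed by spot instances and one by on-demand:
```python
import boto3

batch = boto3.client("batch")

# Hypothetical compute environment names; both must already exist.
# Batch fills environments in the listed order, so jobs land on the
# spot-backed environment first and spill over to on-demand once the
# spot environment is at its max vCPU capacity.
batch.create_job_queue(
    jobQueueName="metaflow-gpu-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "metaflow-gpu-spot-ce"},
        {"order": 2, "computeEnvironment": "metaflow-gpu-ondemand-ce"},
    ],
)
```
Pointing METAFLOW_BATCH_JOB_QUEUE (or `@batch(queue=...)`) at this queue should then give your tasks the spot-first behaviour.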
s
got it. we'll use this template to set up the stack and then customize Batch as needed 👍
alright! I have Metaflow set up with AWS Batch. Also ran my first flow on this stack! 🎉
in `@batch`, how do I specify the type of GPU (p3 vs p4, etc.)?
1. is this parameter exposed in `@batch`?
2. also does the “shape” of the resource parameter matter to place the job on the GPU node? Here I specify 2 vCPUs for placement on a GPU node that has 8 vCPUs. I remember previously (at Nflx) we had this problem with GPU nodes on a different compute layer, so wanted to double check.
3. I tried using `metaflow-torchrun` but no luck in getting the `hello` job to succeed. I see warnings on the AWS Batch node. Seems like downloading from PyPI is not working on that node.
```
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f61751a7160>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/awscli/
```
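For context on questions 1 and 2, the resource request under discussion looks roughly like the sketch below. The numbers and queue name are placeholders; `gpu`, `cpu`, `memory`, `queue`, and `image` are the knobs `@batch` exposes - as far as I can tell there is no GPU-type argument, so the usual way to pin a GPU type is to point `queue` at a job queue whose compute environment only contains the desired instance family.
```python
from metaflow import FlowSpec, step, batch


class HelloGPUFlow(FlowSpec):

    # Placeholder values. Requesting gpu=1 / cpu=2 on an 8-vCPU GPU node is
    # the "shape" question from point 2; the queue name is hypothetical and
    # would be backed by, e.g., a p4d-only compute environment for A100s.
    @batch(gpu=1, cpu=2, memory=16000, queue="gpu-a100-queue")
    @step
    def start(self):
        import torch  # assumes the container image ships with torch
        print("cuda available:", torch.cuda.is_available())
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    HelloGPUFlow()
```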
Hey @stale-eve-11739, here is the thread we were chatting about yesterday 🙂 did you run into similar issues while trying this plugin?
s
hey @sparse-dress-4861! I only ran metaflow-torchrun locally, but we plan on kicking off some trial runs on Argo + k8s in the next two weeks (if things go according to plan). I'll let you know how that goes.
from the looks of the error though, it seems like your Batch nodes might not have an internet connection - probably a security setting from the template. IIRC when Metaflow was open-sourced, the free sandbox environment also had outside connections disabled.
one possible workaround is to just bake a Docker image with all your required dependencies, push the image to your internal ECR, and use that as the base/default image for your Metaflow tasks
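A minimal sketch of the pre-baked image suggestion - the ECR URI is a placeholder, and the image is assumed to already contain torch, wandb, and everything else the step needs, so nothing has to be pulled from PyPI at runtime:
```python
from metaflow import FlowSpec, step, batch

# Hypothetical ECR URI for an image pre-built with all training dependencies.
TRAINING_IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/metaflow-train:latest"


class BakedImageFlow(FlowSpec):

    @batch(gpu=1, memory=16000, image=TRAINING_IMAGE)
    @step
    def start(self):
        # real training code would go here; all imports resolve from the image
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    BakedImageFlow()
```
If memory serves, the same image can also be made the default for every Batch task via the METAFLOW_BATCH_CONTAINER_IMAGE config value instead of repeating `image=` on each step.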
s
thanks @stale-eve-11739! let me know how your trials go. Makes sense on the lack of an internet connection for security reasons. We'll look into baking everything into a Docker image, but that seems a bit inconvenient - we want some flexibility in terms of being able to install packages not in the image for different experiments.
s
Yeah that would only be a temporary solution just to see if that resolves your problem. You should update the permissions attached to whatever role you are using
s
Yes been struggling with the permissions bit for the role that we use for this stack. AWS doesn’t make this easy at all. 😕
c
Hey @sparse-dress-4861 happy to help you debug this on a call. We should be able to use the base CUDA or PyTorch images for Metaflow tasks while allowing you to flexibly layer packages on top. Would you like to schedule a time for tomorrow or Thursday to pair program?
s
@crooked-jordan-29960 that would be awesome! Could you share your calendar? I also DM’ed you my calendar for convenience
👍 1