Outerbounds #ask-metaflow

ripe-oyster-50903

07/24/2024, 8:53 AM

Hey everyone 🙂 I am creating a Flow that needs torch with CUDA acceleration. Using a torch-based Docker image works, but when I combine it with @pypi or @conda, the Docker container runs out of disk space after multiple steps while setting up the environment. Using a slimmer base image (like python) and installing torch works, but only CPU-based; I haven't managed to make CUDA work so far. What's the best practice here? The examples I found either didn't specify the base image or just installed torch with conda, which results in CPU-only for me when I tried it out. I am using metaflow --with batch based on the standard CloudFormation template.

✅ 1

ripe-oyster-50903

07/24/2024, 10:54 AM

I increased the size of the root volume in a launch template, which worked. I would still be interested to hear whether others have managed to get torch running with CUDA without using a fat Docker image as the base (e.g. pytorch/pytorch:2.3.1-cuda11.8-cudnn8-runtime).

ripe-oyster-50903

07/24/2024, 10:55 AM

  BatchLaunchTemplateMetaFlow:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: "BatchLaunchTemplateMetaFlow"
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 100
              VolumeType: gp2
  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    DependsOn: BatchLaunchTemplateMetaFlow
    Properties:
      Type: MANAGED
      ServiceRole: !GetAtt 'BatchExecutionRole.Arn'
      ComputeResources:
        MaxvCpus: !Ref MaxVCPUBatch
        SecurityGroupIds:
          - !GetAtt VPC.DefaultSecurityGroup
        Type: !If [EnableFargateOnBatch, 'FARGATE', 'EC2']
        Subnets:
          - !Ref Subnet1
          - !Ref Subnet2
        MinvCpus: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref MinVCPUBatch]
        InstanceRole: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !GetAtt 'ECSInstanceProfile.Arn']
        InstanceTypes: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref ComputeEnvInstanceTypes]
        DesiredvCpus: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref DesiredVCPUBatch]
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplateMetaFlow
      State: ENABLED

That was the relevant part of the CloudFormation template that I changed, in case someone has similar issues.

ancient-application-36103

07/24/2024, 4:13 PM

@ripe-oyster-50903 are you installing pytorch from the pytorch conda channel? pytorch::pytorch as the package name will do the trick.

ripe-oyster-50903

07/24/2024, 4:18 PM

@ancient-application-36103 Yes, I tried that, and while it works I still get no CUDA acceleration, only CPU. Installing with pypi works, though, if I use pytorch as a base image. The same doesn't work with conda; there I only get CPU 😄
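
A quick way to tell a CPU-only build apart from a CUDA build that simply cannot see a GPU is to inspect torch from inside the step. The following is a minimal sketch, not something posted in the thread; the helper name `torch_cuda_report` is made up for illustration. It relies on `torch.version.cuda` (which is `None` on CPU-only builds) and `torch.cuda.is_available()`.

```python
import importlib.util


def torch_cuda_report():
    """Summarize whether the installed torch build can use CUDA.

    Hypothetical helper for illustration; call it inside the remote step.
    """
    if importlib.util.find_spec("torch") is None:
        # torch is not importable in this environment at all
        return {"torch_installed": False, "cuda_build": None, "cuda_available": False}

    import torch

    return {
        "torch_installed": True,
        # None for CPU-only builds, a version string for CUDA builds
        "cuda_build": torch.version.cuda,
        # True only if a CUDA build also finds a usable GPU and driver
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_cuda_report())
```

If `cuda_build` is `None`, the resolver picked a CPU-only package, so the fix lies in the package selection. If `cuda_build` is set but `cuda_available` is `False`, the package is fine and the container likely has no access to a GPU or driver.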

ancient-application-36103

07/24/2024, 4:19 PM

have you tried pytorch-gpu - https://anaconda.org/conda-forge/pytorch-gpu ?

ripe-oyster-50903

07/24/2024, 9:19 PM

Yes, I think it may be related to my local environment and/or being on Mac (pytorch-(gpu) fails while trying to solve the environment, with a message about glibc missing). This environment-solving process is done locally, from what I understand? I will try to test whether it works on a Linux environment tomorrow.
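
For context on the glibc message: conda-style solvers treat `__glibc` and `__cuda` as virtual packages detected from the machine running the solve, and a Mac reports neither. Conda documents the environment variables `CONDA_OVERRIDE_GLIBC` and `CONDA_OVERRIDE_CUDA` for declaring them by hand. The sketch below only builds such an environment. It is an assumption, not something confirmed in this thread, that the solver Metaflow invokes picks these variables up. The helper name `solver_env` and the version values are illustrative.

```python
import os


def solver_env(glibc="2.27", cuda="11.8", base=None):
    """Return a copy of the environment with conda virtual-package overrides.

    Hypothetical helper; the default versions are placeholders and should
    match what the target Batch instances actually provide.
    """
    env = dict(os.environ if base is None else base)
    # Pretend the solving machine has this glibc version (absent on macOS)
    env["CONDA_OVERRIDE_GLIBC"] = glibc
    # Pretend the solving machine has a CUDA driver supporting this version
    env["CONDA_OVERRIDE_CUDA"] = cuda
    return env


if __name__ == "__main__":
    overrides = solver_env(base={})
    print(overrides)
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the flow, or the same two variables can be exported in the shell before running it.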

👍🏼 1
