# ask-metaflow
r
Hey everyone 🙂 I am creating a Flow that needs torch with CUDA acceleration. Using a torch-based Docker image works, but when I combine it with @pypi or @conda, the container runs out of disk space after multiple steps while setting up the environment. Using a slimmer base image (like `python`) and installing torch on top works, but only CPU-based; I haven't managed to make CUDA work that way so far. What's the best practice here? The examples I found either didn't specify the base image or just installed torch with conda, which gave me CPU-only torch when I tried it. I am using Metaflow `--with batch`, based on the standard CloudFormation template.
I increased the size of the root volume in a launch template, which worked. I would still be interested to hear whether others managed to get torch running with CUDA without using a fat Docker image as the base (e.g. pytorch/pytorch:2.3.1-cuda11.8-cudnn8-runtime).
  BatchLaunchTemplateMetaFlow:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: "BatchLaunchTemplateMetaFlow"
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 100
              VolumeType: gp2
  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    DependsOn: BatchLaunchTemplateMetaFlow
    Properties:
      Type: MANAGED
      ServiceRole: !GetAtt 'BatchExecutionRole.Arn'
      ComputeResources:
        MaxvCpus: !Ref MaxVCPUBatch
        SecurityGroupIds:
          - !GetAtt VPC.DefaultSecurityGroup
        Type: !If [EnableFargateOnBatch, 'FARGATE', 'EC2']
        Subnets:
          - !Ref Subnet1
          - !Ref Subnet2
        MinvCpus: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref MinVCPUBatch]
        InstanceRole: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !GetAtt 'ECSInstanceProfile.Arn']
        InstanceTypes: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref ComputeEnvInstanceTypes]
        DesiredvCpus: !If [EnableFargateOnBatch, !Ref AWS::NoValue, !Ref DesiredVCPUBatch]
        LaunchTemplate:
          LaunchTemplateId: !Ref BatchLaunchTemplateMetaFlow
      State: ENABLED
That was the relevant part of the CloudFormation template that I changed, in case someone has similar issues.
a
@ripe-oyster-50903 are you installing `pytorch` from the `pytorch` conda channel? `pytorch::pytorch` as the package name will do the trick.
r
@ancient-application-36103 Yes, I tried that, and while it installs, I still get no CUDA acceleration, only CPU. Installing with pypi works, though, if I use pytorch as the base image. The same doesn't work with conda; there I only get CPU 😄
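Editor's note: a CPU-only result can mean either that a CPU-only torch build was installed or that a CUDA build was installed but no GPU/driver is visible in the container. Below is a minimal diagnostic sketch that could be called from inside a step to tell the two apart. The helper name `cuda_report` is hypothetical (not part of Metaflow or torch); it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()` / `torch.cuda.device_count()`.

```python
import importlib.util


def cuda_report():
    """Summarize whether the torch build in this environment can use CUDA.

    Hypothetical helper for illustration; returns a plain dict so it can be
    printed from inside a step or logged.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,   # CUDA version torch was compiled against
        "cuda_available": False,   # True only if a usable GPU/driver is visible
        "device_count": 0,
    }
    # Avoid an ImportError when torch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, the environment resolved to a CPU-only package, which points at the package selection rather than the instance. If it shows a CUDA version but `cuda_available` is `False`, the build is fine and the problem is more likely the instance type, driver, or container GPU access.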
a
have you tried pytorch-gpu - https://anaconda.org/conda-forge/pytorch-gpu ?
r
Yes. I think it may be related to my local environment and/or being on a Mac: solving the environment for pytorch or pytorch-gpu fails with a message about glibc missing. The environment solving is done locally, from what I understand? I will test whether it works in a Linux environment tomorrow.