# ask-metaflow
Hi everyone... fairly stumped here. Also working on running Metaflow in a local cluster, as @powerful-knife-41200 just mentioned. I have the Metaflow metadata service hosted in AWS and most of our data is in AWS, but we have some GPU servers on prem that we are trying to get working. The current final error is:
```
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://edflows3tesas58h.s3.us-gov-west-1.amazonaws.com/metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397"
Failed to download code package from s3://edflows3tesas58h/metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397 after 6 tries.
```
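(For anyone debugging a similar mismatch: the URL in that error is the virtual-hosted-style GovCloud endpoint for the bucket. A tiny hypothetical helper, not part of Metaflow, to sanity-check which URL a given bucket/region/key should resolve to:)

```python
def s3_object_url(bucket: str, key: str, region: str = "us-gov-west-1") -> str:
    """Build the virtual-hosted-style S3 URL for an object.

    Useful for eyeballing whether the endpoint your client hits matches
    the partition (GovCloud vs. commercial) the bucket actually lives in.
    """
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
```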
In the debug logs, I see that it is getting a 404 when calling HEAD for this object:
```
DEBUG:urllib3.connectionpool:https://edflows3tesas58h.s3.us-gov-west-1.amazonaws.com:443 "HEAD /metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397 HTTP/1.1" 404 0
```
When I look in S3, I see that this object exists and its creation time is right around when the failed HEAD call happened. However, prior to this it was able to PUT/HEAD/GET a number of objects; just before the failed call, I see it successfully GET `/metaflow/TestNvidiaSmi/data/4c/4ca058df2ea422cca260c585409d6ac9face7ebe`.
Not sure if this is related, but it seems like it always retries 6 times, no matter what I set for `METAFLOW_S3_RETRY_COUNT` in the `@environment` decorators, in the calling environment, or in the Metaflow config. Any help is appreciated!
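(For context, the bounded-retry behavior described above can be sketched as a simple loop. This is a hypothetical stand-in for Metaflow's internal code-package fetch, not its actual implementation; if `METAFLOW_S3_RETRY_COUNT` were being honored, it would control the `retries` argument here and change how many attempts appear in the logs before the final failure:)

```python
import time


def fetch_with_retries(fetch, retries=6, base_delay=0.0):
    """Try `fetch` up to `retries` times with exponential backoff,
    raising once the attempts are exhausted."""
    last_err = None
    for attempt in range(retries):
        try:
            return fetch()
        except IOError as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Failed to download after {retries} tries") from last_err
```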
I did try the `@retry` decorator, and interestingly it still fails with the same error (different key, of course) even on the subsequent attempts, and I have verified that the object does exist by then. I'm trying this again with debug logging to see if there is any difference in the calls/errors during the retries. I left out the somewhat important point that my Metaflow metadata service is running in GovCloud.
One other thing is that I see some 404s for files like `metaflow/TestNvidiaSmi/237/nvidia_smi/645/2.task_stderr.log`, which don't exist, but `metaflow/TestNvidiaSmi/237/nvidia_smi/645/2.runtime_stderr.log` does exist, along with these other names:
- 2.attempt.json
- 2.runtime
- 2.runtime_stdout.log
I’ve had some interesting connection issues with GovCloud this week. This looks like it’s a permission issue, either on the role or on your S3 bucket. I haven’t been able to fully figure out the errors yet. Are you using the EKS deploy?
@important-rainbow-87332 What I don't get is how it is clearly putting the files there, and it has `s3:*` permissions, so there is no reason it can't get them as well. There is just one role in play here, and the job is just a simple test job:
```python
@environment(
    vars={
        "AWS_DEFAULT_REGION": "us-gov-west-1",
        "AWS_REGION": "us-gov-west-1",
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "METAFLOW_S3_RETRY_COUNT": "10",
        "NVIDIA_VISIBLE_DEVICES": "all",
    }
)
@kubernetes(
    image="artifactory.roboticresearch.com/docker/dltp:test",
    image_pull_policy="Never",
    gpu=1,
    tolerations=[
        {
            "key": "node-role.kubernetes.io/control-plane",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
)
@resources(gpu=1)
@retry()
@step
def nvidia_smi(self):
    """Make sure that we can run nvidia-smi."""
    subprocess.run(["/usr/bin/nvidia-smi", "-q"])
    self.next(self.end)
```
The error says it's failing to download the code package, which happens before the step is run, so setting those `AWS_*` environment variables doesn't apply. I'm not really familiar with K8s, but whatever role is being used to run the container is the one that downloads the code package. Are you able to access the file directly using that role?
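(One quick way to see what the container itself has at start, before any decorators run, is to check which credential-related variables are visible. This is a hypothetical helper, not part of Metaflow; you would run it inside the pod, e.g. via `kubectl exec`:)

```python
import os


def credential_report(env=None):
    """Report which AWS credential-related variables are visible.

    The code-package download uses whatever credentials the pod has at
    start, so these must be present before any @environment decorator
    is applied to the step.
    """
    env = os.environ if env is None else env
    keys = (
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_DEFAULT_REGION",
        "AWS_WEB_IDENTITY_TOKEN_FILE",  # set when IRSA / web identity is used
    )
    return {k: ("set" if env.get(k) else "missing") for k in keys}


if __name__ == "__main__":
    for name, status in credential_report().items():
        print(name, status)
```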
@incalculable-xylophone-18959 I’m experiencing similar deployment issues with EKS in GovCloud. Could you share which EKS version you’re using? Have you identified the root cause of the problem?
@important-rainbow-87332 Well, I don't think the errors above are GovCloud related; they come from trying to have an on-prem Metaflow cluster talk to an AWS-hosted metadata service.
@incalculable-xylophone-18959 I’m fully on GovCloud, and a new cluster is being deployed. I just found a fix for the error on my end. There were two issues: I had to update the S3 endpoint to be GovCloud-specific rather than the default from the example, and Argo’s IAM role had neither S3 permissions nor KMS permissions.
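(For reference, the endpoint half of that fix can be sketched as Metaflow config settings. The values below are illustrative placeholders, and the exact config key names, e.g. `METAFLOW_S3_ENDPOINT_URL`, are assumptions to verify against your Metaflow version; the missing S3/KMS permissions on Argo's IAM role would be fixed separately in IAM policy:)

```python
import json

# Hedged sketch of GovCloud-specific Metaflow settings; values are
# placeholders, not a definitive deployment config.
govcloud_config = {
    # Point the S3 client at the GovCloud partition endpoint instead of
    # the commercial default.
    "METAFLOW_S3_ENDPOINT_URL": "https://s3.us-gov-west-1.amazonaws.com",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://edflows3tesas58h/metaflow",
    "METAFLOW_DEFAULT_DATASTORE": "s3",
}


def write_metaflow_config(path, cfg):
    """Write settings to a Metaflow config file
    (normally ~/.metaflowconfig/config.json)."""
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```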
Nice! I do remember having to do that as well. It's just been a lot of troubleshooting, so it hasn't been clear what the necessary steps are.