# ask-metaflow
Hi everyone... fairly stumped here. Also working on running Metaflow in a local cluster, as @powerful-knife-41200 just mentioned. I have the Metaflow metadata service hosted in AWS and most of our data is in AWS, but we have some GPU servers on prem that we are trying to get working. The current final error is:
```
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://edflows3tesas58h.s3.us-gov-west-1.amazonaws.com/metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397"
Failed to download code package from s3://edflows3tesas58h/metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397 after 6 tries.
```
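(For anyone debugging a similar mismatch: the URL in that error is the virtual-hosted-style GovCloud endpoint for the bucket. A tiny hypothetical helper, not part of Metaflow, to sanity-check which URL a given bucket/region/key should resolve to:)

```python
def s3_object_url(bucket: str, key: str, region: str = "us-gov-west-1") -> str:
    """Build the virtual-hosted-style S3 URL for an object.

    Useful for eyeballing whether the endpoint your client hits matches
    the partition (GovCloud vs. commercial) the bucket actually lives in.
    """
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
```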
In the debug logs, I see that it is getting a 404 when calling HEAD for this object:
```
DEBUG:urllib3.connectionpool:https://edflows3tesas58h.s3.us-gov-west-1.amazonaws.com:443 "HEAD /metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397 HTTP/1.1" 404 0
```
When I look in S3, I see that this object exists and its creation time is right around when the failed HEAD call happened. However, prior to this it was able to PUT/HEAD/GET a number of objects; just before the failed call, I see it successfully GET `/metaflow/TestNvidiaSmi/data/4c/4ca058df2ea422cca260c585409d6ac9face7ebe`.
Not sure if this is related, but it seems like it always retries 6 times, no matter what I set for `METAFLOW_S3_RETRY_COUNT` in the `@environment` decorators, in the calling environment, or in the Metaflow config. Any help is appreciated!
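(For context, the bounded-retry behavior described above can be sketched as a simple loop. This is a hypothetical stand-in for Metaflow's internal code-package fetch, not its actual implementation; if `METAFLOW_S3_RETRY_COUNT` were being honored, it would control the `retries` argument here and change how many attempts appear in the logs before the final failure:)

```python
import time


def fetch_with_retries(fetch, retries=6, base_delay=0.0):
    """Try `fetch` up to `retries` times with exponential backoff,
    raising once the attempts are exhausted."""
    last_err = None
    for attempt in range(retries):
        try:
            return fetch()
        except IOError as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Failed to download after {retries} tries") from last_err
```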
I did try the `@retry` decorator, and interestingly it still fails with the same error (different key, of course) even on the subsequent attempts, and I have verified that the object does exist by then. I'm trying this again with debug logging to see if there is any difference in the calls/errors during the retries. I left out the somewhat important point that my Metaflow metadata service is running in GovCloud.
One other thing is that I see some 404s for files like `metaflow/TestNvidiaSmi/237/nvidia_smi/645/2.task_stderr.log`, which don't exist, but `metaflow/TestNvidiaSmi/237/nvidia_smi/645/2.runtime_stderr.log` does exist, along with these other names:
- 2.attempt.json
- 2.runtime
- 2.runtime_stdout.log
I’ve had some interesting connection issues with GovCloud this week. This looks like it’s a permission issue, either on the role or on your S3 bucket. I haven’t been able to fully figure out the errors yet. Are you using the EKS deploy?
@important-rainbow-87332 What I don't get is how it is clearly putting the files there, and it has `s3:*` permissions, so there is no reason it can't get them as well. There is just one role in play here, and the job is just a simple test job:
```python
@environment(
    vars={
        "AWS_DEFAULT_REGION": "us-gov-west-1",
        "AWS_REGION": "us-gov-west-1",
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "METAFLOW_S3_RETRY_COUNT": "10",
        "NVIDIA_VISIBLE_DEVICES": "all",
    }
)
@kubernetes(
    image="artifactory.roboticresearch.com/docker/dltp:test",
    image_pull_policy="Never",
    gpu=1,
    tolerations=[
        {
            "key": "node-role.kubernetes.io/control-plane",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
)
@resources(gpu=1)
@retry()
@step
def nvidia_smi(self):
    """Make sure that we can run nvidia-smi."""
    subprocess.run(["/usr/bin/nvidia-smi", "-q"])
    self.next(self.end)
```
The error says it's failing to download the code package, which happens before the step is run, so setting those `AWS_*` environment variables doesn't apply. I'm not really familiar with K8s, but whatever role is being used to run the container is the one that downloads the code package. Are you able to access the file directly using that role?
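(One quick way to see what the container itself has at start, before any decorators run, is to check which credential-related variables are visible. This is a hypothetical helper, not part of Metaflow; you would run it inside the pod, e.g. via `kubectl exec`:)

```python
import os


def credential_report(env=None):
    """Report which AWS credential-related variables are visible.

    The code-package download uses whatever credentials the pod has at
    start, so these must be present before any @environment decorator
    is applied to the step.
    """
    env = os.environ if env is None else env
    keys = (
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_DEFAULT_REGION",
        "AWS_WEB_IDENTITY_TOKEN_FILE",  # set when IRSA / web identity is used
    )
    return {k: ("set" if env.get(k) else "missing") for k in keys}


if __name__ == "__main__":
    for name, status in credential_report().items():
        print(name, status)
```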
@incalculable-xylophone-18959 I’m experiencing similar deployment issues with EKS in GovCloud. Could you share which EKS version you’re using? Have you identified the root cause of the problem?
@important-rainbow-87332 Well, I don't think the errors above are GovCloud related; they come from trying to have an on-prem Metaflow cluster talk to an AWS-hosted metadata service.
@incalculable-xylophone-18959 I’m fully on GovCloud, and a new cluster is being deployed. I just found a fix for the error on my end. There were two issues: I had to update the S3 endpoint to be GovCloud-specific rather than the default from the example, and Argo’s IAM role had neither S3 permissions nor KMS permissions.
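(For reference, the endpoint half of that fix can be sketched as Metaflow config settings. The values below are illustrative placeholders, and the exact config key names, e.g. `METAFLOW_S3_ENDPOINT_URL`, are assumptions to verify against your Metaflow version; the missing S3/KMS permissions on Argo's IAM role would be fixed separately in IAM policy:)

```python
import json

# Hedged sketch of GovCloud-specific Metaflow settings; values are
# placeholders, not a definitive deployment config.
govcloud_config = {
    # Point the S3 client at the GovCloud partition endpoint instead of
    # the commercial default.
    "METAFLOW_S3_ENDPOINT_URL": "https://s3.us-gov-west-1.amazonaws.com",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://edflows3tesas58h/metaflow",
    "METAFLOW_DEFAULT_DATASTORE": "s3",
}


def write_metaflow_config(path, cfg):
    """Write settings to a Metaflow config file
    (normally ~/.metaflowconfig/config.json)."""
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```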
Nice! I do remember having to do that as well. It's just been a lot of troubleshooting, so it hasn't been clear what the necessary steps are.