incalculable-xylophone-18959
10/28/2024, 7:40 PM
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://edflows3tesas58h.s3.us-gov-west-1.amazonaws.com/metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397"
Failed to download code package from s3://edflows3tesas58h/metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397 after 6 tries.
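One way to sanity-check the endpoint in the error above is to rebuild the virtual-hosted-style URL from the `s3://` path and the region, then compare it with what botocore reports. This is only a minimal sketch: the helper name `s3_virtual_hosted_url` is made up for illustration, and it assumes the `amazonaws.com` DNS suffix that appears in the traceback.

```python
from urllib.parse import urlparse


def s3_virtual_hosted_url(s3_url: str, region: str) -> str:
    """Build the virtual-hosted-style HTTPS URL for an s3:// path.

    Hypothetical helper for comparing against the URL in a botocore error.
    Assumes the amazonaws.com DNS suffix, as seen in the traceback.
    """
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


if __name__ == "__main__":
    print(
        s3_virtual_hosted_url(
            "s3://edflows3tesas58h/metaflow/TestNvidiaSmi/data/65/"
            "65a311fc27953da4d9dc31af34497126be1fc397",
            "us-gov-west-1",
        )
    )
```

If the printed URL matches the one in the error, the region and bucket are being resolved as expected, and the failure is more likely on the network or DNS side of the container than in the path construction.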
In the debug logs, I see that it is getting a 404 when calling HEAD for this object:
DEBUG:urllib3.connectionpool:https://edflows3tesas58h.s3.us-gov-west-1.amazonaws.com:443 "HEAD /metaflow/TestNvidiaSmi/data/65/65a311fc27953da4d9dc31af34497126be1fc397 HTTP/11" 404 0
When I look in S3, I see that this object exists, and its creation time is very close to the time of the failed HEAD call.
However, before this it seems able to PUT/HEAD/GET a number of objects. Just before the failed call, I see a successful GET /metaflow/TestNvidiaSmi/data/4c/4ca058df2ea422cca260c585409d6ac9face7ebe
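To line up the successful and failed calls, the urllib3 debug lines can be filtered by status code. The sketch below assumes the log line format shown in the debug output above; the names `REQUEST_LINE`, `Request`, `parse_requests`, and `failed_requests` are illustrative and not part of urllib3 or Metaflow.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Matches urllib3 connectionpool debug lines such as:
#   DEBUG:urllib3.connectionpool:https://host:443 "HEAD /path HTTP/11" 404 0
# Optional angle brackets around the origin (chat link markup) are tolerated.
REQUEST_LINE = re.compile(
    r'urllib3\.connectionpool:<?(?P<origin>https?://[^\s>]+)>?\s+'
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3})\b'
)


class Request(NamedTuple):
    origin: str
    method: str
    path: str
    status: int


def parse_requests(lines: Iterable[str]) -> Iterator[Request]:
    """Yield one Request per log line that looks like an HTTP request record."""
    for line in lines:
        match = REQUEST_LINE.search(line)
        if match:
            yield Request(
                match["origin"],
                match["method"],
                match["path"],
                int(match["status"]),
            )


def failed_requests(lines: Iterable[str]) -> List[Request]:
    """Return only the requests that came back with a 4xx or 5xx status."""
    return [req for req in parse_requests(lines) if req.status >= 400]
```

Running `failed_requests(open("debug.log"))` on a captured log (the file name is a placeholder) gives the list of keys that returned an error, which can then be compared against what is actually in the bucket.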
Not sure if this is related, but it seems to always retry 6 times, no matter what I set for METAFLOW_S3_RETRY_COUNT in the @environment decorators, in the calling environment, or in the metaflowconfig.
Any help is appreciated!
incalculable-xylophone-18959
10/28/2024, 8:03 PM
@retry decorator and, interestingly, it still fails with the same error (a different key, of course) on the subsequent tries, even though I have verified that the object does exist by then. I'm trying this again with debug logging to see whether the calls or errors differ during the retries.
I left out the somewhat important point that my Metaflow metadata service is running in GovCloud.
incalculable-xylophone-18959
10/28/2024, 8:15 PM
metaflow/TestNvidiaSmi/237/nvidia_smi/645/2.task_stderr.log which don't exist, but metaflow/TestNvidiaSmi/237/nvidia_smi/645/2.runtime_stderr.log does exist along with these other names:
• 2.attempt.json
• 2.runtime
• 2.runtime_stdout.log
important-rainbow-87332
10/29/2024, 12:49 AM
incalculable-xylophone-18959
10/29/2024, 1:27 PM
@environment(
    vars={
        "AWS_DEFAULT_REGION": "us-gov-west-1",
        "AWS_REGION": "us-gov-west-1",
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "METAFLOW_S3_RETRY_COUNT": "10",
        "NVIDIA_VISIBLE_DEVICES": "all",
    }
)
@kubernetes(
    image="artifactory.roboticresearch.com/docker/dltp:test",
    image_pull_policy="Never",
    gpu=1,
    tolerations=[
        {
            "key": "node-role.kubernetes.io/control-plane",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
)
@resources(gpu=1)
@retry()
@step
def nvidia_smi(self):
    """Make sure that we can run nvidia-smi."""
    subprocess.run(["/usr/bin/nvidia-smi", "-q"])
    self.next(self.end)
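When values passed through `@environment` do not seem to take effect, it can help to print what the step process actually sees. The sketch below is one possible way to do that with only the standard library; `describe_env`, `WATCHED`, and `SECRET` are hypothetical names, and the variable list is simply copied from the decorator above.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names copied from the @environment decorator in the step above.
WATCHED = (
    "AWS_DEFAULT_REGION",
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "METAFLOW_S3_RETRY_COUNT",
    "NVIDIA_VISIBLE_DEVICES",
)
# Credentials are reported as set or unset, never printed.
SECRET = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}


def describe_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Report each watched variable as its value, '<set, hidden>', or '<unset>'."""
    env = os.environ if env is None else env
    report = {}
    for name in WATCHED:
        value = env.get(name)
        if value is None:
            report[name] = "<unset>"
        elif name in SECRET:
            report[name] = "<set, hidden>"
        else:
            report[name] = value
    return report


if __name__ == "__main__":
    for name, value in describe_env().items():
        print(f"{name}={value}")
```

Calling `describe_env()` at the top of `nvidia_smi` shows the environment of the step body only. The code package is downloaded before the step body runs, so this does not show what that earlier download phase saw.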
hundreds-rainbow-67050
11/01/2024, 5:55 AM
important-rainbow-87332
11/07/2024, 7:42 PM
incalculable-xylophone-18959
11/07/2024, 7:43 PM
important-rainbow-87332
11/07/2024, 8:37 PM
incalculable-xylophone-18959
11/07/2024, 9:20 PM