mammoth-rainbow-82717
02/11/2025, 2:35 PM
We are seeing "Internal server error" errors with the Metaflow metadata service. This is happening during pipeline runs; I've not noticed it on calls to the client API. The issue seems to be transient, as the pipeline runs through fine when retrying.
We are running the pipelines in EKS with Argo Workflows and use an API gateway to access the metadata service. We are on version 14.11 of Postgres and 2.4.13 of the metadata service.
It seems to be happening on a range of different paths in the service, e.g., it has failed on /flows/SomeTestFlow/runs/argo-pipeline.prod.some.test.flow-1739232420/steps/some_test_step/step and /flows/SomeOtherTestFlow/runs/argo-pipeline.prod.some.test.flow-1739212340/steps/some_other_test_step/tasks/t-45771ea7/metadata
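To check whether the failures cluster on one kind of endpoint, the failing paths could be tallied by their final segment. This is only a rough sketch: `endpoint_kind` and `failed_paths` are hypothetical names, and the list would need to be filled from the actual gateway or pipeline logs.

```python
from collections import Counter


def endpoint_kind(path: str) -> str:
    """Return the final path segment, e.g. 'step' or 'metadata'."""
    return path.rstrip("/").rsplit("/", 1)[-1]


# Hypothetical input: the two failing paths quoted above.
failed_paths = [
    "/flows/SomeTestFlow/runs/argo-pipeline.prod.some.test.flow-1739232420"
    "/steps/some_test_step/step",
    "/flows/SomeOtherTestFlow/runs/argo-pipeline.prod.some.test.flow-1739212340"
    "/steps/some_other_test_step/tasks/t-45771ea7/metadata",
]

# Count failures per endpoint kind to see whether one dominates.
counts = Counter(endpoint_kind(p) for p in failed_paths)
for kind, n in counts.most_common():
    print(f"{kind}: {n}")
```

With only two samples this says little, but over a larger set of failures it would show whether one endpoint dominates.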
Given the paths that are throwing errors, and that the issue seems to be transient, I don't believe the 10 MB payload limit on API Gateway is the cause here.
Has anyone got any experience with this issue? Any pointers on how I can debug the problem?
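One debugging approach I could try is a small probe that repeats a request and records the status code and latency of each attempt, so a transient 500 can be matched against the service logs. This is a minimal sketch using only the Python standard library. `fetch_status` and `probe` are hypothetical helpers, not part of Metaflow, and the URL would have to point at one of the failing paths behind the gateway.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """GET the URL and return the HTTP status code, including 4xx/5xx.

    Connection-level failures (urllib.error.URLError) are not caught here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def probe(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times.

    Records (status, elapsed_seconds) for every attempt, stops at the first
    non-5xx status, and backs off exponentially between attempts.
    """
    results = []
    for attempt in range(attempts):
        start = time.monotonic()
        status = fetch(url)
        elapsed = time.monotonic() - start
        results.append((status, elapsed))
        if status < 500:
            break
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return results
```

Calling `probe(fetch_status, url)` returns one `(status, seconds)` pair per attempt. `fetch` and `sleep` are passed in so the loop can be exercised without network access.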
Thanks a lot!
square-wire-39606
02/11/2025, 3:42 PM
mammoth-rainbow-82717
02/11/2025, 3:46 PM
mammoth-rainbow-82717
02/11/2025, 5:09 PM
mammoth-rainbow-82717
02/11/2025, 5:12 PM
mammoth-rainbow-82717
02/12/2025, 12:37 PM
square-wire-39606
02/12/2025, 3:27 PM
square-wire-39606
02/12/2025, 3:27 PM
square-wire-39606
02/12/2025, 3:28 PM
mammoth-rainbow-82717
02/12/2025, 3:29 PM
mammoth-rainbow-82717
02/12/2025, 3:30 PM
mammoth-rainbow-82717
02/14/2025, 10:39 AM
• flows/SomeFlow/runs/argo-pipeline.prod.someflow/steps/_parameters/tasks/t-1c05f3ee-param/artifact
• flows/SomeOtherFlow/runs/argo-pipeline.prod.someotherflow/steps/training/tasks/t-c7c069ea/heartbeat
I find the heartbeat one odd, as presumably it does next to nothing, right?
AWS X-Ray only caught the first example in its sampling. I've not used X-Ray before, but I believe it is showing that (1) the request from API Gateway to the metadata service was successful but took 10 seconds, and (2) the response from API Gateway to the client has a 500 error.
I have opened an AWS support request on the second point.
With regard to the first, do you know why this request would take such a long time? CPU utilisation on both the metadata service and the underlying RDS instance was pretty low at the time of the request.
square-wire-39606
02/14/2025, 7:54 PM
square-wire-39606
02/14/2025, 7:55 PM
mammoth-rainbow-82717
02/16/2025, 7:44 AM
mammoth-rainbow-82717
02/16/2025, 7:45 AM
mammoth-rainbow-82717
02/16/2025, 7:47 AM
mammoth-rainbow-82717
02/18/2025, 6:11 PM