# ask-metaflow
m
Hi! Over the last couple of days we have started noticing sporadic HTTP 500 (Internal Server Error) errors from the Metaflow metadata service. It happens during pipeline runs; I've not noticed it on calls to the client API. The issue seems to be transient, as the pipeline runs through fine on retry. We run the pipelines in EKS with Argo Workflows and use an API gateway to access the metadata service. We are on Postgres 14.11 and version 2.4.13 of the metadata service. The errors occur on a range of different paths in the service, e.g. it has failed on
/flows/SomeTestFlow/runs/argo-pipeline.prod.some.test.flow-1739232420/steps/some_test_step/step
and
/flows/SomeOtherTestFlow/runs/argo-pipeline.prod.some.test.flow-1739212340/steps/some_other_test_step/tasks/t-45771ea7/metadata
Given the paths that are throwing errors and the transient nature of the issue, I don't believe the 10 MB payload limit on API gateway is the problem here. Has anyone got experience with this issue? Any pointers on how I can debug it? Thanks a lot!
s
do you see anything in the service logs that may point to an issue?
m
This is just the logs from the pipeline side at the moment. I was going to start looking through the logs on the server side in more detail, but thus far I haven't seen anything obvious.
So I have found the logs for these requests in API gateway and also the Metaflow metadata service in ECS. It looks like the status code is 200 in the metadata service, but 500 in API gateway. Annoyingly it seems I have omitted the error reason in the logging format for API gateway. I am going to add that in and then wait for the issue to come back up again.
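For reference, this is roughly the logging change I mean, as a sketch assuming a REST API stage managed with boto3; the API/stage IDs are placeholders and the $context fields should be double-checked against the access-logging variables for your gateway type:
```python
# Sketch: add the error fields to the API gateway access-log format.
# restApiId/stageName are placeholders; an access-log destination is assumed
# to be configured already (we already get the other fields in the logs).
import json
import boto3

log_format = json.dumps({
    "requestId": "$context.requestId",
    "path": "$context.path",
    "status": "$context.status",
    "responseLatency": "$context.responseLatency",
    "integrationStatus": "$context.integrationStatus",
    "integrationLatency": "$context.integrationLatency",
    # the fields we were missing:
    "errorMessage": "$context.error.message",
    "errorResponseType": "$context.error.responseType",
})

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="abc123",   # placeholder
    stageName="prod",     # placeholder
    patchOperations=[
        {"op": "replace", "path": "/accessLogSettings/format", "value": log_format},
    ],
)
```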
Oh, it also looks like the issue is happening on the heartbeat method too btw
Do you have some documentation on the request flow at the beginning of a step? The issue seems to happen pretty consistently at the beginning of a step. Also, would you expect any of the payloads for those requests to have the potential to be large, e.g. does the payload grow with the number of historic pipeline runs?
s
At the beginning we register the task ID and some metadata, and start the task heartbeat
Do you know for which endpoint you are running into issues? By design we write very small pieces of data.
The heartbeat method 404 can be safely ignored
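Roughly, the sequence at task start looks like this; a simplified sketch using requests against the paths already posted in this thread (the real calls go through the Metaflow client and carry small JSON bodies, which I've left out):
```python
# Simplified sketch of the calls made at task start, using paths from this thread.
# The real client sends small JSON bodies with each POST; they are omitted here.
import requests

BASE = "https://<api-gateway-host>"  # placeholder for your gateway endpoint
flow = "SomeTestFlow"
run = "argo-pipeline.prod.some.test.flow-1739232420"
step = "some_test_step"
task = "t-45771ea7"

# 1. register the step and task ids for this attempt
requests.post(f"{BASE}/flows/{flow}/runs/{run}/steps/{step}/step", json={})

# 2. write a handful of small metadata records for the task
requests.post(f"{BASE}/flows/{flow}/runs/{run}/steps/{step}/tasks/{task}/metadata", json=[])

# 3. start the task heartbeat, which is then pinged periodically while the task runs
requests.post(f"{BASE}/flows/{flow}/runs/{run}/steps/{step}/tasks/{task}/heartbeat")
```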
m
I can make a list of the methods that have been affected. It seems to be a range of them, though. Two of them are in the original post.
Also, the extra error logging on API gateway didn't show anything illuminating, so we are adding AWS X-Ray now so that we can see the full request flow etc.
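For completeness, turning tracing on for the gateway stage is a small change; a sketch with boto3 and placeholder IDs, assuming a REST API stage:
```python
# Sketch: enable X-Ray tracing on the API gateway stage (placeholder IDs).
import boto3

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="abc123",   # placeholder
    stageName="prod",     # placeholder
    patchOperations=[{"op": "replace", "path": "/tracingEnabled", "value": "true"}],
)
```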
OK, so I have two more examples. The paths are as follows:
• flows/SomeFlow/runs/argo-pipeline.prod.someflow/steps/_parameters/tasks/t-1c05f3ee-param/artifact
• flows/SomeOtherFlow/runs/argo-pipeline.prod.someotherflow/steps/training/tasks/t-c7c069ea/heartbeat
I find the heartbeat one odd, as presumably it does next to nothing, right? AWS X-Ray only caught the first example in its sampling. I've not used X-Ray before, but I believe it is showing (1) that the request from API gateway to the metadata service was successful but took 10 seconds, and (2) that the response from API gateway to the client has a 500 error. I have opened an AWS support request on the second point. With regard to the first, do you know why this request would take such a long time? The CPU utilisation on the metadata service and the underlying RDS instance was pretty low at the time of the request.
s
heartbeats are what's used to paint the state in the UI
these calls are all very lightweight and shouldn't take much time - if you trace these calls - where is the time being spent?
m
Yeah, this is what I expected, which is why I found it confusing that I am getting errors on these requests too.
This is the trace (shown in AWS X-Ray) for one of the failed heartbeats. The top row is the API gateway, while the second is the internal metadata service. It is spending 10 seconds on the heartbeat. It says the response status is OK, but oddly there is no response code, which I find suspicious. It feels to me like it is not able to reach the service at all, yet for some reason it shows the second request's status as OK.
So it turns out it is a transient issue with AWS API gateway. One of their internal hosts was timing out, which was causing the errors. Apparently they have a 99.95% SLA on API gateway, and the number of errors we were seeing was well below that threshold. Their recommendation is to add retries. We can do that internally as we are using Metaflow extensions, but those who are not will be affected by this. In particular, I note that Metaflow does not retry on this status code, so such failures will kill pipelines unless they happen on a heartbeat request. I can imagine some people will want to be able to retry in these cases.
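For anyone who hits the same thing, this is roughly the retry behaviour we are adding on our side; a sketch using requests/urllib3 with a placeholder host, and note that retrying POSTs assumes the endpoint tolerates duplicate writes:
```python
# Sketch: retry transient 5xx responses from the API gateway in front of the
# metadata service. Host and run/task IDs are placeholders.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # urllib3 >= 1.26; older versions use method_whitelist

retries = Retry(
    total=3,
    backoff_factor=1,                       # ~1s, 2s, 4s between attempts
    status_forcelist=[500, 502, 503, 504],  # the transient gateway errors we saw
    allowed_methods=["GET", "POST"],        # POST retries assume the write is idempotent
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.post(
    "https://<api-gateway-host>/flows/SomeFlow/runs/<run-id>/steps/training/tasks/t-c7c069ea/heartbeat"
)
resp.raise_for_status()
```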