# ask-metaflow
m
Hi! Over the last couple of days we have started noticing sporadic HTTP 500 (Internal Server Error) errors from the Metaflow metadata service. It happens during pipeline runs; I've not noticed it on calls to the client API. The issue seems to be transient, as the pipeline runs through fine on retry. We run the pipelines in EKS with Argo Workflows and use an API gateway to access the metadata service. We are on Postgres 14.11 and version 2.4.13 of the metadata service. The errors occur on a range of different paths in the service, e.g. it has failed on
/flows/SomeTestFlow/runs/argo-pipeline.prod.some.test.flow-1739232420/steps/some_test_step/step
and
/flows/SomeOtherTestFlow/runs/argo-pipeline.prod.some.test.flow-1739212340/steps/some_other_test_step/tasks/t-45771ea7/metadata
Given the paths that are throwing errors and the transient nature of the issue, I don't believe the 10 MB payload limit on API gateway is the problem here. Has anyone got experience with this issue? Any pointers on how I can debug it? Thanks a lot!
s
do you see anything in the service logs that may point to an issue?
m
This is just the logs from the pipeline side at the moment. I was going to start looking through the logs on the server side in more detail, but thus far I haven't seen anything obvious.
So I have found the logs for these requests in API gateway and also the Metaflow metadata service in ECS. It looks like the status code is 200 in the metadata service, but 500 in API gateway. Annoyingly it seems I have omitted the error reason in the logging format for API gateway. I am going to add that in and then wait for the issue to come back up again.
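For reference, this is roughly the logging change I mean, as a sketch assuming a REST API stage managed with boto3; the API/stage IDs are placeholders and the $context fields should be double-checked against the access-logging variables for your gateway type:
```python
# Sketch: add the error fields to the API gateway access-log format.
# restApiId/stageName are placeholders; an access-log destination is assumed
# to be configured already (we already get the other fields in the logs).
import json
import boto3

log_format = json.dumps({
    "requestId": "$context.requestId",
    "path": "$context.path",
    "status": "$context.status",
    "responseLatency": "$context.responseLatency",
    "integrationStatus": "$context.integrationStatus",
    "integrationLatency": "$context.integrationLatency",
    # the fields we were missing:
    "errorMessage": "$context.error.message",
    "errorResponseType": "$context.error.responseType",
})

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="abc123",   # placeholder
    stageName="prod",     # placeholder
    patchOperations=[
        {"op": "replace", "path": "/accessLogSettings/format", "value": log_format},
    ],
)
```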
Oh, it also looks like the issue is happening on the heartbeat method too btw
Do you have some documentation on the request flow at the beginning of a step? The issue seems to happen pretty consistently at the beginning of a step. Also, would you expect any of the payloads for those requests to have the potential to be large, e.g. does the payload grow with the number of historic pipeline runs?
s
At the beginning we register the task ID and some metadata, and start the task heartbeat
Do you know for which endpoint you are running into issues? By design we write very small pieces of data.
The heartbeat method 404 can be safely ignored
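Roughly, the sequence at task start looks like this; a simplified sketch using requests against the paths already posted in this thread (the real calls go through the Metaflow client and carry small JSON bodies, which I've left out):
```python
# Simplified sketch of the calls made at task start, using paths from this thread.
# The real client sends small JSON bodies with each POST; they are omitted here.
import requests

BASE = "https://<api-gateway-host>"  # placeholder for your gateway endpoint
flow = "SomeTestFlow"
run = "argo-pipeline.prod.some.test.flow-1739232420"
step = "some_test_step"
task = "t-45771ea7"

# 1. register the step and task ids for this attempt
requests.post(f"{BASE}/flows/{flow}/runs/{run}/steps/{step}/step", json={})

# 2. write a handful of small metadata records for the task
requests.post(f"{BASE}/flows/{flow}/runs/{run}/steps/{step}/tasks/{task}/metadata", json=[])

# 3. start the task heartbeat, which is then pinged periodically while the task runs
requests.post(f"{BASE}/flows/{flow}/runs/{run}/steps/{step}/tasks/{task}/heartbeat")
```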
m
I can make a list of the methods that have been affected. It seems to be a range of them, though. Two of them are in the original post.
Also, the extra error logging on API gateway didn't show anything illuminating, so we are adding AWS X-Ray now so that we can see the full request flow etc.
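For completeness, turning tracing on for the gateway stage is a small change; a sketch with boto3 and placeholder IDs, assuming a REST API stage:
```python
# Sketch: enable X-Ray tracing on the API gateway stage (placeholder IDs).
import boto3

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="abc123",   # placeholder
    stageName="prod",     # placeholder
    patchOperations=[{"op": "replace", "path": "/tracingEnabled", "value": "true"}],
)
```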
OK, so I have two more examples. The paths are as follows:
• flows/SomeFlow/runs/argo-pipeline.prod.someflow/steps/_parameters/tasks/t-1c05f3ee-param/artifact
• flows/SomeOtherFlow/runs/argo-pipeline.prod.someotherflow/steps/training/tasks/t-c7c069ea/heartbeat
I find the heartbeat one odd, as presumably it does next to nothing, right? AWS X-Ray only caught the first example in its sampling. I've not used X-Ray before, but I believe it is showing (1) that the request from API gateway to the metadata service was successful but took 10 seconds, and (2) that the response from API gateway to the client has a 500 error. I have opened an AWS support request on the second point. With regard to the first, do you know why this request would take such a long time? The CPU utilisation on the metadata service and the underlying RDS instance was pretty low at the time of the request.
s
heartbeats are what's used to paint the state in the UI
these calls are all very lightweight and shouldn't take much time - if you trace these calls - where is the time being spent?
m
Yeah, this is what I expected, which is why I found it confusing that I am getting errors on these requests too.
This is the trace (shown in AWS X-Ray) for one of the failed heartbeats. The top row is the API gateway, while the second is the internal metadata service. It is spending 10 seconds on the heartbeat. It says the response status is OK, but oddly there is no response code, which I find suspicious. It feels to me like it is not able to reach the service at all, yet for some reason it shows the second request's status as OK.
So it turns out it is a transient issue with AWS API gateway. One of their internal hosts was timing out, which was causing the errors. Apparently they have a 99.95% SLA on API gateway, and the number of errors we were seeing was well below that threshold. Their recommendation is to add retries. We can do that internally as we are using Metaflow extensions, but those who are not will be affected by this. In particular, I note that Metaflow does not retry on this status code, so such failures will kill pipelines unless they happen on a heartbeat request. I can imagine some people will want to be able to retry in these cases.
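For anyone who hits the same thing, this is roughly the retry behaviour we are adding on our side; a sketch using requests/urllib3 with a placeholder host, and note that retrying POSTs assumes the endpoint tolerates duplicate writes:
```python
# Sketch: retry transient 5xx responses from the API gateway in front of the
# metadata service. Host and run/task IDs are placeholders.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # urllib3 >= 1.26; older versions use method_whitelist

retries = Retry(
    total=3,
    backoff_factor=1,                       # ~1s, 2s, 4s between attempts
    status_forcelist=[500, 502, 503, 504],  # the transient gateway errors we saw
    allowed_methods=["GET", "POST"],        # POST retries assume the write is idempotent
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.post(
    "https://<api-gateway-host>/flows/SomeFlow/runs/<run-id>/steps/training/tasks/t-c7c069ea/heartbeat"
)
resp.raise_for_status()
```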