# ask-metaflow
Hello team, I am encountering a bizarre error in my self-hosted Metaflow deployment. I am on AWS, using the recommended Terraform deployment setup with only minor changes. Starting today, our data scientists began reporting API calls to the metadata service failing out of nowhere. For example, a client call like
run = Flow('MyFlow')['sfn-d646da04-bf25-43de-9ef1-75c134da89d7']
results in at least two REST API calls to the metadata service, specifically:
GET /flows/MyFlow/runs
GET /flows/MyFlow/runs/sfn-d646da04-bf25-43de-9ef1-75c134da89d7
Strangely, only the first GET request results in an HTTP 500 "Internal Server Error". All other API calls work fine, and our UI service responds normally. I investigated and, to my surprise, the service logs for that request report an HTTP 200 OK response, while clients receive a 500 on their end. The only explanation I can think of is that the error originates from API Gateway or some other middleware in the path. This is plausible: our MyFlow has many thousands of executions on our platform, so the total response size could exceed 10 MB, and API Gateway has a hard limit of 10 MB on payload size.
1. Do you agree with my assessment? I cannot prove it conclusively, but connecting directly to the load balancer instead of API Gateway supports the hypothesis: the response size is just over 10 MB.
2. Any suggestions on how we can overcome it?
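For anyone who wants to repeat the size check, it can be sketched in Python roughly as below. This is only an illustration under stated assumptions: the helper names `fetch_runs_body` and `exceeds_gateway_limit` are hypothetical, the 10 MB limit is taken to mean 10 * 1024 * 1024 bytes, and the base URL of the load balancer has to be filled in for your own deployment.

```python
from urllib.request import Request, urlopen

# Assumption: API Gateway's 10 MB payload cap is interpreted as 10 * 1024 * 1024 bytes.
GATEWAY_LIMIT_BYTES = 10 * 1024 * 1024


def exceeds_gateway_limit(body: bytes, limit: int = GATEWAY_LIMIT_BYTES) -> bool:
    """Return True if the response body is larger than the assumed gateway limit."""
    return len(body) > limit


def fetch_runs_body(base_url: str, flow: str, headers=None) -> bytes:
    """Fetch GET /flows/<flow>/runs from the given base URL and return the raw body.

    base_url is a placeholder for the load balancer address (hypothetical here);
    headers can carry whatever authentication your deployment requires.
    """
    url = f"{base_url.rstrip('/')}/flows/{flow}/runs"
    with urlopen(Request(url, headers=headers or {})) as resp:
        return resp.read()
```

Calling `exceeds_gateway_limit(fetch_runs_body(base_url, "MyFlow"))` with `base_url` pointed at the load balancer rather than API Gateway would show whether the run listing alone is over the limit.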