The human-centric platform for production ML & AI

Outerbounds

Hey all,
This issue somewhat of a shot in the dark but here it goes: lately we’ve been seeing our logs fetch time in the bundled UI service to increase greatly - almost bordering on gateway timeout by the load balancer. Our log sizes have not changed (infact most steps have just 5-10 lines of logs at most). The response time is consistently over 20 seconds, many a times exceeding beyond a minute or two leading to timeouts. I am sort of bewildered how that would happen, and also uncertain where to start digging in to RCA this properly.

For context we are using the metaflow’s AWS terraform module to setup our infra, the UI backend and metadata service version is the default v1.3.0 (and I don’t think any new release has appeared either). UI backend service logs don’t show anything except in case of ELB timeouts - that timeout occurred which as you may guess is not very helpful. Service health looks good CPU &amp; memory-wise.

We are hitting ~1TB in our metaflow S3 bucket, but given the scalability of S3 I don’t see how/why that would be a concern if at all. Indeed, directly downloading log files from s3 is fast as expected, the issue is only with the service for some reason. Any pointers would be appreciated.