# ask-metaflow
m
Hi team! Wanted to see if anyone has seen this error before - we have large fan out jobs with intermittent pods that fail, and we see a 400/500 error with this message:
failed (code 400): {"message": "need to register run_id and task_id first"}
p
Commenting to follow the issue, as I've also seen 4xx errors on large fanout jobs but hadn't dug into them yet.
m
So a quick initial look leads me to think the DB is getting throttled, because we're seeing 100% CPU utilization spikes on these jobs. Maybe a rephrase of this question: is there a Metaflow config setting to cap the number of requests made to the database, vs just increasing the CPU of the database to infinity?
p
Are you running a cluster with a separate reader pool? That helped us get our db CPU under control for the most part (although I'm still seeing the issue on big fanouts)
m
By separate reader pool, can you elaborate?
I want to say yes, but making sure we're talking about the same thing
p
https://github.com/Netflix/metaflow-service/releases/tag/v2.4.4 - the USE_SEPARATE_READER_POOL setting set to "True", and MF_METADATA_DB_READ_REPLICA_HOST set to the host for the read-only connections.
Along with, in our case, an Aurora/RDS cluster that manages a rw and a ro endpoint, the latter of which I put in the MF_METADATA_DB_READ_REPLICA_HOST variable.
And to make sure it took, I can see both the reader (orange) and writer (green) RDS instances getting connections
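For reference, a rough sketch of the idea behind that v2.4.4 setting, not the metadata service's actual implementation: when USE_SEPARATE_READER_POOL is "True", read-only queries are served from the host in MF_METADATA_DB_READ_REPLICA_HOST (e.g. the Aurora ro endpoint) while writes stay on the primary. Hostnames and credentials below are placeholders.
```python
import os
import aiopg

# Placeholders - the real service builds its DSNs from its own MF_* settings.
PRIMARY_HOST = "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com"  # rw endpoint (illustrative)
REPLICA_HOST = os.environ.get("MF_METADATA_DB_READ_REPLICA_HOST", PRIMARY_HOST)
USE_READER_POOL = os.environ.get("USE_SEPARATE_READER_POOL", "False") == "True"

def dsn(host):
    # Illustrative credentials only.
    return f"dbname=metaflow user=metaflow password=changeme host={host} port=5432"

async def create_pools():
    # Writes always go to the primary (rw) endpoint.
    write_pool = await aiopg.create_pool(dsn(PRIMARY_HOST))
    # With the reader pool enabled, reads (e.g. what the UI issues) go to the
    # replica instead, which is why connections show up on both RDS instances.
    read_pool = await aiopg.create_pool(dsn(REPLICA_HOST)) if USE_READER_POOL else write_pool
    return write_pool, read_pool
```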
m
Ah got it - we have not upgraded to that yet, but given that you're still getting 4xx errors, I'm not sure it will solve it for us either, because our utilization jumps from about 2% to 100% on the large job and is stable otherwise
p
I'd say going to the split reader/writer pool has greatly helped with our DB CPU issues. I think there's something in the way we kick off a lot of jobs at once that still results in spikes, but this has at least taken the CPU alerts from firing every day to firing just once a week. Plus I feel more confident that I could add additional (small) instances to the reader pool and further spread that read load. I have not yet invested any serious time into monitoring or examining the DB load.
m
Curious if anyone from the Outerbounds team has suggestions here as well, given that both Lamont and I are hitting this with different setups
s
@microscopic-painting-61536 what's the size of the db that you are using? we have seen this happen when the db is under-resourced
a
also, our recommendation for production setup is to have a dedicated read replica for powering the UI - leaving the primary instance to only power the metadata service.
by design, metaflow avoids reading metadata during flow execution and only writes to the db
p
Thank you, it did not occur to me to point the UI at the read replica, I'm sure ours are still pointing to the original rw
s
The deployment templates available in OSS don't create read replicas, in order to reduce cost
You would also want to ensure that the UI is not configured to speak to the metadata service directly, but rather to a container-local version of the metadata service that is in turn configured to speak with the read replica
p
while I have your attention on the subject, is there a particular replica delay past which issues start to arise with the split reader/writer regime described above? I'd seen numbers between 8 and 16ms, which didn't seem alarming, but I'm curious what kind of threshold alerts we should set to stay ahead of user-facing problems.
m
Thanks so much @ancient-application-36103! We're on a db.m7g.4xlarge, so 16 vCPUs - we don't have the read replica or the UI <-> metadata service setup as described either, which could also help
are there any docs on this?
s
@prehistoric-salesclerk-95013 the replica delay will manifest itself in terms of the UI being a bit behind. Only the UI will be served from the read replica - so a slight delay should be tolerable.
@microscopic-painting-61536 - unfortunately, no docs, but setting up replicas should hopefully be straightforward. Also, the production setup would likely look very different for different companies. Not knowing enough about your scale and usage patterns, it's hard to judge whether a 4xlarge instance would be enough.
m
yeah, it's odd because CPU hovers around 2% consistently across all of our dozens of models, except for a 30-minute period where we have a singular massive fan-out
and it spikes to 99%
a
How massive is the fan out? A t3.small should be able to support 10k-50k way concurrent foreach without much trouble - that’s what my development stack looks like.
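For context, the kind of foreach fan-out being discussed looks roughly like the sketch below; the flow name and branch count are illustrative. Each foreach branch becomes its own task (a pod, in the Kubernetes/Argo setup above) and registers its run/task metadata and heartbeats with the metadata service, which is where the burst of writes comes from.
```python
from metaflow import FlowSpec, step

class FanoutSketchFlow(FlowSpec):

    @step
    def start(self):
        # ~500-way fan-out, in line with the concurrency figures in this thread.
        self.shards = list(range(500))
        self.next(self.process, foreach="shards")

    @step
    def process(self):
        # self.input is the element of self.shards assigned to this task;
        # each task writes its own metadata to the service.
        self.result = self.input * 2
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanoutSketchFlow()
```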
g
@square-wire-39606 we usually have about 300 concurrent steps running at 18:00, and about 500 running at 6:00, on average
s
That’s not much scale at all. It’s likely that there are issues with the db setup.
g
actually was able to capture this in real time:
ml-ops-metaflow-service-5576458685-9njk7 metaflow-service INFO:aiohttp.access:127.0.0.6 [11/Jul/2024:21:29:53 +0000] "GET /flows/AutoSegmentsInferenceFlow/runs/argo-autosegments.prod.autosegmentsinferenceflow-1720731600 HTTP/1.1" 200 689 "-" "python-requests/2.32.3"
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service ERROR:AsyncPostgresDB:global:Exception occurred
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service Traceback (most recent call last):
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/aiopg/pool.py", line 317, in _acquire
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     await self._cond.wait()
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/usr/local/lib/python3.11/asyncio/locks.py", line 267, in wait
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     await fut
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service asyncio.exceptions.CancelledError
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service During handling of the above exception, another exception occurred:
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service Traceback (most recent call last):
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/root/services/data/postgres_async_db.py", line 366, in update_row
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     await self.db.pool.cursor(
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/aiopg/pool.py", line 414, in cursor
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     conn = await self.acquire()
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service            ^^^^^^^^^^^^^^^^^^^^
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/aiopg/pool.py", line 307, in _acquire
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     async with async_timeout.timeout(self._timeout), self._cond:
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     self._do_exit(exc_type)
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     raise asyncio.TimeoutError
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service TimeoutError
it's simply one inference flow causing the spike. Its normal dataset has increased by about 10x over the last two months
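The traceback above is aiopg's connection pool timing out while waiting for a free connection: every request needs a pooled DB connection, and when the pool stays exhausted for longer than the acquire timeout, the service raises TimeoutError and the client sees a 4xx/5xx. A minimal sketch of that failure mode, with an illustrative DSN, pool size, and timeout:
```python
import asyncio
import aiopg

# Placeholder DSN - any reachable Postgres will do for this sketch.
DSN = "dbname=metaflow user=metaflow password=changeme host=localhost port=5432"

async def hold_connection(pool, seconds):
    # Simulate a slow request that keeps its pooled connection checked out.
    async with pool.acquire():
        await asyncio.sleep(seconds)

async def main():
    # A tiny pool and a short acquire timeout make the problem easy to hit;
    # a large fan-out flooding the service with concurrent writes has the
    # same effect when the pool can't hand out connections fast enough.
    pool = await aiopg.create_pool(DSN, minsize=1, maxsize=2, timeout=1.0)
    try:
        # Three tasks compete for two connections: the third waits on the pool,
        # exceeds the 1s timeout, and raises asyncio.TimeoutError - the same
        # path through aiopg/pool.py _acquire as in the traceback above.
        await asyncio.gather(*(hold_connection(pool, 5) for _ in range(3)))
    except asyncio.TimeoutError:
        print("pool acquire timed out, as in the service logs")
    finally:
        pool.close()
        await pool.wait_closed()

asyncio.run(main())
```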
a
The timeout is always when registering the run id and task id?
It could be misconfigured indices - we recommend setting up indices only on the read replica for the UI
g
Not just when registering the run id/task id - there's a plethora of timeouts in the logs after heartbeats, metadata, etc.