# ask-metaflow
m
Hi team! Wanted to see if anyone has seen this error before - we have large fan out jobs with intermittent pods that fail, and we see a 400/500 error with this message:
failed (code 400): {"message": "need to register run_id and task_id first"}
p
Commenting to follow the issue, as I've also seen 4xx errors on large fanout jobs but hadn't dug into them yet.
m
So a quick initial look leads me to think the DB is getting throttled, because we're seeing 100% CPU utilization spikes on these jobs. Maybe a rephrase of this question: is there a Metaflow config setting to cap the number of requests made to the database, vs just increasing the CPU of the database to infinity?
p
Are you running a cluster with a separate reader pool? That helped us get our db CPU under control for the most part (although I'm still seeing the issue on big fanouts)
m
By separate reader pool, can you elaborate?
I want to say yes, but making sure we're talking about the same thing
p
https://github.com/Netflix/metaflow-service/releases/tag/v2.4.4 - the USE_SEPARATE_READER_POOL setting set to "True", and MF_METADATA_DB_READ_REPLICA_HOST set to the host for the read-only connections.
Along with, in our case, an Aurora/RDS cluster that manages a rw and a ro endpoint, the latter of which I put in the MF_METADATA_DB_READ_REPLICA_HOST variable.
And to make sure it took, I can see both the reader (orange) and writer (green) RDS instances getting connections
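For reference, a rough sketch of the idea behind that v2.4.4 setting, not the metadata service's actual implementation: when USE_SEPARATE_READER_POOL is "True", read-only queries are served from the host in MF_METADATA_DB_READ_REPLICA_HOST (e.g. the Aurora ro endpoint) while writes stay on the primary. Hostnames and credentials below are placeholders.
```python
import os
import aiopg

# Placeholders - the real service builds its DSNs from its own MF_* settings.
PRIMARY_HOST = "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com"  # rw endpoint (illustrative)
REPLICA_HOST = os.environ.get("MF_METADATA_DB_READ_REPLICA_HOST", PRIMARY_HOST)
USE_READER_POOL = os.environ.get("USE_SEPARATE_READER_POOL", "False") == "True"

def dsn(host):
    # Illustrative credentials only.
    return f"dbname=metaflow user=metaflow password=changeme host={host} port=5432"

async def create_pools():
    # Writes always go to the primary (rw) endpoint.
    write_pool = await aiopg.create_pool(dsn(PRIMARY_HOST))
    # With the reader pool enabled, reads (e.g. what the UI issues) go to the
    # replica instead, which is why connections show up on both RDS instances.
    read_pool = await aiopg.create_pool(dsn(REPLICA_HOST)) if USE_READER_POOL else write_pool
    return write_pool, read_pool
```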
m
Ah got it - we have not upgraded to that yet, but given that you're still getting 4xx errors, I'm not sure it will solve it for us either, because our utilization jumps from about 2% to 100% on the large job and is stable otherwise
p
I'd say going to the split reader/writer pool has greatly helped with our DB CPU issues. I think there's something in the way we kick off a lot of jobs at once that still results in spikes, but this has at least taken the CPU alerts from firing every day to firing just once a week. Plus I feel more confident that I could add additional (small) instances to the reader pool and further spread that read load. I have not yet invested any serious time into monitoring or examining the DB load.
m
Curious if anyone from the Outerbounds team has suggestions here as well, given that both Lamont and I are hitting this with different setups
s
@microscopic-painting-61536 what's the size of the db that you are using? we have seen this happen when the db is under-resourced
a
also, our recommendation for production setup is to have a dedicated read replica for powering the UI - leaving the primary instance to only power the metadata service.
by design, metaflow avoids reading metadata during flow execution and only writes to the db
p
Thank you, it did not occur to me to point the UI at the read replica, I'm sure ours are still pointing to the original rw
s
The deployment templates available in OSS don't create read replicas, in order to reduce cost
You would also want to ensure that the UI is not configured to speak to the metadata service directly, but rather to a container-local version of the metadata service that is in turn configured to speak with the read replica
p
while I have your attention on the subject, is there a particular replica delay past which issues start to arise with the split reader/writer regime described above? I'd seen numbers between 8 and 16ms, which didn't seem alarming, but I'm curious what kind of threshold alerts we should set to stay ahead of user-facing problems.
m
Thanks so much @ancient-application-36103! We're on a db.m7g.4xlarge, so 16 vCPUs - we don't have the read replica or the UI <-> metadata service setup as described either, which could also help
are there any docs on this?
s
@prehistoric-salesclerk-95013 the replica delay will manifest itself in terms of the UI being a bit behind. Only the UI will be served from the read replica - so a slight delay should be tolerable.
@microscopic-painting-61536 - unfortunately, no docs, but setting up replicas should hopefully be straightforward. Also, the production setup would likely look very different for different companies. Not knowing enough about your scale and usage patterns, it's hard to judge whether a 4xlarge instance would be enough.
m
yeah, it's odd because CPU hovers around 2% consistently across all of our dozens of models, except for a 30-minute period where we have a singular massive fan-out
and it spikes to 99%
a
How massive is the fan out? A t3.small should be able to support 10k-50k way concurrent foreach without much trouble - that’s what my development stack looks like.
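For context, the kind of foreach fan-out being discussed looks roughly like the sketch below; the flow name and branch count are illustrative. Each foreach branch becomes its own task (a pod, in the Kubernetes/Argo setup above) and registers its run/task metadata and heartbeats with the metadata service, which is where the burst of writes comes from.
```python
from metaflow import FlowSpec, step

class FanoutSketchFlow(FlowSpec):

    @step
    def start(self):
        # ~500-way fan-out, in line with the concurrency figures in this thread.
        self.shards = list(range(500))
        self.next(self.process, foreach="shards")

    @step
    def process(self):
        # self.input is the element of self.shards assigned to this task;
        # each task writes its own metadata to the service.
        self.result = self.input * 2
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FanoutSketchFlow()
```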
g
@square-wire-39606 we usually have about 300 concurrent steps running at 18:00, and about 500 running at 6:00, on average
s
That’s not much scale at all. It’s likely that there are issues with the db setup.
g
actually was able to capture this in real time:
ml-ops-metaflow-service-5576458685-9njk7 metaflow-service INFO:aiohttp.access:127.0.0.6 [11/Jul/2024:21:29:53 +0000] "GET /flows/AutoSegmentsInferenceFlow/runs/argo-autosegments.prod.autosegmentsinferenceflow-1720731600 HTTP/1.1" 200 689 "-" "python-requests/2.32.3"
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service ERROR:AsyncPostgresDB:global:Exception occurred
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service Traceback (most recent call last):
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/aiopg/pool.py", line 317, in _acquire
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     await self._cond.wait()
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/usr/local/lib/python3.11/asyncio/locks.py", line 267, in wait
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     await fut
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service asyncio.exceptions.CancelledError
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service During handling of the above exception, another exception occurred:
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service Traceback (most recent call last):
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/root/services/data/postgres_async_db.py", line 366, in update_row
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     await self.db.pool.cursor(
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/aiopg/pool.py", line 414, in cursor
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     conn = await self.acquire()
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service            ^^^^^^^^^^^^^^^^^^^^
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/aiopg/pool.py", line 307, in _acquire
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     async with async_timeout.timeout(self._timeout), self._cond:
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     self._do_exit(exc_type)
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service   File "/opt/latest/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service     raise asyncio.TimeoutError
ml-ops-metaflow-service-5576458685-tgxd2 metaflow-service TimeoutError
it's simply one inference flow causing the spike. Its normal dataset has increased by about 10x over the last two months
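The traceback above is aiopg's connection pool timing out while waiting for a free connection: every request needs a pooled DB connection, and when the pool stays exhausted for longer than the acquire timeout, the service raises TimeoutError and the client sees a 4xx/5xx. A minimal sketch of that failure mode, with an illustrative DSN, pool size, and timeout:
```python
import asyncio
import aiopg

# Placeholder DSN - any reachable Postgres will do for this sketch.
DSN = "dbname=metaflow user=metaflow password=changeme host=localhost port=5432"

async def hold_connection(pool, seconds):
    # Simulate a slow request that keeps its pooled connection checked out.
    async with pool.acquire():
        await asyncio.sleep(seconds)

async def main():
    # A tiny pool and a short acquire timeout make the problem easy to hit;
    # a large fan-out flooding the service with concurrent writes has the
    # same effect when the pool can't hand out connections fast enough.
    pool = await aiopg.create_pool(DSN, minsize=1, maxsize=2, timeout=1.0)
    try:
        # Three tasks compete for two connections: the third waits on the pool,
        # exceeds the 1s timeout, and raises asyncio.TimeoutError - the same
        # path through aiopg/pool.py _acquire as in the traceback above.
        await asyncio.gather(*(hold_connection(pool, 5) for _ in range(3)))
    except asyncio.TimeoutError:
        print("pool acquire timed out, as in the service logs")
    finally:
        pool.close()
        await pool.wait_closed()

asyncio.run(main())
```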
a
The timeout is always when registering the run id and task id?
It could be misconfigured indices - we recommend setting up indices only on the read replica for the UI
g
Not just when registering the run id/task id - there's a plethora of timeouts in the logs after heartbeats, metadata, etc.