hallowed-soccer-94479
04/08/2025, 5:04 PM
2025-04-07 21:08:42.224 UTC [9378]: [1-1] db=metaflow,user=metaflow ERROR: duplicate key value violates unique constraint "steps_v3_flow_id_run_id_step_name_key"
My guess is that this specific error might result from initial write attempts timing out but eventually succeeding, causing subsequent retries to conflict with the unique constraint.
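To make that guess concrete, here is a minimal sketch of the failure mode. It is not how the Metaflow service actually writes rows: SQLite stands in for Postgres, the `steps` table and its values are hypothetical, and the three columns are only inferred from the constraint name in the log line. It shows a retry of an already-committed insert hitting the unique constraint, and the same retry succeeding quietly when the insert is written with `ON CONFLICT ... DO NOTHING`.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres metadata database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE steps (
        flow_id   TEXT NOT NULL,
        run_id    TEXT NOT NULL,
        step_name TEXT NOT NULL,
        UNIQUE (flow_id, run_id, step_name)
    )
    """
)

row = ("MyFlow", "42", "start")  # hypothetical values

# First attempt: the write commits, but suppose the client timed out
# before it saw the response and therefore decides to retry.
conn.execute("INSERT INTO steps VALUES (?, ?, ?)", row)
conn.commit()

# Retry with a plain INSERT: this violates the unique constraint.
try:
    conn.execute("INSERT INTO steps VALUES (?, ?, ?)", row)
    retry_failed = False
except sqlite3.IntegrityError as exc:
    retry_failed = True
    print("retry failed:", exc)

# Retry with an idempotent insert: the conflicting row is skipped instead.
conn.execute(
    "INSERT INTO steps VALUES (?, ?, ?) "
    "ON CONFLICT (flow_id, run_id, step_name) DO NOTHING",
    row,
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM steps").fetchone()[0]
print("rows:", row_count)
```

If this is what is happening, the duplicate-key errors would be noise from retries rather than lost writes, and the underlying problem would be the latency that triggers the timeouts.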
Additionally, we see errors originating from the Metaflow services, such as:
failed (code 400): {"message": "need to register run_id and task_id first"}
Database monitoring during these periods shows significant spikes in query latency, CPU utilization, and wait events related to CPU contention. I have had success reducing these issues by limiting max-workers for these jobs and by migrating to a database instance with more CPU resources. Has anyone else run into the same situation? Are there best practices for tuning the Metaflow service or database to perform better?
ancient-application-36103
04/08/2025, 5:05 PM
hallowed-soccer-94479
04/08/2025, 5:06 PM
ancient-application-36103
04/08/2025, 5:06 PM
ancient-application-36103
04/08/2025, 5:06 PM
ancient-application-36103
04/08/2025, 5:06 PM
hallowed-soccer-94479
04/08/2025, 5:07 PM
db-perf-optimized-N-8 instance, the volume size is 20 GB SSD
ancient-application-36103
04/08/2025, 5:09 PM
hallowed-soccer-94479
04/08/2025, 5:10 PM
hallowed-soccer-94479
04/08/2025, 5:10 PM
ancient-application-36103
04/08/2025, 5:12 PM
square-wire-39606
04/08/2025, 7:20 PM
hallowed-soccer-94479
04/08/2025, 7:22 PM
square-wire-39606
04/08/2025, 7:41 PM
ancient-application-36103
04/08/2025, 8:08 PM