How to deal with this error: Metaflow service error ...
# ask-metaflow
i
How to deal with this error:
```
Metaflow service error:
    Metadata request (/flows/LogoDetectionFlow/runs/124/steps/process) failed (code 500): "{\"err_msg\": {\"pgerror\": null, \"pgcode\": null, \"diag\": {\"message_primary\": null, \"severity\": null}}}"
```

Process crashed
```
Metaflow 2.12.13 executing LogoDetectionFlow for user:vkulikov
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
    Metaflow service error:
    Metadata request (/flows/LogoDetectionFlow) failed (code 500): "{\"err_msg\": {\"type\": \"timeout error\"}}"
```
In the pods I see:
```
[KILLED BY ORCHESTRATOR]
Metaflow service error:
    Metadata request (/flows/ShardedFlowCPU/runs/110/steps/calculate_clip_embedding/tasks/27810/metadata) failed (code 400): {"message": "need to register run_id and task_id first"}
```
a
Hi Victor, did you start a lot of parallel processes? I had the same problem and could not solve it when starting ~50 parallel processes. When I reduced it to something more moderate (~20), the error seemed to go away.
i
Yes, I have ~348 processes
I hope to run 512 processes
a
Well, as far as I see it, we ran into a resource exhaustion problem here. Are you on-prem or somewhere on AWS/GCP ...?
i
Yes, I need a proper configuration for the backend, maybe k8s manifests for horizontal or vertical autoscaling
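(For illustration only, a minimal horizontal-autoscaling sketch for the metadata service. The Deployment name `metadata-service` and namespace `metaflow` are assumptions; they depend on how the service was deployed.)

```yaml
# Hypothetical HPA for the Metaflow metadata service.
# Deployment name and namespace are assumptions; adjust to your install.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: metadata-service
  namespace: metaflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metadata-service
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that CPU-based scaling reacts with a delay, so it may not catch short request spikes like the ones discussed later in this thread.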
a
Yep - could be a good solution. Potentially you can come back and post your solution - I would be interested to see what you came up with.
i
Well, this is strange. The average usage of the backend is really low (like 0.1 vCPU), but at peak it cannot handle the load. So I fixed the problem by increasing the resources (autoscaling doesn't help). But a good solution would be to fix the problem inside the backend service.
a
Ah pretty neat. What kind of resources did you increase? Memory? CPU? tmp-fs? I generally share your opinion here. I had the same impression that the backend should be able to handle those peaks a bit better
i
Memory and CPUs (but it seems memory is the most important, because I saw OOM errors in the past). And K8s doesn't support swap files, so memory needs to be managed carefully.
For my task (384 workers), 4 GB of RAM seems OK for both the backend and the DB.
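(A rough sketch of that sizing as a Deployment fragment. The container name is an assumption and the requests are guesses; only the 4Gi limit mirrors the figure mentioned above.)

```yaml
# Hypothetical fragment of the metadata-service Deployment spec.
# Apply the same idea to the Postgres pod; tune requests to your workload.
spec:
  template:
    spec:
      containers:
        - name: metadata-service   # assumed container name
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"        # headroom for peaks that previously hit OOM
```

The idea is simply that peak load stays within the container's memory budget instead of triggering the OOM killer and pod restarts.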
a
4 GB of RAM for the metaflow-service, or which service did you update?