How to deal with this error: Metaflow service error ...
# ask-metaflow
i
How to deal with this error:
```
Metaflow service error:
    Metadata request (/flows/LogoDetectionFlow/runs/124/steps/process) failed (code 500): "{\"err_msg\": {\"pgerror\": null, \"pgcode\": null, \"diag\": {\"message_primary\": null, \"severity\": null}}}"
```

Process crashed
```
Metaflow 2.12.13 executing LogoDetectionFlow for user:vkulikov
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
    Metaflow service error:
    Metadata request (/flows/LogoDetectionFlow) failed (code 500): "{\"err_msg\": {\"type\": \"timeout error\"}}"
```
In the pods I see:
```
[KILLED BY ORCHESTRATOR]
Metaflow service error:
    Metadata request (/flows/ShardedFlowCPU/runs/110/steps/calculate_clip_embedding/tasks/27810/metadata) failed (code 400): {"message": "need to register run_id and task_id first"}
```
a
Hi Victor, did you start a lot of parallel processes? I had the same problem and could not solve it when starting ~50 parallel processes. When I reduced it to something more moderate (~20), the error seemed to go away.
i
Yes, I have ~348 processes
I hope to run 512 processes
a
Well, as far as I see it, we ran into a resource exhaustion problem here. Are you on-prem or somewhere on AWS/GCP ...?
i
Yes, I need a proper configuration for the backend, maybe k8s manifests for horizontal or vertical autoscaling
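(For illustration only, a minimal horizontal-autoscaling sketch for the metadata service. The Deployment name `metadata-service` and namespace `metaflow` are assumptions; they depend on how the service was deployed.)

```yaml
# Hypothetical HPA for the Metaflow metadata service.
# Deployment name and namespace are assumptions; adjust to your install.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: metadata-service
  namespace: metaflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: metadata-service
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that CPU-based scaling reacts with a delay, so it may not catch short request spikes like the ones discussed later in this thread.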
a
Yep - could be a good solution. Potentially you can come back and post your solution - I would be interested to see what you came up with.
i
Well, this is strange. The average usage of the backend is really low (like 0.1 vCPU), but at peak it cannot handle the load. So I fixed the problem by increasing the resources (autoscaling doesn't help). But a good solution would be to fix the problem inside the backend service.
a
Ah pretty neat. What kind of resources did you increase? Memory? CPU? tmp-fs? I generally share your opinion here. I had the same impression that the backend should be able to handle those peaks a bit better
i
Memory and CPUs (but it seems memory is the most important, because I saw OOM errors in the past). And K8s doesn't support swap files, so memory needs to be managed carefully.
For my task (384 workers), 4 GB of RAM seems OK for both the backend and the DB.
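(A rough sketch of that sizing as a Deployment fragment. The container name is an assumption and the requests are guesses; only the 4Gi limit mirrors the figure mentioned above.)

```yaml
# Hypothetical fragment of the metadata-service Deployment spec.
# Apply the same idea to the Postgres pod; tune requests to your workload.
spec:
  template:
    spec:
      containers:
        - name: metadata-service   # assumed container name
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"        # headroom for peaks that previously hit OOM
```

The idea is simply that peak load stays within the container's memory budget instead of triggering the OOM killer and pod restarts.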
a
4 GB of RAM for the metaflow-service, or which service did you update?