# ask-metaflow
i
Hi all! I'm running into issues scaling my pipeline up from 300 workers to 512. Maybe someone more experienced can offer some advice. The stack is GKE, an Autopilot cluster, vCPU jobs.
1. My backend or DB effectively gets DDoSed by the workers and crashes. Is there any way to fix this on a K8s cluster by scaling vertically or horizontally? How do I select an optimal configuration for their pods?
2. I don't understand why RAM consumption rises to as much as 40 GB. When I run
python my_flow.py package show
there are only 10 .py files from my project plus metaflow from .venv. What else is being used? Maybe I need to run the host machine in a Docker environment or some other minimal environment?
s
Hi! Can you help me understand the bottleneck that you are running into?
i
Hi! So I have 2 issues: running out of RAM on the host, and the backend crashing, which I guess is caused by a non-optimal configuration.
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7aa3f0dd7730>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/c6/a1/efde68064c56d9f41954d4013c6a80dd6b449e81e2d9f91c6ec8989c9d02/google_cloud_secret_manager-2.20.2-py2.py3-none-any.whl.metadata
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] bash: line 1: [: -le: unary operator expected
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] bash: line 1: [: -gt: unary operator expected
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] tar: job.tar: Cannot open: No such file or directory
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] tar: Error is not recoverable: exiting now
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] /usr/local/bin/python: Error while finding module specification for 'metaflow.mflog.save_logs' (ModuleNotFoundError: No module named 'metaflow')
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] (exit code 2). This could be a transient error. Use @retry to retry.
This happens in some pods; others seem to behave well, and after a retry some of the failed ones succeed.
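The log's hint to retry transient errors can be sketched in plain Python. This is only an illustration, not Metaflow's `@retry` implementation: `retry_with_backoff` and `flaky_request` are hypothetical names, and the delays are assumed values. The random jitter is there so that hundreds of workers failing at the same moment do not all hit the backend again at once.

```python
import random
import time


def retry_with_backoff(func, attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); on failure wait a jittered, exponentially growing delay and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: pick a random delay up to the exponential cap so that
            # many workers failing together do not all retry at the same moment.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Hypothetical flaky call standing in for a request to the backend:
# it fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure in name resolution")
    return "ok"


# Record the delays instead of actually sleeping, to keep the demo instant.
delays = []
result = retry_with_backoff(flaky_request, sleep=delays.append)
```

Passing `sleep=delays.append` only keeps the demo fast; with the default `time.sleep` the same call would really wait between attempts.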