Outerbounds #ask-metaflow

Hi all! I'm getting issues scaling up my pipeline ...

important-address-85681

11/05/2024, 7:45 PM

Hi all! I'm having issues scaling up my pipeline from 300 workers to 512. Maybe someone more experienced can offer some advice. The stack is GKE, an Autopilot cluster, vCPU jobs. 1. My backend or DB gets DDoSed and crashes (is there any way to fix this on the K8s cluster, by scaling vertically or horizontally)? How do I select the optimal configuration for their pods? 2. I don't understand the RAM consumption, which grows to 40 GB. When I do

python my_flow.py package show

there are only 10 .py files from my project and metaflow from .venv. What else is used? Maybe I need to start the host machine in a Docker env or some other minimal environment?


square-wire-39606

11/06/2024, 6:44 PM

Hi! Can you help me understand the bottleneck that you are running into?

important-address-85681

11/06/2024, 6:50 PM

Hi! So I have 2 issues: the host runs out of RAM, and the backend crashes. I guess it is because of a non-optimal configuration.

important-address-85681

11/07/2024, 4:28 AM


WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7aa3f0dd7730>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /packages/c6/a1/efde68064c56d9f41954d4013c6a80dd6b449e81e2d9f91c6ec8989c9d02/google_cloud_secret_manager-2.20.2-py2.py3-none-any.whl.metadata
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] bash: line 1: [: -le: unary operator expected
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] bash: line 1: [: -gt: unary operator expected
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] tar: job.tar: Cannot open: No such file or directory
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] tar: Error is not recoverable: exiting now
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] /usr/local/bin/python: Error while finding module specification for 'metaflow.mflog.save_logs' (ModuleNotFoundError: No module named 'metaflow')
2024-11-07 04:26:10.171 [100/calculate_clip_embedding/25372 (pid 3584772)] (exit code 2). This could be a transient error. Use @retry to retry.

important-address-85681

11/07/2024, 4:31 AM

This happens in some pods; others seem to behave well, and after a retry some of the failed ones succeed.
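Since the failures look transient (name resolution fails in a few pods and a retry often succeeds), the log's own suggestion of `@retry` on the step seems like the first thing to try. For calls made inside a step, the same idea can be sketched by hand as a retry with exponential backoff and jitter, so that hundreds of workers do not all retry at the same moment. This is only a rough sketch using the standard library: `call_with_backoff` and its parameters are hypothetical names, not part of Metaflow, and the delays are placeholder values.

```python
import random
import time


def call_with_backoff(fn, attempts=4, base_delay=1.0, max_delay=30.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fn(); on a transient error, wait and try again.

    Hypothetical helper (not a Metaflow API). The wait before each retry is
    drawn at random from [0, cap], where cap doubles on every failed attempt
    and is limited to max_delay.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Upper bound for this wait: base_delay, 2x, 4x, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random wait in [0, cap] so workers do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

It would be used by wrapping the flaky call, for example `call_with_backoff(lambda: fetch_something())`, where `fetch_something` stands in for whatever network call fails intermittently. The `sleep` argument is there only so the waits can be replaced in tests.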
