stale-vr-93035
10/26/2024, 12:51 AM
"meta-llama/Meta-Llama-3-8B-Instruct"
on 2 A100 GPUs.
line 94, in run_vllm
2024-10-26 00:47:39.730 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] llm = LLM(
2024-10-26 00:47:39.730 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/entrypoints/llm.py", line 178, in __init__
2024-10-26 00:47:39.730 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] self.llm_engine = LLMEngine.from_engine_args(
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/engine/llm_engine.py", line 557, in from_engine_args
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] engine = cls(
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/engine/llm_engine.py", line 324, in __init__
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] self.model_executor = executor_class(
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] super().__init__(*args, **kwargs)
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/executor/executor_base.py", line 47, in __init__
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] self._init_executor()
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/executor/multiproc_gpu_executor.py", line 124, in _init_executor
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] self._run_workers("init_device")
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
2024-10-26 00:47:48.830 [604/run_vllm/1998 (pid 132707)] Kubernetes error:
2024-10-26 00:47:48.830 [604/run_vllm/1998 (pid 132707)] Worker pods failed. This could be a transient error. Use @retry to retry.
2024-10-26 00:47:49.180 [604/run_vllm/1998 (pid 132707)]
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] driver_worker_output = driver_worker_method(*args, **kwargs)
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/worker/worker.py", line 176, in init_device
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] init_worker_distributed_environment(self.parallel_config, self.rank,
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/worker/worker.py", line 448, in init_worker_distributed_environment
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] init_distributed_environment(parallel_config.world_size, rank,
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/vllm/distributed/parallel_state.py", line 854, in init_distributed_environment
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] torch.distributed.init_process_group(
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] return func(*args, **kwargs)
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] func_return = func(*args, **kwargs)
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] store, rank, world_size = next(rendezvous_iterator)
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/torch/distributed/rendezvous.py", line 211, in _tcp_rendezvous_handler
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] File "/opt/outerbounds/fastbakery/lib/python/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] return TCPStore(
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n] torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/2 clients joined.
2024-10-26 00:47:39.731 [604/run_vllm/1998 (pid 132707)] [pod js-ab76a2-control-0-0-zpg4n]
2024-10-26 00:47:49.303 [604/run_vllm/1998 (pid 132707)] Task failed.
2024-10-26 00:47:49.514 Workflow failed.
2024-10-26 00:47:49.514 Terminating 0 active tasks...
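For anyone reading the trace above: the `DistStoreError` says the host of the rendezvous store waited for 2 participants and only 1 (itself) showed up before the timeout. Below is a rough standard-library sketch of that waiting behavior, just to illustrate what "1/2 clients joined" means. It is not how PyTorch's `TCPStore` is implemented; `wait_for_clients`, `RendezvousTimeout`, and `demo` are hypothetical names made up for this illustration, and the short timeout is chosen only to keep the demo quick.

```python
import socket
import time


class RendezvousTimeout(RuntimeError):
    """Raised when not every expected participant joins before the deadline."""


def wait_for_clients(server: socket.socket, world_size: int, timeout: float) -> list:
    """Accept connections until world_size participants (host included) have joined."""
    joined = 1  # assumption: the host process counts as the first participant
    peers = []
    deadline = time.monotonic() + timeout
    while joined < world_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        server.settimeout(remaining)
        try:
            conn, _addr = server.accept()
        except socket.timeout:
            break
        peers.append(conn)
        joined += 1
    if joined < world_size:
        for conn in peers:
            conn.close()
        raise RendezvousTimeout(
            f"Timed out after {timeout} seconds waiting for clients. "
            f"{joined}/{world_size} clients joined."
        )
    return peers


def demo(world_size: int = 2, timeout: float = 0.5, connect_worker: bool = False) -> str:
    """Run one rendezvous on localhost; return "ok" or the timeout message."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen()
    worker = None
    try:
        if connect_worker:
            # Simulates the second rank reaching the host in time.
            worker = socket.create_connection(server.getsockname())
        peers = wait_for_clients(server, world_size, timeout)
        for conn in peers:
            conn.close()
        return "ok"
    except RendezvousTimeout as exc:
        return str(exc)
    finally:
        if worker is not None:
            worker.close()
        server.close()


if __name__ == "__main__":
    print(demo(connect_worker=True))
    print(demo(connect_worker=False))
```

In this sketch the second call fails the same way the log does: the second rank never connects, so the host gives up once the timeout elapses. In the real run, that usually points at the other worker never starting or never reaching the host's address and port, which is worth checking before just adding `@retry`.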
ancient-application-36103
10/26/2024, 12:52 AM
stale-vr-93035
10/26/2024, 12:52 AM
little-apartment-49355
10/28/2024, 7:19 PM
metaflow_ray
task