# dev-metaflow
c
Hi Metaflow, we are noticing that EC2 compute instances are left running for long periods of time (these are instances orchestrated by AWS Batch) even when there are no tasks running on the Batch Compute Environment. Has anyone else encountered this issue?
a
That's unfortunately the behavior of the AWS Batch Compute Environment, which keeps instances around for a while even when the job queue is empty. What are your settings for minVCPUs and desiredVCPUs?
c
minVCPU -> 8, maxVCPU -> 1500, and desired is currently 1440 despite only a single job running, with a Metaflow Batch request of 16 vCPUs and 10 GB of memory
We have many EC2 instances that were launched many weeks ago and are still left running
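For reference, a minimal boto3 sketch of how to check a compute environment's vCPU settings, assuming AWS credentials are configured; `my-compute-env` is a placeholder for the actual compute environment name:

```python
# Inspect a Batch compute environment's min/desired/max vCPU settings.
import boto3

batch = boto3.client("batch")

resp = batch.describe_compute_environments(computeEnvironments=["my-compute-env"])
for ce in resp["computeEnvironments"]:
    cr = ce.get("computeResources", {})
    print(
        ce["computeEnvironmentName"],
        "min:", cr.get("minvCpus"),
        "desired:", cr.get("desiredvCpus"),
        "max:", cr.get("maxvCpus"),
    )
```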
a
How AWS Batch computes desired vCPUs is a bit opaque. Let me see if @worried-machine-92008 or @purple-engineer-56290 can shed some light on this.
c
Sounds good, thanks
Sounds like I might need to open a ticket with AWS for this one
w
@square-wire-39606 "for a while" should be a max of, say, 4 minutes
👍 1
with most nodes scaling down in less than 2
Batch computes desired vCPUs from the number of jobs in RUNNING + RUNNABLE (not all RUNNABLE jobs, but up to several thousand)
we do that in roughly 2-minute cycles
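For context, a minimal boto3 sketch of counting the RUNNING and RUNNABLE jobs in a queue, the two states described above; `my-job-queue` is a placeholder for the actual job queue name:

```python
# Count RUNNING and RUNNABLE jobs in a Batch job queue, the states Batch
# considers when sizing desired vCPUs.
import boto3

batch = boto3.client("batch")
paginator = batch.get_paginator("list_jobs")

for status in ("RUNNING", "RUNNABLE"):
    count = sum(
        len(page["jobSummaryList"])
        for page in paginator.paginate(jobQueue="my-job-queue", jobStatus=status)
    )
    print(status, count)
```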
@cool-father-45885 have you reached out to AWS support?
I'll be more clear in saying that I don't know of any cause right now that would lead Batch to leave nodes alive like that. Batch is very aggressive about terminating unused nodes. That doesn't mean it can't happen, just that there's no cause we've encountered
c
@worried-machine-92008 thanks for the reply. I have not reached out to AWS support yet; I wanted to see what the Metaflow team thought first. I suspect this is a Batch/ECS issue, however, as I am seeing instances left alive for weeks at a time even when the job queue for the compute environment in question is empty (no RUNNING and no RUNNABLE jobs)
w
Yeah, that's very much an issue that the Batch team can and should look into.
If you create a case and provide me with a case ID, I can make sure we get eyes on it.
c
Sounds good, will update here
p
@plain-baker-2104: @cool-father-45885 Have you installed Container Insights for the instances? This caused this exact problem for us: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-instancelevel.html Check the ECS dashboard and see what tasks are running on those instances; you should see that they are only cwagent. Shutting down the service will kill the instances too. This was confirmed by AWS support and we got a refund.
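As a rough sketch of the check described above, the following boto3 snippet lists the tasks running on each container instance in the ECS cluster behind the compute environment; `my-batch-cluster` is a placeholder for the actual cluster name:

```python
# List the tasks on each container instance to confirm whether only a daemon
# agent (e.g. cwagent) is keeping otherwise-idle instances alive.
import boto3

ecs = boto3.client("ecs")
cluster = "my-batch-cluster"

instance_arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
for arn in instance_arns:
    task_arns = ecs.list_tasks(cluster=cluster, containerInstance=arn)["taskArns"]
    if not task_arns:
        print(arn, "no tasks")
        continue
    tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
    print(arn, [t.get("group") for t in tasks])
```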
c
@plain-baker-2104 hey, thanks for the message. I discovered this was caused by a daemon service that was installed onto the instances in our ECS cluster by another team without my knowledge. Most likely Container Insights would have caught it!
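For completeness, a hedged sketch of how such a daemon service could be spotted, by filtering the cluster's services for the DAEMON scheduling strategy; `my-batch-cluster` is again a placeholder:

```python
# List services deployed with the DAEMON scheduling strategy, which is how an
# agent running on every instance in the cluster would typically be deployed.
import boto3

ecs = boto3.client("ecs")
resp = ecs.list_services(cluster="my-batch-cluster", schedulingStrategy="DAEMON")
print(resp["serviceArns"])
```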