# dev-metaflow
c
Hi Metaflow, we are noticing that EC2 compute instances are left running for long periods of time (these are instances orchestrated by AWS Batch) even when there are no tasks running on the Batch Compute Environment. Has anyone else encountered this issue?
a
That's unfortunately the behavior of the AWS Batch Compute Environment, which keeps instances around for a while even when the job queue is empty. What are your settings for minVCPUs and desiredVCPUs?
c
minVCPU -> 8, maxVCPU -> 1500, and desired is currently 1440 despite only a single job running, with a Metaflow Batch request of 16 vCPUs and 10 GB of memory
We have many EC2 instances that were launched many weeks ago and are still left running
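For reference, a minimal boto3 sketch of how to check a compute environment's vCPU settings, assuming AWS credentials are configured; `my-compute-env` is a placeholder for the actual compute environment name:

```python
# Inspect a Batch compute environment's min/desired/max vCPU settings.
import boto3

batch = boto3.client("batch")

resp = batch.describe_compute_environments(computeEnvironments=["my-compute-env"])
for ce in resp["computeEnvironments"]:
    cr = ce.get("computeResources", {})
    print(
        ce["computeEnvironmentName"],
        "min:", cr.get("minvCpus"),
        "desired:", cr.get("desiredvCpus"),
        "max:", cr.get("maxvCpus"),
    )
```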
a
How AWS Batch computes desired vCPUs is a bit opaque. Let me see if @worried-machine-92008 or @purple-engineer-56290 can shed some light on this.
c
Sounds good, thanks
Sounds like I might need to open a ticket with AWS for this one
w
@square-wire-39606 "for a while" should be a max of, say, 4 minutes
👍 1
with most nodes scaling down in less than 2
Batch computes desired vCPUs from the number of jobs in RUNNING + RUNNABLE (not all RUNNABLE jobs, but up to several thousand)
we do that in roughly 2-minute cycles
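For context, a minimal boto3 sketch of counting the RUNNING and RUNNABLE jobs in a queue, the two states described above; `my-job-queue` is a placeholder for the actual job queue name:

```python
# Count RUNNING and RUNNABLE jobs in a Batch job queue, the states Batch
# considers when sizing desired vCPUs.
import boto3

batch = boto3.client("batch")
paginator = batch.get_paginator("list_jobs")

for status in ("RUNNING", "RUNNABLE"):
    count = sum(
        len(page["jobSummaryList"])
        for page in paginator.paginate(jobQueue="my-job-queue", jobStatus=status)
    )
    print(status, count)
```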
@cool-father-45885 have you reached out to AWS support?
I'll be more clear in saying that I don't know of any cause right now that would lead Batch to leave nodes alive like that. Batch is very aggressive about terminating unused nodes. That doesn't mean it can't happen, just that there's no cause we've encountered
c
@worried-machine-92008 thanks for the reply. I have not reached out to AWS support yet; I wanted to see what the Metaflow team thought first. I suspect this is a Batch/ECS issue, however, as I am seeing instances left alive for weeks at a time even when the job queue for the compute environment in question is empty (no RUNNING and no RUNNABLE jobs)
w
Yeah, that's very much an issue that the Batch team can and should look into.
If you create a case and provide me with a case ID, I can make sure we get eyes on it.
c
Sounds good, will update here
p
@plain-baker-2104: @cool-father-45885 Have you installed Container Insights for the instances? This caused this exact problem for us: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-ECS-instancelevel.html Check the ECS dashboard and see what tasks are running on those instances; you should see that they are only cwagent. Shutting down the service will kill the instances too. This was confirmed by AWS support and we got a refund.
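As a rough sketch of the check described above, the following boto3 snippet lists the tasks running on each container instance in the ECS cluster behind the compute environment; `my-batch-cluster` is a placeholder for the actual cluster name:

```python
# List the tasks on each container instance to confirm whether only a daemon
# agent (e.g. cwagent) is keeping otherwise-idle instances alive.
import boto3

ecs = boto3.client("ecs")
cluster = "my-batch-cluster"

instance_arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
for arn in instance_arns:
    task_arns = ecs.list_tasks(cluster=cluster, containerInstance=arn)["taskArns"]
    if not task_arns:
        print(arn, "no tasks")
        continue
    tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
    print(arn, [t.get("group") for t in tasks])
```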
c
@plain-baker-2104 hey, thanks for the message. I discovered this was caused by a daemon service that was installed onto the instances in our ECS cluster by another team without my knowledge. Most likely Container Insights would have caught it!
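For completeness, a hedged sketch of how such a daemon service could be spotted, by filtering the cluster's services for the DAEMON scheduling strategy; `my-batch-cluster` is again a placeholder:

```python
# List services deployed with the DAEMON scheduling strategy, which is how an
# agent running on every instance in the cluster would typically be deployed.
import boto3

ecs = boto3.client("ecs")
resp = ecs.list_services(cluster="my-batch-cluster", schedulingStrategy="DAEMON")
print(resp["serviceArns"])
```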