Hello all, I'm new to metaflow and I'm facing a small challenge | Outerbounds #dev-metaflow

cool-action-93848

11/27/2023, 4:34 PM

Hello all, I'm new to Metaflow and I'm facing a small challenge: I created a new stack on AWS with CloudFormation, and one of the steps in my flow gets data from a Postgres database (RDS). The problem is that the connection is refused because I don't know how to whitelist the Batch job in the Postgres inbound security rule. Could anyone point me in the right direction please?

average-beach-28850

11/27/2023, 6:48 PM

When the Batch compute env is created, it looks like we use the VPC default security group there: https://github.com/outerbounds/metaflow-tools/blob/46f34293b0f4b6a8b9458e4472d36360e8ca729b/aws/cloudformation/metaflow-cfn-template.yml#L1394 . You'd need to allow that SG in the RDS inbound security rule.
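A minimal sketch of that inbound rule with boto3, assuming Postgres listens on the default port 5432. The function names and the security group IDs passed in are hypothetical; `authorize_security_group_ingress` is the EC2 API call that adds the rule.

```python
def build_rds_ingress_rule(batch_sg_id, port=5432):
    """Build an IpPermissions entry that admits TCP traffic from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # Referencing a security group (instead of a CIDR) admits any instance in that group.
        "UserIdGroupPairs": [
            {"GroupId": batch_sg_id, "Description": "Metaflow Batch compute environment"}
        ],
    }


def allow_batch_on_rds(rds_sg_id, batch_sg_id, port=5432):
    """Add the rule to the RDS security group (needs AWS credentials and boto3)."""
    import boto3  # imported lazily so the rule builder works without boto3 installed

    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(
        GroupId=rds_sg_id,
        IpPermissions=[build_rds_ingress_rule(batch_sg_id, port)],
    )
```

Here `rds_sg_id` would be the group attached to the RDS instance and `batch_sg_id` the VPC default group used by the compute environment. A security group reference like this only works within one VPC, or across peered VPCs in the same region.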

average-beach-28850

11/27/2023, 6:51 PM

That's assuming the RDS instance is in the same VPC. If not, you'd need to either:
1. peer those VPCs first, or
2. create a different compute environment in a private subnet with a NAT gateway, and allow connections from the NAT gateway's public IP to the RDS instance.
In that second scenario you need the NAT gateway and the private subnet, since otherwise you won't have a fixed public IP to add to the inbound security group.
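For the NAT gateway option, the inbound rule would name a single IP rather than a security group. A rough sketch, assuming an IPv4 NAT address and the default Postgres port; the function name is hypothetical, and the returned dict is shaped like an `IpPermissions` entry for the EC2 `authorize_security_group_ingress` call.

```python
import ipaddress


def build_nat_ingress_rule(nat_public_ip, port=5432):
    """Build an IpPermissions entry that admits exactly one public IPv4 address."""
    ip = ipaddress.IPv4Address(nat_public_ip)  # raises ValueError on malformed input
    if not ip.is_global:
        # A private or reserved address here usually means the wrong IP was copied.
        raise ValueError(f"{nat_public_ip} is not a public address")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": f"{ip}/32", "Description": "Batch NAT gateway"}],
    }
```

The `/32` suffix limits the rule to that one address, which is why the fixed Elastic IP of the NAT gateway matters.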

cool-action-93848

11/28/2023, 8:16 AM

Thank you for the answer! It seems the easiest solution is to create the peering between the two VPCs

cool-action-93848

11/28/2023, 4:53 PM

After some testing I was not able to connect the two VPCs, because now any batch job I try to run gets stuck in RUNNABLE status. I tried to troubleshoot the batch job with the AWSSupport automation and got the following message:

"It seems like the container instance doesn't have communication with ECS service endpoint. Container instances need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your container instances having public IP addresses. For more information about interface VPC endpoints, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/vpc-endpoints.html If you do not have an interface VPC endpoint configured and your container instances do not have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see NAT gateways in the Amazon VPC User Guide (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) and HTTP proxy configuration in this guide https://docs.aws.amazon.com/AmazonECS/latest/developerguide/http_proxy_config.html. For more information, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-public-private-vpc.html"

cool-action-93848

11/28/2023, 4:55 PM

The funny thing is that everything was working correctly when I first ran the step function (all batch jobs were completed until the one that is failing because of the connection issue with the database)

average-beach-28850

11/28/2023, 5:31 PM

hmm I believe subnets created by the cloudformation template do have public addresses so they shouldn't need anything too special

average-beach-28850

11/28/2023, 5:33 PM

One area I'd look into: you're supposed to set up the route tables in a particular way after peering VPCs. If that's not set up right, it can lead to many mysterious issues where things can't connect to each other.
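As a sketch of what "set up right" usually means: each VPC's route tables need a route to the other VPC's CIDR through the peering connection, and the two CIDR blocks must not overlap. The helper below only builds the route specifications; the function name, the input dict layout, and all IDs are hypothetical, while the keys of each returned dict match the parameters of the EC2 `create_route` call.

```python
import ipaddress


def peering_routes(batch_vpc, rds_vpc, peering_id):
    """Return create_route kwargs for both directions of a VPC peering.

    Each VPC is a dict like {"cidr": "10.20.0.0/16", "route_table_ids": ["rtb-..."]}.
    """
    a = ipaddress.ip_network(batch_vpc["cidr"])
    b = ipaddress.ip_network(rds_vpc["cidr"])
    if a.overlaps(b):
        # Peering cannot route between VPCs whose address ranges overlap.
        raise ValueError(f"CIDR blocks overlap: {a} and {b}")

    routes = []
    for local, remote in ((batch_vpc, rds_vpc), (rds_vpc, batch_vpc)):
        for route_table_id in local["route_table_ids"]:
            routes.append(
                {
                    "RouteTableId": route_table_id,
                    "DestinationCidrBlock": remote["cidr"],
                    "VpcPeeringConnectionId": peering_id,
                }
            )
    return routes
```

Each entry could then be applied with `boto3.client("ec2").create_route(**route)`. A route in only one direction is a common cause of connections that time out.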

cool-action-93848

11/28/2023, 5:33 PM

I did set up the VPC peering as instructed in the AWS guide. I also tried removing it to see if that would solve the RUNNABLE issue, but it didn't.

average-beach-28850

11/28/2023, 5:36 PM

Did it work at all before?

cool-action-93848

11/28/2023, 5:36 PM

Yes! Everything was working for the first 6-7 step function executions.

cool-action-93848

11/28/2023, 5:37 PM

My flow has a start, a load_models, and then a get_prediction step. The first two finished correctly every time.

cool-action-93848

11/28/2023, 5:38 PM

Now if I start a new step function execution, the first step is stuck on RUNNABLE and nothing goes forward

average-beach-28850

11/28/2023, 5:44 PM

Ok, unfortunately RUNNABLE is pretty tricky to debug on Batch since there's little visibility into what's going on under the hood. So all that's left is looking for clues. I'd look at the EC2 instances in that compute env: are they in the public subnets created by the CloudFormation template? Do they have public IP addresses assigned? And triple-check for leftover routing rules from the VPC peering.
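Two of those checks can be scripted against the output of the EC2 `describe_instances` and `describe_route_tables` calls. This is only a sketch: the function names are hypothetical, and the code assumes the standard boto3 response layout (`Reservations` / `Instances`, `RouteTables` / `Routes`).

```python
def instances_without_public_ip(describe_instances_response):
    """List instance IDs that have no public IPv4 address assigned."""
    missing = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if not instance.get("PublicIpAddress"):
                missing.append(instance["InstanceId"])
    return missing


def leftover_peering_routes(describe_route_tables_response):
    """List routes that still point at a peering connection or are blackholed."""
    found = []
    for table in describe_route_tables_response.get("RouteTables", []):
        for route in table.get("Routes", []):
            # A deleted peering connection leaves its routes in the "blackhole" state.
            if route.get("VpcPeeringConnectionId") or route.get("State") == "blackhole":
                found.append(
                    (table["RouteTableId"], route.get("DestinationCidrBlock"), route.get("State"))
                )
    return found
```

Both functions take the raw response dicts, so they can be fed directly from `boto3.client("ec2")` with a filter on the VPC or compute environment in question.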

cool-action-93848

11/28/2023, 5:44 PM

Thank you for trying to help. I already did all of these things you are suggesting!

cool-action-93848

11/28/2023, 5:46 PM

All the configurations are as intended, and I deleted all the routes from the peering together with the peering itself.

cool-action-93848

11/28/2023, 5:46 PM

the only thing left to do is to recreate the whole stack and see if it happens even without creating a VPC peering

cool-action-93848

11/28/2023, 5:47 PM

if that is the case, then it would seem like the instance resources get filled up somehow and can't be freed for new batch jobs (that is my guess)

average-beach-28850

11/28/2023, 5:55 PM

Ah, that's a good thing to check for. Is there anything running at all? Another quirk of Batch is that if you request more resources than the instances in the compute env can provide, jobs get stuck in RUNNABLE.

average-beach-28850

11/28/2023, 5:55 PM

moreover if you have just 1 task in the queue that requests too much, it blocks the rest of the queue, even if those tasks could've been scheduled otherwise
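A quick way to sanity-check the resource side is to compare what the step requests (for example through Metaflow's `@batch` or `@resources` decorators) against what each instance type offers. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the 512 MiB reserved for the ECS agent and operating system is a placeholder, not a documented figure.

```python
def instances_that_fit(job_vcpus, job_memory_mib, instances, reserved_memory_mib=512):
    """Return the instance names that could host a job with the given request.

    `instances` maps a name to (vcpus, memory_mib). `reserved_memory_mib` is a
    placeholder for memory the instance cannot hand out to containers.
    """
    return [
        name
        for name, (vcpus, memory_mib) in instances.items()
        if job_vcpus <= vcpus and job_memory_mib <= memory_mib - reserved_memory_mib
    ]
```

An empty result would mean the job can never be placed, which matches the stuck-in-RUNNABLE symptom. Note that a request equal to the instance's nominal memory also fails this check, since not all of it is available to containers.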

cool-action-93848

11/28/2023, 5:57 PM

well there is always only 1 job in the queue as I'm running a single flow

cool-action-93848

11/28/2023, 5:58 PM

and the compute environment is the one from the template for cloudformation. it creates 2 instances that are up and running.

average-beach-28850

11/28/2023, 6:00 PM

gotcha.. and just to confirm, they have enough cpu/mem available for that job?

cool-action-93848

11/29/2023, 8:41 AM

yes they have
