# dev-metaflow
c
Hello all, I'm new to Metaflow and I'm facing a small challenge: I created a new stack on AWS with CloudFormation, and one of the steps in my flow gets data from a Postgres database (RDS). The problem is that the connection is refused because I don't know how to whitelist the Batch job in the Postgres inbound security rule. Could anyone point me in the right direction, please?
a
When the Batch compute env is created, it looks like we use the VPC default security group there: https://github.com/outerbounds/metaflow-tools/blob/46f34293b0f4b6a8b9458e4472d36360e8ca729b/aws/cloudformation/metaflow-cfn-template.yml#L1394 . You'd need to allow that SG in the RDS inbound security rule.
That's assuming the RDS instance is in the same VPC. If not, you'd need to either (1) peer those VPCs first, or (2) create a different compute environment in a private subnet with a NAT gateway, and allow connections from the NAT gateway's public IP to the RDS instance. In the second scenario you need the NAT gateway and the private subnet, since otherwise you won't have a fixed public IP to add to the inbound security group.
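For the same-VPC case, the inbound rule could look roughly like the sketch below. It builds the rule as a plain dict and passes it to boto3's `authorize_security_group_ingress`. The security group IDs, the helper names, and port 5432 (the PostgreSQL default) are assumptions for illustration, not values taken from the template.

```python
POSTGRES_PORT = 5432  # PostgreSQL default; an assumption, adjust to the RDS instance's port


def build_ingress_rule(batch_sg_id, port=POSTGRES_PORT):
    """Build an ingress rule that admits TCP traffic from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # Referencing the source SG (rather than a CIDR) only works inside one VPC
        # or across a peering connection that allows SG references.
        "UserIdGroupPairs": [
            {"GroupId": batch_sg_id, "Description": "Metaflow Batch compute env"}
        ],
    }


def allow_batch_on_rds(ec2_client, rds_sg_id, batch_sg_id, port=POSTGRES_PORT):
    """Add the rule to the security group attached to the RDS instance."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=rds_sg_id,
        IpPermissions=[build_ingress_rule(batch_sg_id, port)],
    )


# Usage (hypothetical IDs, requires boto3 and AWS credentials):
#   import boto3
#   allow_batch_on_rds(boto3.client("ec2"), "sg-rds-placeholder", "sg-batch-placeholder")
```

Taking the client as an argument keeps the rule-building logic checkable without touching AWS.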
c
Thank you for the answer! It seems the easiest solution is to create the peering between the two VPCs
After some testing, I was not able to connect the two VPCs, because now any Batch job I try to run gets stuck in RUNNABLE status. I tried to troubleshoot the Batch job with the AWSSupport automation and got the following message: "It seems like the container instance doesn't have communication with ECS service endpoint. Container instances need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your container instances having public IP addresses. For more information about interface VPC endpoints, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/vpc-endpoints.html If you do not have an interface VPC endpoint configured and your container instances do not have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see NAT gateways in the Amazon VPC User Guide (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) and HTTP proxy configuration in this guide https://docs.aws.amazon.com/AmazonECS/latest/developerguide/http_proxy_config.html. For more information, see https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-public-private-vpc.html"
The funny thing is that everything was working correctly when I first ran the step function (all batch jobs were completed until the one that is failing because of the connection issue with the database)
a
hmm I believe subnets created by the cloudformation template do have public addresses so they shouldn't need anything too special
One area I'd look into: you're supposed to set up the route tables in a particular way after peering VPCs. If that's not set up right, it can lead to many mysterious issues where things can't connect to each other.
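As a rough illustration of the routing requirement: each side of a peering needs a route that sends the other VPC's CIDR through the peering connection. The sketch below only builds the keyword arguments for boto3's `create_route`; the route table IDs, CIDR blocks, and the `pcx-` ID are hypothetical placeholders.

```python
def peering_route_params(route_table_id, peer_cidr, peering_connection_id):
    """Keyword arguments for ec2_client.create_route for one route table."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": peer_cidr,
        "VpcPeeringConnectionId": peering_connection_id,
    }


def routes_for_peering(batch_tables, batch_cidr, rds_tables, rds_cidr, pcx_id):
    """Routes needed in both directions; a one-way route leaves replies stranded."""
    routes = []
    # Batch-side route tables must know how to reach the RDS VPC...
    for table_id in batch_tables:
        routes.append(peering_route_params(table_id, rds_cidr, pcx_id))
    # ...and RDS-side route tables must know how to reach the Batch VPC.
    for table_id in rds_tables:
        routes.append(peering_route_params(table_id, batch_cidr, pcx_id))
    return routes


# Usage (hypothetical values, requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   for params in routes_for_peering(["rtb-a"], "10.0.0.0/16",
#                                    ["rtb-b"], "172.31.0.0/16", "pcx-placeholder"):
#       ec2.create_route(**params)
```

The two VPC CIDR ranges must not overlap for peering to work, so it is worth checking that before adding routes.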
c
I did set up the VPC peering as instructed in the AWS guide. I also tried removing it to see if that would solve the RUNNABLE issue, but it didn't.
a
Did it work at all before?
c
yes! Everything was working for the first 6-7 step function executions
My flow has a start step, a load_models step, and then a get_prediction step. The first two finished correctly every time
Now if I start a new step function execution, the first step is stuck on RUNNABLE and nothing goes forward
a
Ok, unfortunately that RUNNABLE state is pretty tricky to debug on Batch, since there's little visibility into what's going on under the hood. So all that's left is looking for clues. I'd look at the EC2 instances in that compute env: are they in the public subnets created by the CloudFormation template? Do they have public IP addresses assigned? And triple-check for leftover routing rules from the VPC peering.
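These checks can be scripted against the responses of boto3's `describe_instances` and `describe_route_tables`. The sketch below works on the response dicts only; the function names are made up for illustration, and which filters to pass to the `describe_*` calls depends on how the compute environment's instances are tagged, which is left open here.

```python
def instances_without_public_ip(describe_instances_response):
    """Instance IDs that have no public IP in a describe_instances response."""
    missing = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # The PublicIpAddress key is absent when no public IP is assigned.
            if not instance.get("PublicIpAddress"):
                missing.append(instance["InstanceId"])
    return missing


def leftover_peering_routes(describe_route_tables_response):
    """Routes that still point at a peering connection or are in 'blackhole' state."""
    leftovers = []
    for table in describe_route_tables_response.get("RouteTables", []):
        for route in table.get("Routes", []):
            if route.get("VpcPeeringConnectionId") or route.get("State") == "blackhole":
                leftovers.append(
                    (table["RouteTableId"],
                     route.get("DestinationCidrBlock"),
                     route.get("State"))
                )
    return leftovers


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   print(instances_without_public_ip(ec2.describe_instances()))
#   print(leftover_peering_routes(ec2.describe_route_tables()))
```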
c
Thank you for trying to help. I already did all of these things you are suggesting!
All the configurations are as intended, and I deleted all the peering routes together with the peering itself.
the only thing left to do is to recreate the whole stack and see if it happens even without creating a VPC peering
if that is the case, then it would seem like the instance resources get filled up somehow and can't be freed for new batch jobs (that is my guess)
a
ah, that's a good thing to check for. Is there anything running at all? Another quirk of Batch is that if you request more resources than the instances in the compute env can provide, jobs get stuck in RUNNABLE
moreover if you have just 1 task in the queue that requests too much, it blocks the rest of the queue, even if those tasks could've been scheduled otherwise
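The two quirks described above can be pictured with a toy model: a job is schedulable only if some single instance can satisfy both its vCPU and memory request, and the first job in the queue that fits nowhere holds up everything behind it. This is a simplified sketch of that behavior, not how Batch is implemented; the class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Capacity:
    vcpus: int
    memory_mib: int


@dataclass
class JobRequest:
    name: str
    vcpus: int
    memory_mib: int


def fits(job: JobRequest, instance: Capacity) -> bool:
    """A job needs one instance that covers both its vCPU and memory request."""
    return job.vcpus <= instance.vcpus and job.memory_mib <= instance.memory_mib


def first_unschedulable(queue: List[JobRequest],
                        instances: List[Capacity]) -> Optional[JobRequest]:
    """First job in queue order that fits on no instance, or None if all fit."""
    for job in queue:
        if not any(fits(job, instance) for instance in instances):
            return job
    return None


# Two instances with 4 vCPUs / 16000 MiB each (made-up numbers).
instances = [Capacity(4, 16000), Capacity(4, 16000)]
queue = [
    JobRequest("start", 1, 4000),
    JobRequest("load_models", 8, 32000),   # larger than any single instance
    JobRequest("get_prediction", 1, 4000),
]
blocker = first_unschedulable(queue, instances)
print(blocker.name if blocker else "every job fits")
```

When comparing against real instances, keep in mind that the memory available to containers is somewhat lower than the instance type's nominal memory, so a request equal to the nominal size may not fit.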
c
well there is always only 1 job in the queue as I'm running a single flow
and the compute environment is the one from the CloudFormation template. It creates 2 instances that are up and running.
a
gotcha.. and just to confirm, they have enough cpu/mem available for that job?
c
yes they have