# ask-metaflow
q
We are self-hosting a Metaflow deployment on an EC2-based AWS Batch cluster, but it is well known that EC2 machines have slightly less memory available for containers than their nominal size, because some memory is reserved for the OS and the container agent. For example, a 4GB EC2 instance may only have 3800MB available for a container to use. This means that the ECS orchestrator will not place a step with
`@resources(memory=4096)`
into such an instance, and in the case of Batch it will simply request a larger instance size (say, one with 8GB of memory). This leads to a high degree of underutilisation of instances, so we manually specify memory sizes such as 3800 instead of 4096 to optimize placement.

I am curious how others deal with this problem, especially because I do not want my data scientists to be thinking about these details when writing down resource requirements, and also because we are now looking to use a hybrid of Fargate and EC2 compute environments, where Fargate does not accept memory values like 3800 (it only takes certain fixed sizes such as 4096), so resource requirements need to be edited before a flow can run on either. Ideally I would want the two to be seamlessly interchangeable.
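To make the problem concrete, here is a minimal sketch of the kind of adjustment we end up making by hand. The overhead figure (~296MB) and the `COMPUTE_ENV` switch are assumptions for illustration only, not Metaflow settings; the real reservation varies by instance type and ECS agent configuration:

```python
import os

from metaflow import FlowSpec, resources, step

# Both values are assumptions for illustration: the actual per-instance
# reservation depends on the instance type and ECS agent configuration,
# and COMPUTE_ENV is just a hypothetical switch, not a Metaflow setting.
ECS_OVERHEAD_MB = 296
ON_FARGATE = os.environ.get("COMPUTE_ENV", "ec2") == "fargate"


def mem(requested_mb):
    """Return the requested size as-is on Fargate (which only accepts fixed
    sizes), or shave off the assumed OS/agent overhead on EC2 so the step
    still fits on the 'natural' instance size (e.g. 4GB)."""
    return requested_mb if ON_FARGATE else requested_mb - ECS_OVERHEAD_MB


class ExampleFlow(FlowSpec):

    @resources(memory=mem(4096))  # 4096 on Fargate, 3800 on EC2
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ExampleFlow()
```

This works, but it pushes exactly the kind of infrastructure detail onto flow authors that I would like to hide from them, hence the question.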