# ask-metaflow
Hi, my production AWS ECS cluster was facing issues with a `DockerTimeoutError` yesterday; our AWS EBS Burst Balance reached 0% (image 1). Based on this thread and a couple of others I did the following:

1. Set `METAFLOW_DEFAULT_CONTAINER_REGISTRY` to `public.ecr.aws/docker/library/` (a sketch of how this takes effect follows the launch template below)
2. Moved from GP2 to GP3 with the following custom AWS Launch Template (EBS settings):
```hcl
ebs {
  volume_size           = 100
  delete_on_termination = true
  encrypted             = true
  volume_type           = "gp3"
  iops                  = 3000
  throughput            = 125
}
```
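For context, the arithmetic behind the switch, using AWS's published gp2/gp3 baselines (a back-of-the-envelope sketch, not numbers pulled from our account):

```python
# gp2 = 3 IOPS per GiB (minimum 100), bursting to 3,000 from a credit
# bucket; gp3 = a flat 3,000 IOPS and 125 MiB/s baseline at any size.
volume_gib = 100

gp2_baseline_iops = max(100, 3 * volume_gib)  # 300 IOPS once burst credits hit 0%
gp3_baseline_iops = 3_000                     # matches iops = 3000 in the template

# When Burst Balance reached 0%, the gp2 volume fell back to ~300 IOPS,
# which is what stalled the image pulls and surfaced as DockerTimeoutError.
print(f"gp2 floor: {gp2_baseline_iops} IOPS, gp3 floor: {gp3_baseline_iops} IOPS")
```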
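And for step 1, this is how I understand the registry override takes effect (illustrative only; Metaflow reads any `METAFLOW_*` setting from the launching environment or from `~/.metaflowconfig/config.json`):

```python
import os

# Metaflow picks this up at launch time and uses it to resolve the default
# task container image, so pulls go to the public ECR mirror of Docker Hub
# instead of docker.io (avoiding Docker Hub rate limits during pulls).
os.environ["METAFLOW_DEFAULT_CONTAINER_REGISTRY"] = "public.ecr.aws/docker/library/"
```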
But I am still noticing large StorageWriteBytes from a single pipeline that are probably going to cause more issues (image 2).
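For reference, image 2 comes from CloudWatch; assuming ECS Container Insights is enabled, the same numbers can be pulled like this (`my-cluster` is a placeholder for our cluster name):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="StorageWriteBytes",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=3600,  # hourly buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```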
Here is a bit of the code for the pipeline:

```python
from metaflow import FlowSpec, step, resources

# @pip and @download_nlp_libraries are custom decorators defined elsewhere
# in our codebase (sketched further down in this post).

class Flow(FlowSpec):
    @step
    def start(self):
        # self.chunks, the list of video chunks to fan out over, is set
        # here (elided).
        self.next(self.process_video, foreach="chunks")

    @resources(memory=8_000)
    @pip(
        libraries={
            "mux-python": "3.7.1",
            "opencv-python-headless": "4.6.0.66",
            "openai": "0.27.0",
            "scikit-learn": "1.2.1",
        }
    )
    @step
    def process_video(self):
        # per-chunk video processing (elided)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.extract_keywords)

    @resources(memory=32_000, cpu=4)
    @pip(
        libraries={
            "mux-python": "3.7.1",
            "opencv-python-headless": "4.6.0.66",
            "wheel": "0.38.4",
            "setuptools": "65.6.3",
            "spacy": "3.4.3",
            "nltk": "3.7",
            "keybert": "0.7.0",
            "pytextrank": "3.2.4",
            "rake_nltk": "1.0.6",
            "yake": "0.4.8",
        }
    )
    @download_nlp_libraries()
    @step
    def extract_keywords(self):
        # keyword extraction (elided)
        self.next(self.end)

    @step
    def end(self):
        pass
```
I noticed the owner of the Flow uses `@pip` more than `@conda`, even extending it to create a `@download_nlp_libraries` decorator which basically installs SpaCy and NLTK extensions such as `"stopwords"`, `"punkt"`, `"averaged_perceptron_tagger"`, `"wordnet"`, and `"omw-1.4"`.
The pipeline does a lot of processing on videos and text data, leveraging third-party APIs. I am struggling to understand what could be writing so much to storage. Is it the usage of the NLP libraries via `@pip`, especially since it's done within a `foreach`? Could creating a Docker image and using it be a solution to this?
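For completeness, my mental model of the step-level `@pip` decorator is the common community pattern sketched below (not necessarily our exact code). If ours looks like this, every task, including each `foreach` split, runs its own `pip install` inside a fresh container and writes all of those wheels and native libraries to the instance's EBS volume:

```python
import functools
import subprocess
import sys

def pip(libraries):
    """Sketch of a per-step pip decorator (common community pattern)."""
    def decorator(step_func):
        @functools.wraps(step_func)
        def wrapper(self, *args, **kwargs):
            for library, version in libraries.items():
                # Runs inside every task container, so a wide foreach
                # repeats these writes once per split.
                subprocess.run(
                    [sys.executable, "-m", "pip", "install", "--quiet",
                     f"{library}=={version}"],
                    check=True,
                )
            return step_func(self, *args, **kwargs)
        return wrapper
    return decorator
```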