# ask-metaflow
Hi, my production AWS ECS cluster was facing issues with a `DockerTimeoutError` yesterday; our AWS EBS Burst Balance reached 0% (image 1). Based on this thread and a couple of others I did the following:

1. Set `METAFLOW_DEFAULT_CONTAINER_REGISTRY` to `public.ecr.aws/docker/library/` (a sketch of how this takes effect follows the launch template below)
2. Moved from GP2 to GP3 with the following custom AWS Launch Template (EBS settings):
```hcl
ebs {
  volume_size           = 100
  delete_on_termination = true
  encrypted             = true
  volume_type           = "gp3"
  iops                  = 3000
  throughput            = 125
}
```
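For context, the arithmetic behind the switch, using AWS's published gp2/gp3 baselines (a back-of-the-envelope sketch, not numbers pulled from our account):

```python
# gp2 = 3 IOPS per GiB (minimum 100), bursting to 3,000 from a credit
# bucket; gp3 = a flat 3,000 IOPS and 125 MiB/s baseline at any size.
volume_gib = 100

gp2_baseline_iops = max(100, 3 * volume_gib)  # 300 IOPS once burst credits hit 0%
gp3_baseline_iops = 3_000                     # matches iops = 3000 in the template

# When Burst Balance reached 0%, the gp2 volume fell back to ~300 IOPS,
# which is what stalled the image pulls and surfaced as DockerTimeoutError.
print(f"gp2 floor: {gp2_baseline_iops} IOPS, gp3 floor: {gp3_baseline_iops} IOPS")
```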
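And for step 1, this is how I understand the registry override takes effect (illustrative only; Metaflow reads any `METAFLOW_*` setting from the launching environment or from `~/.metaflowconfig/config.json`):

```python
import os

# Metaflow picks this up at launch time and uses it to resolve the default
# task container image, so pulls go to the public ECR mirror of Docker Hub
# instead of docker.io (avoiding Docker Hub rate limits during pulls).
os.environ["METAFLOW_DEFAULT_CONTAINER_REGISTRY"] = "public.ecr.aws/docker/library/"
```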
But I am still noticing large StorageWriteBytes from a single pipeline that are probably going to cause more issues (image 2).
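For reference, image 2 comes from CloudWatch; assuming ECS Container Insights is enabled, the same numbers can be pulled like this (`my-cluster` is a placeholder for our cluster name):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="StorageWriteBytes",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=3600,  # hourly buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```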
Here is a bit of the code for the pipeline:

```python
from metaflow import FlowSpec, step, resources

# @pip and @download_nlp_libraries are custom decorators defined elsewhere
# in our codebase (sketched further down in this post).

class Flow(FlowSpec):
    @step
    def start(self):
        # self.chunks, the list of video chunks to fan out over, is set
        # here (elided).
        self.next(self.process_video, foreach="chunks")

    @resources(memory=8_000)
    @pip(
        libraries={
            "mux-python": "3.7.1",
            "opencv-python-headless": "4.6.0.66",
            "openai": "0.27.0",
            "scikit-learn": "1.2.1",
        }
    )
    @step
    def process_video(self):
        # per-chunk video processing (elided)
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.extract_keywords)

    @resources(memory=32_000, cpu=4)
    @pip(
        libraries={
            "mux-python": "3.7.1",
            "opencv-python-headless": "4.6.0.66",
            "wheel": "0.38.4",
            "setuptools": "65.6.3",
            "spacy": "3.4.3",
            "nltk": "3.7",
            "keybert": "0.7.0",
            "pytextrank": "3.2.4",
            "rake_nltk": "1.0.6",
            "yake": "0.4.8",
        }
    )
    @download_nlp_libraries()
    @step
    def extract_keywords(self):
        # keyword extraction (elided)
        self.next(self.end)

    @step
    def end(self):
        pass
```
I noticed the owner of the Flow uses `@pip` more than `@conda`, even extending it to create a `@download_nlp_libraries` decorator which basically installs SpaCy and NLTK extensions such as `"stopwords"`, `"punkt"`, `"averaged_perceptron_tagger"`, `"wordnet"`, and `"omw-1.4"`.
The pipeline does a lot of processing on videos and text data, leveraging third-party APIs. I am struggling to understand what could be writing so much to storage. Is it the usage of the NLP libraries via `@pip`, especially since it's done within a `foreach`? Could creating a Docker image and using it be a solution to this?
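For completeness, my mental model of the step-level `@pip` decorator is the common community pattern sketched below (not necessarily our exact code). If ours looks like this, every task, including each `foreach` split, runs its own `pip install` inside a fresh container and writes all of those wheels and native libraries to the instance's EBS volume:

```python
import functools
import subprocess
import sys

def pip(libraries):
    """Sketch of a per-step pip decorator (common community pattern)."""
    def decorator(step_func):
        @functools.wraps(step_func)
        def wrapper(self, *args, **kwargs):
            for library, version in libraries.items():
                # Runs inside every task container, so a wide foreach
                # repeats these writes once per split.
                subprocess.run(
                    [sys.executable, "-m", "pip", "install", "--quiet",
                     f"{library}=={version}"],
                    check=True,
                )
            return step_func(self, *args, **kwargs)
        return wrapper
    return decorator
```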