# ask-metaflow
Hello! I am having an issue that looks like memory leakage when running the Flow remotely on AWS Batch. More specifically, storing data as an artifact, such as `self.df = df`, seems to spike the memory usage! I have already increased `@resources` to use 30 GB of memory and it is still failing with an `OutOfMemory` error. Funny enough, I can run the flow locally and it almost bricks my computer, but it does complete fine. The laptop only has 16 GB of RAM and had other programs running, such as Chrome, Teams, Outlook, etc. Note: `dataframe.info()` shows usage of about 3 GB for ~7 million rows with 100+ columns. It feels like storing the artifact is duplicating the memory instead of just pointing to the variable reference. Am I missing something? Is there a best practice/pattern to follow? My flow is defined as the following steps:

1. Start: `pandas.read_sql` with a simple `select * from …` against Snowflake
2. Transform: applies additional feature engineering, such as one-hot encoding (`get_dummies`) -> it runs fine, but runs out of memory on the last part, where I store the result as a class attribute to be used in subsequent steps
3. Train models in parallel
4. Select the best model and store it as an artifact
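On the narrow question of whether `self.df = df` itself copies the data: in plain Python, attribute assignment only binds a new reference to the same object, so that line alone cannot duplicate the dataframe. A minimal stdlib sketch (using a list as a stand-in for the dataframe):

```python
class Step:
    """Stand-in for a flow step object; not Metaflow's actual class."""
    pass

# A large-ish object standing in for the dataframe.
df = list(range(1_000_000))

step = Step()
step.df = df  # binds a reference; no element is copied

# Both names point at the exact same object in memory.
print(step.df is df)  # → True
```

Any duplication would therefore have to happen later, when the artifact is serialized for persistence rather than at assignment time.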
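The suspected duplication is real in one sense: persisting an artifact requires serializing it, and while the serialized bytes are being built, the original object and its serialized copy coexist in memory, so peak usage is roughly "object + serialized form". A small stdlib illustration of that serialization cost (this is a generic pickling demo, not a claim about Metaflow's exact internals):

```python
import pickle
import sys

# Stand-in for a large artifact: a list of 1M floats.
data = [float(i) for i in range(1_000_000)]

# Rough in-memory size: the list's pointer array plus each float object.
obj_size = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# Serializing builds a second, full copy of the data as bytes.
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Until `blob` is written out and freed, both copies are alive at once.
print(f"in-memory ~{obj_size / 1e6:.0f} MB, pickled ~{len(blob) / 1e6:.0f} MB")
```

This is why a job whose dataframe reports ~3 GB can still need far more headroom at the moment artifacts are stored.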