# ask-metaflow
b
Hi Metaflow team, I wanted to reach out to better understand a discrepancy we’re seeing during model training with Metaflow between running directly on an EC2 instance and running on EC2 via AWS Batch. Using the same CPU/memory on an `r6i.32xlarge` and running the same code, one run on Batch and one on the local EC2 instance, we’re seeing that the runtime of the model training (XGBoost) is significantly different, and we wanted to see if the Outerbounds team had any insight into why. These are the training time results when running the training flow on the EC2 instance without Batch:
```
n_jobs = 64

finish loading from csv
loading data end, start to boost trees
finish training, took 0:01:04.146891
finish training, took 0:01:02.424279
finish training, took 0:01:01.467109
finish training, took 0:00:58.373016
finish training, took 0:00:59.026629

=================================

n_jobs = 128

finish loading from csv
loading data end, start to boost trees
finish training, took 0:00:58.847973
finish training, took 0:00:56.754532
finish training, took 0:01:04.924065
finish training, took 0:01:02.561990
finish training, took 0:00:59.712103
```
These are the training time results when running on batch:
```
n_jobs = 64

finish loading from csv
loading data end, start to boost trees
finish training, took 0:01:27.271721
finish training, took 0:01:24.243834
finish training, took 0:01:24.683689
finish training, took 0:01:24.754809
finish training, took 0:01:27.739151

=================================

n_jobs = 128

finish loading from csv
loading data end, start to boost trees
finish training, took 0:01:15.287833
finish training, took 0:01:15.704936
finish training, took 0:01:16.092436
finish training, took 0:01:16.177293
finish training, took 0:01:17.530019
```
we see roughly a 20-30 second increase in training time per run when using Batch vs. running locally. Is this typical and expected? Thanks! p.s.: we ran this using a simple flow on open-source data, so we can share the flow and the data CSV if needed to reproduce the results
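for reference, a minimal sketch of the kind of timing harness that would produce logs like the above; it assumes XGBoost's sklearn-style API, and the file path, target column, and parameters are placeholders rather than the actual flow:

```python
# Hypothetical timing harness (not the original flow): load a CSV once,
# then fit the same XGBoost model several times and print the wall-clock
# time per fit, mirroring the log format above.
from datetime import datetime

import pandas as pd
from xgboost import XGBRegressor

df = pd.read_csv("train.csv")                      # placeholder path
print("finish loading from csv")
X, y = df.drop(columns=["target"]), df["target"]   # placeholder target column
print("loading data end, start to boost trees")

for _ in range(5):
    start = datetime.now()
    XGBRegressor(n_estimators=100, n_jobs=64).fit(X, y)  # rerun with n_jobs=128 to compare
    print(f"finish training, took {datetime.now() - start}")
```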
a
@bulky-portugal-95315 what resources were allocated to the Metaflow job? there are multiple possible reasons for any slowdown or speedup. when running locally on EC2, was it a Metaflow flow that was being executed?
b
correct, both were the same Metaflow flow, one `--with batch` and one without that arg. the one running without Batch was on an `r6i.32xlarge` with 128 vCPUs / 1024 GB memory. the Batch resource request was cpu=128 / memory=800000 (MB, so roughly 800 GB)
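for completeness, a rough sketch of how the two runs could be set up; the flow and step names are illustrative, but `@resources` and `--with batch` are standard Metaflow:

```python
from metaflow import FlowSpec, resources, step

class TrainFlow(FlowSpec):  # illustrative name

    # Metaflow's memory attribute is in MB, so 800000 ~= 800 GB when run --with batch;
    # a plain local run just uses whatever the EC2 host provides.
    @resources(cpu=128, memory=800000)
    @step
    def start(self):
        # ... the xgboost training from the snippet above would go here ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```

the local run would be `python train_flow.py run`, the Batch run `python train_flow.py run --with batch`.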
a
there is a difference in memory allocated for the @batch job. while AWS Batch allows for bursting resources, that depends on other workloads running on the same instance. also, were you able to ensure that you have the same dependencies installed for your workloads on EC2 and @batch?
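one way to guarantee dependency parity between the local EC2 run and the @batch run is to pin the environment in the flow itself, e.g. with Metaflow's `@conda_base` (the versions below are placeholders):

```python
from metaflow import FlowSpec, conda_base, step

# Pin the interpreter and library versions so the local run and the Batch run
# resolve an identical environment; versions here are placeholders.
@conda_base(python="3.10.12", libraries={"xgboost": "1.7.6", "pandas": "2.0.3"})
class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```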
h
you might also want to check that the shared memory settings are the same on EC2 and Batch (IIRC there's a big difference with the defaults)
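one quick way to compare is to print the size of `/dev/shm` from inside the training step both locally and `--with batch` (Batch jobs run in containers, which typically default to a much smaller `/dev/shm` than a bare EC2 host):

```python
import shutil

# Report /dev/shm capacity; run this inside the step in both environments
# and compare the numbers.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 1e9:.1f} GB, free: {free / 1e9:.1f} GB")
```

if the Batch number is much smaller, IIRC the `shared_memory` attribute on `@batch` (in MB) can be used to raise it.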
b
yeah, i wouldn’t expect memory allocation to drastically affect model training time here though, since training is more CPU-heavy. the same dependencies are installed for both, on a new EC2 instance (clean slate)
@hundreds-rainbow-67050 oh that’s interesting, was not aware there was a difference in shared memory settings between the two, but in this case, since we’re only running one process, does the shared memory have a big effect on workloads?
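since the workload is CPU-bound, it may also be worth confirming how many CPUs the process can actually use inside the Batch container vs. on the bare EC2 host, e.g.:

```python
import os

# Compare these between the local EC2 run and the Batch run:
# cpu_count() is what the host/container reports, sched_getaffinity() is the
# set of CPUs this process is actually allowed to run on (Linux only).
print("cpu_count:", os.cpu_count())
print("usable cpus:", len(os.sched_getaffinity(0)))
```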