bulky-portugal-95315
10/08/2024, 6:14 PM
r6i.32xlarge, and running the same code, one run on batch and one run on local EC2, we’re seeing that the runtime of the model training (xgboost) is significantly different and wanted to see if the Outerbounds team had any information as to why we might be seeing this.
These are the training time results when running the training flow on the EC2 without batch:
n_jobs = 64
finish loading from csv
loading data end, start to boost trees
finish training, took 0:01:04.146891
finish training, took 0:01:02.424279
finish training, took 0:01:01.467109
finish training, took 0:00:58.373016
finish training, took 0:00:59.026629
=================================
n_jobs = 128
finish loading from csv
loading data end, start to boost trees
finish training, took 0:00:58.847973
finish training, took 0:00:56.754532
finish training, took 0:01:04.924065
finish training, took 0:01:02.561990
finish training, took 0:00:59.712103
These are the training time results when running on batch:
n_jobs = 64
finish loading from csv
loading data end, start to boost trees
finish training, took 0:01:27.271721
finish training, took 0:01:24.243834
finish training, took 0:01:24.683689
finish training, took 0:01:24.754809
finish training, took 0:01:27.739151
=================================
n_jobs = 128
finish loading from csv
loading data end, start to boost trees
finish training, took 0:01:15.287833
finish training, took 0:01:15.704936
finish training, took 0:01:16.092436
finish training, took 0:01:16.177293
finish training, took 0:01:17.530019
we see roughly a 15-25 second increase in training time when using batch vs when running locally. Is this typical and expected? Thanks!
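To put a number on the gap, the "finish training, took ..." durations pasted above can be averaged per configuration. This is a minimal sketch, not part of the original flow: the `timings` dictionary simply copies the logged values, and the names `parse_duration`, `timings`, and `means` are made up for illustration.

```python
from datetime import timedelta
from statistics import mean


def parse_duration(text: str) -> float:
    """Convert an 'H:MM:SS.ffffff' string (as printed in the logs) to seconds."""
    hours, minutes, seconds = text.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


# Durations copied from the logs above, keyed by (environment, n_jobs).
timings = {
    ("local", 64): ["0:01:04.146891", "0:01:02.424279", "0:01:01.467109",
                    "0:00:58.373016", "0:00:59.026629"],
    ("local", 128): ["0:00:58.847973", "0:00:56.754532", "0:01:04.924065",
                     "0:01:02.561990", "0:00:59.712103"],
    ("batch", 64): ["0:01:27.271721", "0:01:24.243834", "0:01:24.683689",
                    "0:01:24.754809", "0:01:27.739151"],
    ("batch", 128): ["0:01:15.287833", "0:01:15.704936", "0:01:16.092436",
                     "0:01:16.177293", "0:01:17.530019"],
}

# Mean training time in seconds for each (environment, n_jobs) pair.
means = {key: mean([parse_duration(t) for t in runs]) for key, runs in timings.items()}

for n_jobs in (64, 128):
    local = means[("local", n_jobs)]
    batch = means[("batch", n_jobs)]
    slowdown = (batch / local - 1) * 100
    print(f"n_jobs={n_jobs}: local {local:.1f}s, batch {batch:.1f}s, "
          f"gap {batch - local:.1f}s ({slowdown:.0f}%)")
```

On these numbers the batch runs come out about 24.7 s slower (around 40%) at n_jobs = 64 and about 15.6 s slower (around 26%) at n_jobs = 128.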
p.s.: we ran this using a simple flow using open source data, so we can share the flow and data CSV if needed to reproduce the results
ancient-application-36103
10/08/2024, 6:21 PM
bulky-portugal-95315
10/08/2024, 6:23 PM
--with batch and one without that arg.
the one running without batch was on a 32xlarge with 128 vcpu/1024GB memory. batch resource request was 128/800000
ancient-application-36103
10/08/2024, 6:27 PM
hundreds-rainbow-67050
10/08/2024, 6:53 PM
bulky-portugal-95315
10/08/2024, 7:59 PM
bulky-portugal-95315
10/08/2024, 7:59 PM