Hey all I was investigation some metaflow related performanc Outerbounds #dev-metaflow

Hey all, I was investigation some metaflow related...

quick-lighter-52296

01/16/2023, 11:23 AM

Hey all, I was investigation some metaflow related performance issues in the UI backend, when I enabled performance insights and slow query logging on the RDS instance. I am finding that many queries are defaulting to a full table sequential scan resulting in absurd run times (order of minutes) for our DB size. My investigation is underway but one of the key findings so far is that the migration

20211202100726_add_str_id_indices.sql

added indexes on

run_id

column (which is good) but with the

WHERE run_id IS NOT NULL AND task_name IS NOT NULL

clauses everywhere. This is confusing the postgresql query planner when the query doesn’t explicitly constrain either of

run_id

task_name

to be non-null. Postgresql planner thus assumes them to be nullable (as in their schema), and refuses to use the index (which will only give non-null values). Example query (this is not fabricated but a real query on the database), feel free to verify using

EXPLAIN ANALYZE

Copy code

SELECT * FROM (
                SELECT
                    flow_id,run_number,run_id,step_name,task_id,task_name,name,location,ds_type,sha,type,content_type,user_name,attempt_id,ts_epoch,tags,system_tags
                FROM artifact_v3
                
            ) T
            WHERE flow_id = ? AND run_id = ? AND ("ds_type" = ?)
            
            LIMIT ?

The query planner:

Copy code

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..400451.09 rows=1 width=495)
   ->  Gather  (cost=1000.00..400451.09 rows=1 width=495)
         Workers Planned: 2
         ->  Parallel Seq Scan on artifact_v3  (cost=0.00..399450.99 rows=1 width=495)
               Filter: (((flow_id)::text = 'Billing'::text) AND ((run_id)::text = 'sfn-da4553c4-3513-44ca-a731-e2e0f65c8db9'::text) AND ((ds_type)::text = 'local'::text))
(5 rows)

Indeed there are a number of such queries to all tables, grinding the DB to slog. Is there any permanent solution in our dev pipeline?

square-wire-39606

01/16/2023, 3:16 PM

Good observation. The

NOT NULL

clauses shouldn't be needed given that the table schema already marks these columns as non nullable. It should be easy to verify the impact of the index if we remove the null constraint. Would you be open to contributing a PR (dropping and rebuilding

tasks_v3_idx_flow_id_run_id_step_name_task_name

runs_v3_idx_str_ids_primary_key

steps_v3_idx_str_ids_primary_key

(this can be dropped entirely actually but that's a conversation for another day),

metadata_v3_idx_str_ids_primary_key

artifact_v3_idx_str_ids_primary_key

quick-lighter-52296

01/17/2023, 5:30 AM

The
NOT NULL
clauses shouldn’t be needed given that the table schema already marks these columns as non nullable

Good point, I didn’t notice that. Then it’s somewhat weird why the query planner would refuse to use the index.

Would you be open to contributing a PR

Yes! I would be happy to raise a PR for this today, do you want me to create a Github issue first, or does a PR directly work?

quick-lighter-52296

01/18/2023, 7:33 AM

@square-wire-39606 I have a few more follow-up questions with the index schema design regarding sub-optimality on large table sizes. Hoping you can shed some light on some of these - 1. Looking at

metadata_v3

table the index

metadata_v3_idex_str_ids_primary_key

seem to also include

id

which ruins the usefulness of the entire index because

id

is effectively unique for this table per row. Also you earlier made the comment that

task_name

and

run_id

are non-nullable in the schema but it’s not the case schema-wise (in practice this may well be the case though I am not the authority here to comment on this), do you still agree to drop the

WHERE

constraints from the index? I think we should, and also re-write the

metadata_v3

index to NOT include

id

quick-lighter-52296

01/18/2023, 7:43 AM

2. In tables such as

artifact_v3

and

metadata_v3

there are 2 columns:

task_name

and

task_id

. I am not sure what’s the difference between the 2, but I’ve seen some queries referencing former whereas some the latter in their constraints. Obviously the index is either on one or the other. What do you think could be the best way here? Originally I thought this is a non-issue as the filter

(flow_id, run_id, step_name)

should already narrow down results to a reasonable number until I ran some queries on our prod metaflow RDS instance only to discover there are many such combinations which have upwards of 20k artifacts per step (step is a for-each step so first it fans out into ~1000 tasks each of which write 30-40 artifacts each).

2 Views

Open in Slack

Previous Next