Hey all, I was investigation some metaflow related...
# dev-metaflow
q
Hey all, I was investigation some metaflow related performance issues in the UI backend, when I enabled performance insights and slow query logging on the RDS instance. I am finding that many queries are defaulting to a full table sequential scan resulting in absurd run times (order of minutes) for our DB size. My investigation is underway but one of the key findings so far is that the migration
20211202100726_add_str_id_indices.sql
added indexes on
run_id
column (which is good) but with the
WHERE run_id IS NOT NULL AND task_name IS NOT NULL
clauses everywhere. This is confusing the postgresql query planner when the query doesn’t explicitly constrain either of
run_id
or
task_name
to be non-null. Postgresql planner thus assumes them to be nullable (as in their schema), and refuses to use the index (which will only give non-null values). Example query (this is not fabricated but a real query on the database), feel free to verify using
EXPLAIN ANALYZE
:
Copy code
SELECT * FROM (
                SELECT
                    flow_id,run_number,run_id,step_name,task_id,task_name,name,location,ds_type,sha,type,content_type,user_name,attempt_id,ts_epoch,tags,system_tags
                FROM artifact_v3
                
            ) T
            WHERE flow_id = ? AND run_id = ? AND ("ds_type" = ?)
            
            LIMIT ?
The query planner:
Copy code
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1000.00..400451.09 rows=1 width=495)
   ->  Gather  (cost=1000.00..400451.09 rows=1 width=495)
         Workers Planned: 2
         ->  Parallel Seq Scan on artifact_v3  (cost=0.00..399450.99 rows=1 width=495)
               Filter: (((flow_id)::text = 'Billing'::text) AND ((run_id)::text = 'sfn-da4553c4-3513-44ca-a731-e2e0f65c8db9'::text) AND ((ds_type)::text = 'local'::text))
(5 rows)
Indeed there are a number of such queries to all tables, grinding the DB to slog. Is there any permanent solution in our dev pipeline?
s
Good observation. The
NOT NULL
clauses shouldn't be needed given that the table schema already marks these columns as non nullable. It should be easy to verify the impact of the index if we remove the null constraint. Would you be open to contributing a PR (dropping and rebuilding
tasks_v3_idx_flow_id_run_id_step_name_task_name
,
runs_v3_idx_str_ids_primary_key
,
steps_v3_idx_str_ids_primary_key
(this can be dropped entirely actually but that's a conversation for another day),
metadata_v3_idx_str_ids_primary_key
,
artifact_v3_idx_str_ids_primary_key
?
q
The
NOT NULL
clauses shouldn’t be needed given that the table schema already marks these columns as non nullable
Good point, I didn’t notice that. Then it’s somewhat weird why the query planner would refuse to use the index.
Would you be open to contributing a PR
Yes! I would be happy to raise a PR for this today, do you want me to create a Github issue first, or does a PR directly work?
@square-wire-39606 I have a few more follow-up questions with the index schema design regarding sub-optimality on large table sizes. Hoping you can shed some light on some of these - 1. Looking at
metadata_v3
table the index
metadata_v3_idex_str_ids_primary_key
seem to also include
id
which ruins the usefulness of the entire index because
id
is effectively unique for this table per row. Also you earlier made the comment that
task_name
and
run_id
are non-nullable in the schema but it’s not the case schema-wise (in practice this may well be the case though I am not the authority here to comment on this), do you still agree to drop the
WHERE
constraints from the index? I think we should, and also re-write the
metadata_v3
index to NOT include
id
2. In tables such as
artifact_v3
and
metadata_v3
there are 2 columns:
task_name
and
task_id
. I am not sure what’s the difference between the 2, but I’ve seen some queries referencing former whereas some the latter in their constraints. Obviously the index is either on one or the other. What do you think could be the best way here? Originally I thought this is a non-issue as the filter
(flow_id, run_id, step_name)
should already narrow down results to a reasonable number until I ran some queries on our prod metaflow RDS instance only to discover there are many such combinations which have upwards of 20k artifacts per step (step is a for-each step so first it fans out into ~1000 tasks each of which write 30-40 artifacts each).