user
05/31/2023, 3:23 AMdb.m5.2xlarge
instance. The following query on a 200 million row artifact_v3
takes up to four minutes to run:
SELECT * FROM ( SELECT flow_id,run_number,run_id,step_name,task_id,task_name,name,location,ds_type,sha,type,content_type,user_name,attempt_id,ts_epoch,tags,system_tags FROM artifact_v3 ) T WHERE flow_id = '[REDACTED]' AND run_id = '[REDACTED]' AND ("ds_type" = 'local') LIMIT 1;
While acknowledging that searching 200 million rows would be potentially slow, we’ve noticed via EXPLAIN
and EXPLAIN ANALYZE
that the query does not actually make use of the indexes defined on artifacts_v3
. There is an index:
"artifact_v3_idx_str_ids_primary_key" btree (flow_id, run_id, step_name, task_name, attempt_id, name) WHERE run_id IS NOT NULL AND task_name IS NOT NULL
but for PostgreSQL to use the index we need to add AND task_name IS NOT NULL
to the query.
In general, would we ever expect either run_id
or task_name
to be null?