public inbox for [email protected]
help / color / mirror / Atom feedFrom: Willow Chargin <[email protected]>
To: [email protected]
Subject: Functionally dependent columns in SELECT DISTINCT
Date: Thu, 12 Sep 2024 22:20:10 -0700
Message-ID: <CAALRJs5ne=gPYG=FdeY-G0p9QyjXHhPhAyAsJ_evdCwG3vuhug@mail.gmail.com> (raw)
Hello! Postgres lets us omit columns from a GROUP BY clause if they are
functionally dependent on a grouped key, which is a nice quality-of-life
feature. I'm wondering if a similar relaxation could be permitted for
the SELECT DISTINCT list?
I have a query where I want to find the most recent few items from a
table that match some complex condition, where the condition involves
joining other tables. Here's an example, with two approaches:
-- Store some data: an "item" has one or more "parts".
CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz);
CREATE TABLE parts(part_id int PRIMARY KEY, item_id int);
INSERT INTO items(id, create_time)
SELECT i, now() - make_interval(secs => i)
FROM generate_series(1, 1000000) s(i);
INSERT INTO parts(item_id, part_id)
SELECT items.id, 2 * items.id + delta
FROM items, (VALUES(0), (1)) delta(delta);
CREATE INDEX ON items(create_time DESC);
CREATE INDEX ON parts(item_id);
ANALYZE items, parts;
-- Suppose we want to find the most recent few items with a part
-- whose part ID is threeven. Two approaches:
-- SELECT DISTINCT: fast, but superfluous column:
EXPLAIN ANALYZE
SELECT DISTINCT items.id, create_time
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
ORDER BY create_time DESC
LIMIT 5;
-- 4ms:
-- parallel index scan on items_create_time_idx
-- -> nested loop index scan parts_item_id_idx
-- -> incremental sort -> gather merge -> unique -> limit
-- GROUP BY: slow, but functional dependency recognized:
EXPLAIN ANALYZE
SELECT items.id
FROM items JOIN parts ON items.id = parts.item_id
WHERE part_id % 3 = 0
GROUP BY items.id
ORDER BY create_time DESC
LIMIT 5;
-- 400ms:
-- parallel seq scan on parts
-- -> parallel hash join on item_id via seq scan on items
-- -> sort -> group -> gather merge -> group -> sort -> limit
These timings are Postgres 14.5 on a Linux i7-1165G7. With Postgres 16.3
on an Apple M3 Pro, the shape is the same: the GROUP BY is about 300ms,
and the SELECT DISTINCT is way faster still, at 0.07ms. (It declines to
parallelize, which seems to help.)
I want to use the faster approach, and it works without issue so far.
But that extra column in the SELECT list is a bit inconvenient.
My questions are:
- Do I understand right that these kinds of queries are equivalent?
- If so, does the SQL standard permit Postgres to recognize functional
dependency in this case, so that users may omit the order column
column from the `SELECT DISTINCT` list? (I don't have a copy of the
standard to check myself.)
- Might future versions of Postgres allow this?
thanks!
~Willow
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected]
Subject: Re: Functionally dependent columns in SELECT DISTINCT
In-Reply-To: <CAALRJs5ne=gPYG=FdeY-G0p9QyjXHhPhAyAsJ_evdCwG3vuhug@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox