Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1soyjQ-00FJhR-8C for pgsql-general@arkaria.postgresql.org; Fri, 13 Sep 2024 05:20:33 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1soyjP-00BM7e-62 for pgsql-general@arkaria.postgresql.org; Fri, 13 Sep 2024 05:20:31 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1soyjO-00BM7V-QM for pgsql-general@lists.postgresql.org; Fri, 13 Sep 2024 05:20:30 +0000 Received: from mail-lf1-f54.google.com ([209.85.167.54]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1soyjH-000udV-P1 for pgsql-general@postgresql.org; Fri, 13 Sep 2024 05:20:29 +0000 Received: by mail-lf1-f54.google.com with SMTP id 2adb3069b0e04-5365c512b00so2219641e87.3 for ; Thu, 12 Sep 2024 22:20:23 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726204822; x=1726809622; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=7TWa3ZOz2ycmTty+jmzQRNzIvI+Y1hFXSfCYg7FEPk0=; b=Zg1bxmRALGwAr3XHxiVaRQlCyVTYjExkcTUugmpeN0Ok3u2vhz5/pJDtiKtWcUxfKm BcNah+8/74lzuIaaGuhN1Le+kbIb3UKk7FpG+gWqvcIJZxNs+MpqtQVjJKthknfPqu2h 9B3dU8yH6QNhVX00LMAfHxaNtGXJ4jLcaHaQnNh662LPD2xTPC0lgH5m7BvcyIMhuH2e IWvj0fFNxLVnjKXrqA5silMSdukK7bIItyWcsXXFp3ncht6Knb0bwFjpuZbPG58ASZKl tvVPFxKXrzWBKr5CjdyUV3NYuHpMb9lggCInesuKkZQU4WY2vjmAGZ0GuSS3MLx6HKUQ j8eg== X-Gm-Message-State: AOJu0Yyu9vy64aCXKUaNAEB108sHEtLCNuxq7t0UfqmiuibRtGFHVRRa TMTvPILbpfG9/2p0gtCAelk/ksHuFBvj6ClmZWeb0qBfRwy/4L38j3+6Y7BgzmlQQSHXYiPa6GC MGteG6d1lUwsKy1hPDHLuts9+ri2xma9GVjFGbYBo1HY7DQoltC0= X-Google-Smtp-Source: AGHT+IE3qik+DYWWNKNEANmbOyx+H4EI9Gt0ufbo07a/1gaTFVhr3dA9WYLteWy4wSkdk5K4Z38aFiqkjkAfv7piBe0= X-Received: by 2002:a05:6512:b1e:b0:530:da96:a986 with SMTP id 2adb3069b0e04-53678feb0eemr3388583e87.47.1726204821120; Thu, 12 Sep 2024 22:20:21 -0700 (PDT) MIME-Version: 1.0 From: Willow Chargin Date: Thu, 12 Sep 2024 22:20:10 -0700 Message-ID: Subject: Functionally dependent columns in SELECT DISTINCT To: pgsql-general@postgresql.org Content-Type: text/plain; charset="UTF-8" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hello! Postgres lets us omit columns from a GROUP BY clause if they are functionally dependent on a grouped key, which is a nice quality-of-life feature. I'm wondering if a similar relaxation could be permitted for the SELECT DISTINCT list? I have a query where I want to find the most recent few items from a table that match some complex condition, where the condition involves joining other tables. Here's an example, with two approaches: -- Store some data: an "item" has one or more "parts". CREATE TABLE items(id int PRIMARY KEY, create_time timestamptz); CREATE TABLE parts(part_id int PRIMARY KEY, item_id int); INSERT INTO items(id, create_time) SELECT i, now() - make_interval(secs => i) FROM generate_series(1, 1000000) s(i); INSERT INTO parts(item_id, part_id) SELECT items.id, 2 * items.id + delta FROM items, (VALUES(0), (1)) delta(delta); CREATE INDEX ON items(create_time DESC); CREATE INDEX ON parts(item_id); ANALYZE items, parts; -- Suppose we want to find the most recent few items with a part -- whose part ID is threeven. Two approaches: -- SELECT DISTINCT: fast, but superfluous column: EXPLAIN ANALYZE SELECT DISTINCT items.id, create_time FROM items JOIN parts ON items.id = parts.item_id WHERE part_id % 3 = 0 ORDER BY create_time DESC LIMIT 5; -- 4ms: -- parallel index scan on items_create_time_idx -- -> nested loop index scan parts_item_id_idx -- -> incremental sort -> gather merge -> unique -> limit -- GROUP BY: slow, but functional dependency recognized: EXPLAIN ANALYZE SELECT items.id FROM items JOIN parts ON items.id = parts.item_id WHERE part_id % 3 = 0 GROUP BY items.id ORDER BY create_time DESC LIMIT 5; -- 400ms: -- parallel seq scan on parts -- -> parallel hash join on item_id via seq scan on items -- -> sort -> group -> gather merge -> group -> sort -> limit These timings are Postgres 14.5 on a Linux i7-1165G7. With Postgres 16.3 on an Apple M3 Pro, the shape is the same: the GROUP BY is about 300ms, and the SELECT DISTINCT is way faster still, at 0.07ms. (It declines to parallelize, which seems to help.) I want to use the faster approach, and it works without issue so far. But that extra column in the SELECT list is a bit inconvenient. My questions are: - Do I understand right that these kinds of queries are equivalent? - If so, does the SQL standard permit Postgres to recognize functional dependency in this case, so that users may omit the order column column from the `SELECT DISTINCT` list? (I don't have a copy of the standard to check myself.) - Might future versions of Postgres allow this? thanks! ~Willow