From: Tom Lane
To: David Rowley
cc: ma lz, "pgsql-general@postgresql.org"
Subject: Re: Why not do distinct before SetOp
Date: Mon, 04 Nov 2024 10:18:04 -0500
Message-ID: <2631313.1730733484@sss.pgh.pa.us>

David Rowley writes:
> On Mon, 4 Nov 2024 at 22:52, ma lz wrote:
>> select distinct a from t1 intersect select distinct a from t1; -- this is faster than origin sql

> No, the planner does not attempt that optimisation. INTERSECT really
> isn't very well optimised.

It's not really obvious to me why adding DISTINCT would make it faster.
Seems like having two layers of plan nodes checking for duplicate rows
ought to be a loss. Maybe we need to do some micro-optimization in or
near LookupTupleHashEntry.

A different idea that occurred to me while looking at this is: why have
we got all this machinery to add and check a flag column, rather than
arranging things so that the two input relations are "outer" and "inner"
children of the SetOp? It's possible some of the performance difference
reported here is due to having to pass more tuples through the
SubqueryScan node (with its projection to add the flag) and Append node,
but we could remove those steps entirely.

> If we did want to improve this area, I think the first thing we'd want
> to do is use standard join types rather than HashSetOp Intersect to
> implement INTERSECT (without ALL). To do that efficiently, we'd need
> to do a bit more work on the standard join types to have them
> efficiently support IS NOT DISTINCT FROM clauses as the join keys.

Maybe. It'd be a big project, but we do get complaints every so often
about IS NOT DISTINCT FROM predicates not being efficient, so the
benefits would be wider than just INTERSECT.

			regards, tom lane
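
For readers following the join-based idea above, here is a minimal sketch of why the join keys would need IS NOT DISTINCT FROM semantics rather than plain equality. It is only an illustration under stated assumptions: it runs against SQLite through Python's sqlite3 module, not PostgreSQL, and uses SQLite's null-safe `IS` operator as a stand-in for IS NOT DISTINCT FROM. The second table `t2` and all row values are hypothetical, and nothing here says anything about how the PostgreSQL planner behaves or performs.

```python
import sqlite3

# In-memory SQLite database; t1/t2 and their contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (a INTEGER);
    INSERT INTO t1 VALUES (1), (1), (2), (NULL), (NULL), (3);
    INSERT INTO t2 VALUES (1), (2), (2), (NULL), (4);
""")

def rows(sql):
    # Return the single result column as a list, NULL (None) first,
    # so results can be compared regardless of row order.
    return sorted((r[0] for r in conn.execute(sql)),
                  key=lambda v: (v is not None, v or 0))

# INTERSECT (without ALL): duplicates removed, NULLs treated as equal.
set_op = rows("SELECT a FROM t1 INTERSECT SELECT a FROM t2")

# Semi-join form with a null-safe comparison (SQLite's IS operator here).
semi_join = rows("""
    SELECT DISTINCT t1.a FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a IS t1.a)
""")

# Same semi-join with plain equality: the NULL row is lost.
plain_equals = rows("""
    SELECT DISTINCT t1.a FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a)
""")

print(set_op, semi_join, plain_equals)
```

The first two queries agree, while the plain-equality version drops the NULL row, which is the reason an ordinary equijoin cannot simply be substituted for the set operation.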