public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Why not do distinct before SetOp
4+ messages / 2 participants
[nested] [flat]

* Re: Why not do distinct before SetOp
@ 2024-11-04 15:18  Tom Lane <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Tom Lane @ 2024-11-04 15:18 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: ma lz <[email protected]>; pgsql-general

David Rowley <[email protected]> writes:
> On Mon, 4 Nov 2024 at 22:52, ma lz <[email protected]> wrote:
>> select distinct a from t1 intersect select distinct a from t1;    — this is faster than origin sql

> No, the planner does not attempt that optimisation. INTERSECT really
> isn't very well optimised.

It's not really obvious to me why adding DISTINCT would make it
faster.  Seems like having two layers of plan nodes checking for
duplicate rows ought to be a loss.  Maybe we need to do some
micro-optimization in or near LookupTupleHashEntry.

A different idea that occurred to me while looking at this is:
why have we got all this machinery to add and check a flag
column, rather than arranging things so that the two input
relations are "outer" and "inner" children of the SetOp?
It's possible some of the performance difference reported here
is due to having to pass more tuples through the SubqueryScan
node (with its projection to add the flag) and Append node,
but we could remove those steps entirely.

> If we did want to improve this area, I think the first thing we'd want
> to do is use standard join types rather than HashSetOp Intersect to
> implement INTERSECT (without ALL).  To do that efficiently, we'd need
> to do a bit more work on the standard join types to have them
> efficiently support IS NOT DISTINCT FROM clauses as the join keys.

Maybe.  It'd be a big project, but we do get complaints every so
often about IS NOT DISTINCT FROM predicates not being efficient,
so the benefits would be wider than just INTERSECT.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Why not do distinct before SetOp
@ 2024-11-05 22:11  David Rowley <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: David Rowley @ 2024-11-05 22:11 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: ma lz <[email protected]>; pgsql-general

On Tue, 5 Nov 2024 at 04:18, Tom Lane <[email protected]> wrote:
> A different idea that occurred to me while looking at this is:
> why have we got all this machinery to add and check a flag
> column, rather than arranging things so that the two input
> relations are "outer" and "inner" children of the SetOp?

I've no idea why it's not like that. The current design is quite
strange and feels dated. It might be worth making that change as even
if we gave joins better support for IS NOT DISTINCT FROM and made
INTERSECT use INNER JOIN instead and EXCEPT use anti join, we'd still
need nodeSetOp.c for INTERSECT ALL and EXCEPT ALL.

> It's possible some of the performance difference reported here
> is due to having to pass more tuples through the SubqueryScan
> node (with its projection to add the flag) and Append node,
> but we could remove those steps entirely.

Seems plausible.

> > If we did want to improve this area, I think the first thing we'd want
> > to do is use standard join types rather than HashSetOp Intersect to
> > implement INTERSECT (without ALL).  To do that efficiently, we'd need
> > to do a bit more work on the standard join types to have them
> > efficiently support IS NOT DISTINCT FROM clauses as the join keys.
>
> Maybe.  It'd be a big project, but we do get complaints every so
> often about IS NOT DISTINCT FROM predicates not being efficient,
> so the benefits would be wider than just INTERSECT.

Yeah, I agree. I think that's step 1 towards making INTERSECT (without
ALL) and EXCEPT (without ALL) better and it would probably make a few
other people happy who use IS NOT DISTINCT FROM in their join
conditions.

David






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Why not do distinct before SetOp
@ 2024-11-05 23:09  Tom Lane <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Tom Lane @ 2024-11-05 23:09 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: ma lz <[email protected]>; pgsql-general

David Rowley <[email protected]> writes:
> On Tue, 5 Nov 2024 at 04:18, Tom Lane <[email protected]> wrote:
>> A different idea that occurred to me while looking at this is:
>> why have we got all this machinery to add and check a flag
>> column, rather than arranging things so that the two input
>> relations are "outer" and "inner" children of the SetOp?

> I've no idea why it's not like that. The current design is quite
> strange and feels dated. It might be worth making that change as even
> if we gave joins better support for IS NOT DISTINCT FROM and made
> INTERSECT use INNER JOIN instead and EXCEPT use anti join, we'd still
> need nodeSetOp.c for INTERSECT ALL and EXCEPT ALL.

Yeah.  We'd still need it, and besides which it seems like a fairly
small project, unlike the other thing which could take multiple
years to get to an acceptable state.

Of course, I might be overestimating the performance benefit we'd get.
But I'm tempted to give it a try.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Why not do distinct before SetOp
@ 2024-11-05 23:38  David Rowley <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: David Rowley @ 2024-11-05 23:38 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: ma lz <[email protected]>; pgsql-general

On Wed, 6 Nov 2024 at 12:09, Tom Lane <[email protected]> wrote:
> Of course, I might be overestimating the performance benefit we'd get.
> But I'm tempted to give it a try.

I'm glad. I'm curious to see if you're right about the projection
overhead of the flags. If you're right, it seems like a not too
difficult optimisation to get in.

I like the idea as it also gets rid of the flag cruft from
prepunion.c, which saves having to add flags conditionally if
INTERCEPT/EXCEPT were ever made to use joins.

David






^ permalink  raw  reply  [nested|flat] 4+ messages in thread


end of thread, other threads:[~2024-11-05 23:38 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-11-04 15:18 Re: Why not do distinct before SetOp Tom Lane <[email protected]>
2024-11-05 22:11 ` David Rowley <[email protected]>
2024-11-05 23:09   ` Tom Lane <[email protected]>
2024-11-05 23:38     ` David Rowley <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox