public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Convert NOT IN sublinks to anti-joins when safe
4+ messages / 3 participants
[nested] [flat]

* Re: Convert NOT IN sublinks to anti-joins when safe
@ 2026-02-04 14:59  David Geier <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: David Geier @ 2026-02-04 14:59 UTC (permalink / raw)
  To: Richard Guo <[email protected]>; Pg Hackers <[email protected]>

Hi!

On 04.02.2026 10:47, Richard Guo wrote:
> On Tue, Feb 3, 2026 at 4:12 PM Richard Guo <[email protected]> wrote:
>> This topic has been discussed several times in the past.  Due to the
>> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
>> equivalent to an anti-join.  However, if we can prove that neither the
>> outer expressions nor the subquery outputs can yield NULL values, it
>> should be safe to convert NOT IN to an anti-join.

I used to play around with query rewrites of the same form, turning NOT
IN into NOT EXISTS. As you rightfully pointed out, the straight forward
rewrite only works if neither the outer expression nor the sub-query
cannot yield NULLs, limiting the optimization considerably. Both cases
can be fixed by amending the basic form of the rewrite.

Basic form:

SELECT t1.c1 FROM t1 WHERE t1.c1 NOT IN (SELECT t2.c1 FROM t2)

=>

SELECT t1.c1 FROM t1 WHERE
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

If the sub-select can yield NULLs, the rewrite can be fixed by adding an
OR t2.c1 IS NULL clause, such as:

SELECT t1.c1 FROM t1 WHERE
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)

which is equivalent to the following SQL which avoids the OR:

SELECT t1.c1 FROM t1 WHERE
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)

If the outer expression can yield NULLs, the rewrite can be fixed by
adding a t1.c1 IS NOT NULL clause, such as:

SELECT t1.c1 FROM T1 WHERE
  t1.c1 IS NOT NULL AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

Both fixes can of course be combined yielding:

SELECT t1.c1 FROM T1 WHERE
  t1.c1 IS NOT NULL AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)

What's our today's take on doing more involved transformations inside
the planner to support such cases? It would greatly open up the scope of
the optimization.

> I've noticed a loose end in the v1 patch.
> 
> The semantic gap between NOT IN and anti-join actually exists whenever
> the operator returns NULL.  For NOT IN, if (A op B) returns NULL, then
> NOT (NULL) evaluates to NULL (effectively false), and the row is
> discarded.  In contrast, for an anti-join, if (A op B) returns NULL,
> it implies no match was found, and the anti-join logic dictates that
> the row should be kept.
> 
> To guarantee that (A op B) never returns NULL, the current patch
> verifies that both A and B are non-nullable.  However, this is not
> sufficient.  The "op" might be an operator that returns NULL on
> non-null inputs.
> 
> On the other hand, if "op" does not return NULL on NULL inputs, like
> IS DISTINCT FROM, we technically would not even need to require that A
> and B are non-nullable.
> 
> Is there a convenient way to verify that an operator never returns
> NULL on non-null inputs?  Would it be sufficient to insist that the
> operator belongs to btree opclass (assuming that the strict ordering
> requirements of btree imply this safety)?

There's lots of code that e.g. calls FunctionCall2[Coll]() on operators,
where that function raises an ERROR in case the returned value is NULL.
Hence, I would say it's safe to make that assumption.

Otherwise, you could restrict the check to only builtin operators. For
those we know they have the expected semantics.

--
David Geier






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Convert NOT IN sublinks to anti-joins when safe
@ 2026-02-05 06:09  Richard Guo <[email protected]>
  parent: David Geier <[email protected]>
  0 siblings, 2 replies; 4+ messages in thread

From: Richard Guo @ 2026-02-05 06:09 UTC (permalink / raw)
  To: David Geier <[email protected]>; +Cc: Pg Hackers <[email protected]>

On Wed, Feb 4, 2026 at 11:59 PM David Geier <[email protected]> wrote:
> If the sub-select can yield NULLs, the rewrite can be fixed by adding an
> OR t2.c1 IS NULL clause, such as:
>
> SELECT t1.c1 FROM t1 WHERE
>   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)

I'm not sure if this rewrite results in a better plan.  The OR clause
would force a nested loop join, which could be much slower than a
hashed-subplan plan.

> If the outer expression can yield NULLs, the rewrite can be fixed by
> adding a t1.c1 IS NOT NULL clause, such as:
>
> SELECT t1.c1 FROM T1 WHERE
>   t1.c1 IS NOT NULL AND
>   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

This rewrite doesn't seem correct to me.  If t2 is empty, you would
incorrectly lose the NULL rows from t1 in the final result.

> What's our today's take on doing more involved transformations inside
> the planner to support such cases? It would greatly open up the scope of
> the optimization.

As mentioned in my initial email, the goal of this patch is not to
handle every possible case, but rather only to handle the basic form
where both sides of NOT IN are provably non-nullable.  This keeps the
code complexity to a minimum, and I believe this would cover the most
common use cases in real world.

- Richard






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Convert NOT IN sublinks to anti-joins when safe
@ 2026-02-05 07:29  wenhui qiu <[email protected]>
  parent: Richard Guo <[email protected]>
  1 sibling, 0 replies; 4+ messages in thread

From: wenhui qiu @ 2026-02-05 07:29 UTC (permalink / raw)
  To: Richard Guo <[email protected]>; +Cc: David Geier <[email protected]>; Pg Hackers <[email protected]>

HI Richard
> As mentioned in my initial email, the goal of this patch is not to
> handle every possible case, but rather only to handle the basic form
> where both sides of NOT IN are provably non-nullable.  This keeps the
> code complexity to a minimum, and I believe this would cover the most
> common use cases in real world.
Agree +1  ,The current path already covers common scenarios and is no less
comprehensive than other databases.I'm already quite pleased that it can be
merged.
Having tested a certain widely used open-source database, I found it unable
to process the following query: `SELECT * FROM join1 WHERE id NOT IN
(SELECT id FROM join2 WHERE id IS NOT NULL);` Note that join2 allows null
values for id.


Thanks


On Thu, Feb 5, 2026 at 2:09 PM Richard Guo <[email protected]> wrote:

> On Wed, Feb 4, 2026 at 11:59 PM David Geier <[email protected]> wrote:
> > If the sub-select can yield NULLs, the rewrite can be fixed by adding an
> > OR t2.c1 IS NULL clause, such as:
> >
> > SELECT t1.c1 FROM t1 WHERE
> >   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)
>
> I'm not sure if this rewrite results in a better plan.  The OR clause
> would force a nested loop join, which could be much slower than a
> hashed-subplan plan.
>
> > If the outer expression can yield NULLs, the rewrite can be fixed by
> > adding a t1.c1 IS NOT NULL clause, such as:
> >
> > SELECT t1.c1 FROM T1 WHERE
> >   t1.c1 IS NOT NULL AND
> >   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)
>
> This rewrite doesn't seem correct to me.  If t2 is empty, you would
> incorrectly lose the NULL rows from t1 in the final result.
>
> > What's our today's take on doing more involved transformations inside
> > the planner to support such cases? It would greatly open up the scope of
> > the optimization.
>
> As mentioned in my initial email, the goal of this patch is not to
> handle every possible case, but rather only to handle the basic form
> where both sides of NOT IN are provably non-nullable.  This keeps the
> code complexity to a minimum, and I believe this would cover the most
> common use cases in real world.
>
> - Richard
>
>
>


^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Convert NOT IN sublinks to anti-joins when safe
@ 2026-03-04 09:33  Richard Guo <[email protected]>
  parent: Richard Guo <[email protected]>
  1 sibling, 0 replies; 4+ messages in thread

From: Richard Guo @ 2026-03-04 09:33 UTC (permalink / raw)
  To: David Geier <[email protected]>; +Cc: Pg Hackers <[email protected]>

On Mon, Mar 2, 2026 at 9:50 PM David Geier <[email protected]> wrote:
> The very last rewrite combines both cases. The rewritten query then
> looks like:
>
> SELECT t1.c1 FROM T1 WHERE
>   t1.c1 IS NOT NULL AND
>   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
>   NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)

I'm still not convinced this rewrite is correct.  As I mentioned
earlier, it breaks down if t2 is empty while t1 contains NULL rows.
For example:

CREATE TABLE t1 (c1 int);
CREATE TABLE t2 (c1 int);
INSERT INTO t1 VALUES (1), (NULL);

SELECT t1.c1 FROM t1 WHERE t1.c1 NOT IN (SELECT t2.c1 FROM t2);
 c1
----
  1

(2 rows)

SELECT t1.c1 FROM T1 WHERE
  t1.c1 IS NOT NULL AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL);
 c1
----
  1
(1 row)

> Seems reasonable to start with the non-NULL variant, though there are
> certainly cases where there's no PK / unique index on the relevant columns.

Yeah.  I don't know how to optimize nullable NOT IN clauses.  It seems
quite difficult to handle safely purely via query transformations.
Maybe we can explore adding a dedicated Null-Aware Anti-Join execution
node, much like Oracle's approach.  But that is definitely beyond the
scope of this current patch.

- Richard





^ permalink  raw  reply  [nested|flat] 4+ messages in thread


end of thread, other threads:[~2026-03-04 09:33 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-04 14:59 Re: Convert NOT IN sublinks to anti-joins when safe David Geier <[email protected]>
2026-02-05 06:09 ` Richard Guo <[email protected]>
2026-02-05 07:29   ` wenhui qiu <[email protected]>
2026-03-04 09:33   ` Richard Guo <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox