Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tOAKN-004XTx-6U for pgsql-general@arkaria.postgresql.org; Thu, 19 Dec 2024 06:48:07 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tOAKM-008ONH-F7 for pgsql-general@arkaria.postgresql.org; Thu, 19 Dec 2024 06:48:06 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tOAKL-008ON8-UL for pgsql-general@lists.postgresql.org; Thu, 19 Dec 2024 06:48:05 +0000 Received: from mail-ed1-x529.google.com ([2a00:1450:4864:20::529]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1tOAKH-000XZj-U7 for pgsql-general@postgresql.org; Thu, 19 Dec 2024 06:48:05 +0000 Received: by mail-ed1-x529.google.com with SMTP id 4fb4d7f45d1cf-5d3f28881d6so553718a12.1 for ; Wed, 18 Dec 2024 22:48:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=adriangb-com.20230601.gappssmtp.com; s=20230601; t=1734590881; x=1735195681; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=JkYO1BVRbfDNeh6/XXtAE0RAeA50K6MOAzYlprgmRJQ=; b=hn2Go3rVGkN3XNlzg2aLU8HSPcLRYvaHkQoXgKoBI1ABfo4XScJNLP3MYvHyQDKkPd YNVABtzQwFZWFYAE5V/jy1sugOyqRQx466Av+bR3UtHCr15ES8zq8dKsm4XpCoewNWeO 4+EAXULmNzLiGHBOR1tahXE9pYa8rspOmYFhCOy0YsHj4SvMg46NDwTsI9tOK3iUSKkF lp/b3wuY68vJkQA57wnzDI5LHx7i4Egd9uygTX0iI+2/3OBJT3kKQKcT6WTak71zCUKr y9uV54qt1VyFzsXQcYnbm60YhAZQLvjCLmK1Q3r5T+q25y51h2SOpqXFaT0ESkwKe4Wd id6w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1734590881; x=1735195681; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=JkYO1BVRbfDNeh6/XXtAE0RAeA50K6MOAzYlprgmRJQ=; b=O/EJ/u8ZoGjpe2LguJb3/IaE/2k7OihKnIMG9hWmhKEk6uSwOrOu0inaG08dc6UakY kWGj2WRJ/SebXSs64vKbJR5EWRyrlDM1RraWOMj+qTeL4elJzinTw+aHPHeuF9ukp2NP WWUtvjm/gbIhP63aRPpLuwgnucphJwiYeqxChrPeiLemNnFucD1KVR4WAEzShOayAdFy LYV5K4YMrilimW42Ypno1PaKNuJiM3+AfqqKteWi85Amy0h1JkuyzvycoUI+ABwcG5fO IFdSVQwrwAUkJwKFiSIlHI76RM1gMqvUy6VmLbz6Yaj1LF+lMDxemWdZg8AbgFHQP2Da Xqkg== X-Forwarded-Encrypted: i=1; AJvYcCU93SJ2cRqmdCfM8etcdwqlm1fPMzIhiZyglovHFRkf+YWwrP73Fd2ZWT5ihvd2ONCF382cGeAgQUpBeFhc@postgresql.org X-Gm-Message-State: AOJu0YyzlEAWZRnraavec/gNaUbvNZ5bo2MpJ4tU1befSbC21ptD2S4q fUuNwYzWKUGKGQIqlrNbFiKADPqXPdCP3nq34Yz4pVrgK3IJjxYDSsShF9nd8NBAx8oGzISjc62 NQDU5cJsw+8SDLLRL8jzs4jdpHkJwjArkniWgLA== X-Gm-Gg: ASbGncvFaY14Za4HQ4nWCA+iqMhthCDY4LHa770SpgZ9r13b3o23U+untRHyMv7q341 FuWqxu9kliDVYU80RlBXUtChNzsfuxsihzCqHmHKG X-Google-Smtp-Source: AGHT+IHzbuSdpmyerum5FdVEC881YVQY84LLfasAxTURPuVN/Qc4GqnKCDT9DYkHrjC4tGYK8Kuk4hQyHXI7j5y0mI4= X-Received: by 2002:a05:6402:2346:b0:5d1:1064:326a with SMTP id 4fb4d7f45d1cf-5d8025ca14dmr4179772a12.15.1734590881080; Wed, 18 Dec 2024 22:48:01 -0800 (PST) MIME-Version: 1.0 References: <3062509.1734583088@sss.pgh.pa.us> In-Reply-To: From: Adrian Garcia Badaracco Date: Thu, 19 Dec 2024 00:47:50 -0600 Message-ID: Subject: Re: Wrapping a where clause to preserve rows with nulls To: Tom Lane Cc: "David G. Johnston" , "pgsql-general@postgresql.org" Content-Type: multipart/alternative; boundary="0000000000004c51fc062999e9fc" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000004c51fc062999e9fc Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Well, there is a wrinkle: if the predicate returns `false` but one of the columns is null then the whole thing ends up `true` when I'd want it to be `false`. Say col_a =3D [1] and col_b =3D [null]: WHERE (col_a < 1 AND col_b > 1) OR col_a IS NULL OR col_b IS NULL -> WHERE (false AND null) OR false OR true -> WHERE false OR false OR true -> true. That's still a pretty good solution for now. On Wed, Dec 18, 2024 at 10:41=E2=80=AFPM Adrian Garcia Badaracco < adrian@adriangb.com> wrote: > Thank you for the great idea Tom. While yes I can't modify the original > WHERE clause I do think I'll be able to introspect it or get the system > generating it to tell me which columns it references and then add an OR x > is NULL OR y is NULL ... > > For context, just in case it's interesting, I store Parquet statistics in > a Postgres table and run the output of this thing on them: > https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049= ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456 > Hence why I can't really control the WHERE clause (at least not without > re-implementing a bunch of finicky error prone code). > > On Wed, Dec 18, 2024 at 10:38=E2=80=AFPM Tom Lane wro= te: > >> "David G. Johnston" writes: >> > On Wednesday, December 18, 2024, Adrian Garcia Badaracco < >> > adrian@adriangb.com> wrote: >> >> Is there any way to include the rows where the predicate evaluates to >> null >> >> while still using an index? >> >> > ... A btree index, which handles =3D, can=E2=80=99t be told to behave >> > differently and so cannot fulfill your desire to produce rows where th= e >> > stored value is null; it can only produce those equal to 5000. >> >> Not in a single scan, no. But multiple scans are possible: >> >> regression=3D# create table t (id int unique); >> CREATE TABLE >> regression=3D# explain select * from t where id =3D 5000 or id is null; >> QUERY PLAN >> >> >> ------------------------------------------------------------------------= ------ >> Bitmap Heap Scan on t (cost=3D8.42..18.98 rows=3D14 width=3D4) >> Recheck Cond: ((id IS NULL) OR (id =3D 5000)) >> -> BitmapOr (cost=3D8.42..8.42 rows=3D14 width=3D0) >> -> Bitmap Index Scan on t_id_key (cost=3D0.00..4.25 rows=3D13 >> width=3D0) >> Index Cond: (id IS NULL) >> -> Bitmap Index Scan on t_id_key (cost=3D0.00..4.16 rows=3D1 >> width=3D0) >> Index Cond: (id =3D 5000) >> (7 rows) >> >> The OP was quite unclear about what semantics he wants for >> multiple-variable WHERE clauses, but maybe something like this >> would work: >> >> WHERE (original-clause) OR x IS NULL OR y IS NULL OR ... >> >> where each variable mentioned in original-clause is allowed >> to also be NULL. Or perhaps what is wanted is >> >> WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...) >> >> ?? >> >> regards, tom lane >> > --0000000000004c51fc062999e9fc Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Well, there is a wrinkle: if the predicate returns `false`= but one of the columns is null then the whole thing ends up `true` when I&= #39;d want it to be `false`. Say col_a =3D [1] and col_b =3D [null]:
WHERE (col_a < 1 AND col_b > 1) OR col_a IS NULL OR col_= b IS NULL -> WHERE (false AND null) OR false OR true -> WHERE false O= R false OR true -> true.

That's still a pre= tty good solution for now.

On Wed, Dec 18, 2024 = at 10:41=E2=80=AFPM Adrian Garcia Badaracco <adrian@adriangb.com> wrote:
Thank you for the great i= dea Tom. While yes I can't modify the original WHERE clause I do think = I'll be able to introspect it or get the system generating it to tell m= e which columns it references and then add an OR x is NULL OR y is NULL ...=

For context, just in case it's interesting, I store= Parquet statistics in a Postgres table and run the output of this thing on= them:=C2=A0https://github.com/apache/datafusion/blob/f92= 442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pru= ning.rs#L146-L456
Hence why I can't really control the WH= ERE clause (at least not without re-implementing a bunch of finicky=C2=A0er= ror prone code).

On Wed, Dec 18, 2024 at 10:38=E2=80=AFPM Tom Lane <= ;tgl@sss.pgh.pa.us> wrote:
&q= uot;David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
> adrian@adrian= gb.com> wrote:
>> Is there any way to include the rows where the predicate evaluates= to null
>> while still using an index?

> ... A btree index, which handles =3D, can=E2=80=99t be told to behave<= br> > differently and so cannot fulfill your desire to produce rows where th= e
> stored value is null; it can only produce those equal to 5000.

Not in a single scan, no.=C2=A0 But multiple scans are possible:

regression=3D# create table t (id int unique);
CREATE TABLE
regression=3D# explain select * from t where id =3D 5000 or id is null;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 QUERY PLAN=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0
---------------------------------------------------------------------------= ---
=C2=A0Bitmap Heap Scan on t=C2=A0 (cost=3D8.42..18.98 rows=3D14 width=3D4)<= br> =C2=A0 =C2=A0Recheck Cond: ((id IS NULL) OR (id =3D 5000))
=C2=A0 =C2=A0->=C2=A0 BitmapOr=C2=A0 (cost=3D8.42..8.42 rows=3D14 width= =3D0)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0->=C2=A0 Bitmap Index Scan on t_id_key= =C2=A0 (cost=3D0.00..4.25 rows=3D13 width=3D0)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (id IS N= ULL)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0->=C2=A0 Bitmap Index Scan on t_id_key= =C2=A0 (cost=3D0.00..4.16 rows=3D1 width=3D0)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (id =3D = 5000)
(7 rows)

The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:

WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...

where each variable mentioned in original-clause is allowed
to also be NULL.=C2=A0 Or perhaps what is wanted is

WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)

??

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 regards, tom lane
--0000000000004c51fc062999e9fc--