From: Chao Li <li.evan.chao@gmail.com>
Message-Id: <5C5EE031-F086-4353-A17A-DA563CD24DDD@gmail.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.700.81\))
Subject: Re: Improve hash join's handling of tuples with null join keys
Date: Mon, 18 Aug 2025 10:48:08 +0800
In-Reply-To: <616751.1755276726@sss.pgh.pa.us>
Cc: pgsql-hackers@lists.postgresql.org
To: Tom Lane <tgl@sss.pgh.pa.us>
References: <3061845.1746486714@sss.pgh.pa.us>
 <496221.1748882849@sss.pgh.pa.us>
 <175507656113.993.1381684440543440253.pgcf@coridan.postgresql.org>
 <544D7C83-CECE-44E7-B5D7-530E9318D231@gmail.com>
 <616751.1755276726@sss.pgh.pa.us>
Archived-At: 
 <https://www.postgresql.org/message-id/5C5EE031-F086-4353-A17A-DA563CD24DDD%40gmail.com>
Precedence: bulk


--Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8


> On Aug 16, 2025, at 00:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Chao Li <li.evan.chao@gmail.com> writes:
>> With this patch, “isnull” now becomes true because of the change of
>> strict op. Then the outer null join key tuple must be stored in a
>> tuplestore. When an outer table contains a lot of null join key
>> tuples, then the tuplestore could bump to very large, in that case,
>> it would be hard to say this patch really benefits.
>
> What's your point?  If we don't divert those tuples into the
> tuplestore, then they will end up in the main hash table instead,
> and the consequences of bloat there are far worse.

I may not have stated that clearly. In that comment I was referring to
the outer table. For example:

SELECT a.*, b.* FROM a RIGHT JOIN b ON a.id = b.a_id;

Let’s say table a is used to build the hash table, and table b is the
outer table.

And say table b has 1000 tuples whose a_id is NULL.

Before this patch, when such a tuple (a_id is NULL) was fetched from
table b, it was returned to the parent node immediately.

With this patch, all such tuples are put into hj_NullOuterTupleStore
and are returned only after all non-null tuples have been processed.

My comment was trying to say that if the outer table contains a lot of
null join key tuples, then hj_NullOuterTupleStore might use a lot of
memory or spill to disk, which could hurt performance. So I was
thinking we could keep the original logic for the outer table and
return null join key tuples immediately.
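
To make the concern concrete, here is a toy model in C. It is not
PostgreSQL code: ToyOuterTuple and toy_buffered_null_tuples() are
hypothetical names, and the model only counts how many null-key outer
tuples would have to be held back under each strategy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an outer tuple; only the join key matters. */
typedef struct ToyOuterTuple
{
	bool		key_is_null;	/* true when b.a_id is NULL */
} ToyOuterTuple;

/*
 * Count the null-key outer tuples that must be buffered before they can
 * be emitted.
 *
 * defer_nulls = true  models diverting them into a tuplestore and
 *                     emitting them after the non-null tuples.
 * defer_nulls = false models returning each one to the parent at once.
 */
size_t
toy_buffered_null_tuples(const ToyOuterTuple *tuples, size_t ntuples,
						 bool defer_nulls)
{
	size_t		buffered = 0;

	assert(tuples != NULL || ntuples == 0);

	for (size_t i = 0; i < ntuples; i++)
	{
		if (!tuples[i].key_is_null)
			continue;			/* non-null keys probe the hash table */
		if (defer_nulls)
			buffered++;			/* would be stored for later emission */
		/* else: the null-extended row is emitted right away */
	}
	return buffered;
}
```

With 1000 NULL a_id values in table b, the deferred strategy holds 1000
tuples in this model and the immediate strategy holds none.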


>
>> Based on this patch, if we are doing a left join, and outer table is
>> empty, then all tuples from the inner table should be returned. In
>> that case, we can skip building a hash table, instead, we can put
>> all inner table tuples into hashtable.innerNullTupleStore. Building
>> a tuplestore should be cheaper than building a hash table, so this
>> way makes a little bit more performance improvement.
>
> I think that would make the logic completely unintelligible.  Also,
> a totally-empty input relation is not a common situation.  We try to
> optimize such cases when it's simple to do so, but we shouldn't let
> that drive the fundamental design.
>

I absolutely agree that we should not touch the fundamental design for
such a tiny optimization; that’s why I said “based on this patch”.

With this patch, you have introduced a change in MultiExecPrivateHash():

		else if (node->keep_null_tuples)
		{
			/* null join key, but we must save tuple to be emitted later */
			if (node->null_tuple_store == NULL)
				node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
			tuplestore_puttupleslot(node->null_tuple_store, slot);
		}

We could simply add a new flag to HashTable, say skip_building_hash.
For a right join (join to the hash side) with an empty outer table, set
the flag to true; then in MultiExecPrivateHash(), if skip_building_hash
is true, put all tuples directly into node->null_tuple_store without
building a hash table.

Then in ExecHashJoinImpl(), after "(void) MultiExecProcNode()" is
called, if hashtable->skip_building_hash is true, set
node->hj_JoinState = HJ_FILL_INNER_NULL_TUPLES directly.

So this tiny optimization is built entirely on this patch: it depends
on HashTable.null_tuple_store (if you take this comment, the variable
might need a new name) and on the new state HJ_FILL_INNER_NULL_TUPLES.

Best regards,
==
Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/


--Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80--