Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1unpvD-00Bj68-Ky for pgsql-hackers@arkaria.postgresql.org; Mon, 18 Aug 2025 02:48:33 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1unpvC-00CYg8-Om for pgsql-hackers@arkaria.postgresql.org; Mon, 18 Aug 2025 02:48:31 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1unpvC-00CYfz-Er for pgsql-hackers@lists.postgresql.org; Mon, 18 Aug 2025 02:48:31 +0000 Received: from mail-pj1-x1030.google.com ([2607:f8b0:4864:20::1030]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1unpv8-000R8h-21 for pgsql-hackers@lists.postgresql.org; Mon, 18 Aug 2025 02:48:30 +0000 Received: by mail-pj1-x1030.google.com with SMTP id 98e67ed59e1d1-32326de9f4eso3111497a91.2 for ; Sun, 17 Aug 2025 19:48:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1755485304; x=1756090104; darn=lists.postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=e9u90aL7Nh3YIPbwxD+nwcJcx/9UQJMSakUuFkXY63k=; b=HmeSz69CoWNEoHsnMiAG+LKAYDCwHlHZw1jXbBitPUnj0EIs+qaoICm6E8iSiqqZ7L 23st+woSxcYlgBFgy0uZ2ri+3DqGrVKBBlPl8xTq+wVrPx1v3Ng10m9hz8u2a/STR/iD ywnmM6qqkOjgL7z6Hg+qVFD22sJTA0+zXcxv50+03IuGy2vYfxM2HyMFNqNznR9ICnaU jz8Pm90xk7zvVBJqBqRCPGQ8sZhJ0Zq4CZgnEMG1KMP+PKFtP07FkMO79DAmSAd7ZLVP zfDM05o+Wrk6fcAxi145bWM69CGKnFekNUEb6jeTqJeL6yQyAXnSfSF74y9+wLai22Xc 1JDA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755485304; x=1756090104; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=e9u90aL7Nh3YIPbwxD+nwcJcx/9UQJMSakUuFkXY63k=; b=RubZ/6CZJwJQSbPcKvA+DqE0mQ/5Hj4E7yR5vLBWC94dX7wVBkryQ2F6Du4a1M7/lc gfRl0oBRx6TXPenLteDYimGVJ5HWgGp9gnosh/o4sUERUIi/KaOUNBvh2PzYK1a1vMgX /iEQN/OWtZKLIv2hNu2JfpCTtNYuVnxgqhrO5gfY84A0ngedezBLt1ZzD/eqLuIjzTIL J6jEraF5r5YPags3MOlCFA5OauL8HhupAPNZst6Upwz7eqzPp2w4fBDwrj8MuQPiTqSK 3uR+CbuukU/KpcA7allq2qKMt48FWt2oh0QpqGPOeOlpcoG6X9ddLf6Qw8sL9AWyppx1 tHag== X-Gm-Message-State: AOJu0YyXSFbTwbYPMkHmUu8aUIbV3IVgcqy7b3MxW/mKOk1nFaWlgWya 1Li6Zc37ljEGlkOv/KCqYGFXAR2xUtNHscI4xhg2GwcWqlt2iH1yqDFKXwUJgcgVi4k= X-Gm-Gg: ASbGncsYpaONzik2EepXwpjD5NZEjiNrpNh2vwqqbt4sbJ70qquUr7Qls3kj/kxv5NI Zxdrd/F1WXJwRFQ/ocTJ7jsaySaVW129hww5ikrk0STmkAnevFiiVOoBiZuA4Ab5NMAKeow14Mr LHjbLWWfomKr0qgmjwRYe6NDqY7twAItX8k/rD5sWF4i79Q7bdCaH865qUHosACkPTBl815C2li QrU/QlcbRJ1BwBDAQvTTGV7N7w1FEygxr4pDSPDg3VonPXvvilZCYjhbFlOLDSKrojzZ3J5+HQW dC2N1bGExbbOcA+/xnUtm8AMyV7hppQ2G9Pr5lQjVN7DS5zQf6r+YtiNbUSMVUMjgxGp7O9w6Bm re7OkbCfkWbyA8D51NbmjDJg4mXTJ14Dq X-Google-Smtp-Source: AGHT+IGNTstMgZeCDZAJYhFdVC6jTicQvJ5fCnpKBkCsxljTkeHs2FAWBWfbz/xOJdErwiSG9RwuFw== X-Received: by 2002:a17:90b:538e:b0:31e:6f0a:6a1a with SMTP id 98e67ed59e1d1-323421640e4mr13745874a91.3.1755485303762; Sun, 17 Aug 2025 19:48:23 -0700 (PDT) Received: from smtpclient.apple ([142.171.105.12]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-32331166700sm9585871a91.24.2025.08.17.19.48.22 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Sun, 17 Aug 2025 19:48:23 -0700 (PDT) From: Chao Li Message-Id: <5C5EE031-F086-4353-A17A-DA563CD24DDD@gmail.com> Content-Type: multipart/alternative; boundary="Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.700.81\)) Subject: Re: Improve hash join's handling of tuples with null join keys Date: Mon, 18 Aug 2025 10:48:08 +0800 In-Reply-To: <616751.1755276726@sss.pgh.pa.us> Cc: pgsql-hackers@lists.postgresql.org To: Tom Lane References: <3061845.1746486714@sss.pgh.pa.us> <496221.1748882849@sss.pgh.pa.us> <175507656113.993.1381684440543440253.pgcf@coridan.postgresql.org> <544D7C83-CECE-44E7-B5D7-530E9318D231@gmail.com> <616751.1755276726@sss.pgh.pa.us> X-Mailer: Apple Mail (2.3826.700.81) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On Aug 16, 2025, at 00:52, Tom Lane wrote: >=20 > Chao Li writes: >> With this patch, =E2=80=9Cisnull=E2=80=9D now becomes true because of = the change of strict op. Then the outer null join key tuple must be = stored in a tuplestore. When an outer table contains a lot of null join = key tuples, then the tuplestore could bump to very large, in that case, = it would be hard to say this patch really benefits. >=20 > What's your point? If we don't divert those tuples into the > tuplestore, then they will end up in the main hash table instead, > and the consequences of bloat there are far worse. I might not state clearly. For this comments, I meant the outer table. = For example: SELECT a.*, b.* from a RIGHT JOIN b on a.id =3D b.a_id; Let=E2=80=99s say table a is used to build hash, table b is the outer = table. And say, table b has 1000 tuples whose a_id are NULL. Before this patch, when fetching such a tuple (a_id is null) from table = b, the tuple will be returned to parent node immediately.=20 With this tuple, all of such tuples will be put into = hj_NullOuterTupleStore, and only be returned after all non-null tuples = are processed. My comment was trying to say that if there are a lot of null join key = tuples in outer table, then hj_NullOuterTupleStore might use a lot of = memory or swap data to disk, which might lead to performance burden. So, = I was thinking we could keep the original logic for outer table, and = return null join key tuples immediately. >=20 >> Based on this patch, if we are doing a left join, and outer table is = empty, then all tuples from the inner table should be returned. In that = case, we can skip building a hash table, instead, we can put all inner = table tuples into hashtable.innerNullTupleStore. Building a tuplestore = should be cheaper than building a hash table, so this way makes a little = bit more performance improvement. >=20 > I think that would make the logic completely unintelligible. Also, > a totally-empty input relation is not a common situation. We try to > optimize such cases when it's simple to do so, but we shouldn't let > that drive the fundamental design. >=20 I absolutely agree we should not touch the fundamental design for the = tiny optimization, that=E2=80=99s why I mentioned =E2=80=9Cbased on this = patch=E2=80=9D. With this patch, you have introduced a change in MultiExecPrivateHash(): else if (node->keep_null_tuples) { /* null join key, but we must save tuple to be = emitted later */ if (node->null_tuple_store =3D=3D NULL) node->null_tuple_store =3D = ExecHashBuildNullTupleStore(hashtable); tuplestore_puttupleslot(node->null_tuple_store, = slot); } We can simply added a new flag to HashTable, say named = skip_building_hash. Upon right join (join to the hash side), and outer = table is empty, set the flag to true, then in the = MultiExecPrivateHash(), if skip_building_hash is true, directly put all = tuples into node->null_tuple_store without building a hash table. Then in ExecHashJoinImpl(), after "(void) MultiExecProcNode()" is = called, if hashtable->skip_building_hash is true, directly set = node->hj_JoinState =3D HJ_FILL_INNER_NULL_TUPLES. So, the tiny optimization is totally based on this patch, it depends on = the HashTable.null_tuple_store (if you take this comment, then maybe = rename this variable) and the new state HJ_FILL_INNER_NULL_TUPLES. Best regards, =3D=3D Chao Li (Evan) -------------------- HighGo Software Co., Ltd. https://www.highgo.com/ --Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8

On Aug 16, 2025, at 00:52, Tom Lane = <tgl@sss.pgh.pa.us> wrote:

Chao Li = <li.evan.chao@gmail.com> writes:
With = this patch, =E2=80=9Cisnull=E2=80=9D now becomes true because of the = change of strict op. Then the outer null join key tuple must be stored = in a tuplestore. When an outer table contains a lot of null join key = tuples, then the tuplestore could bump to very large, in that case, it = would be hard to say this patch really = benefits.

What's your point?  If we don't = divert those tuples into the
tuplestore, then they will end up in the = main hash table instead,
and the consequences of bloat there are far = worse.

I might not state clearly. For this comments, = I meant the outer table. For example:

SELECT a.*, b.* from a = RIGHT JOIN b on a.id =3D b.a_id;

Let=E2=80=99s say table a = is used to build hash, table b is the outer table.

And say, table = b has 1000 tuples whose a_id are NULL.

Before this patch, when = fetching such a tuple (a_id is null) from table b, the tuple will be = returned to parent node immediately. 

With this tuple, all = of such tuples will be put into hj_NullOuterTupleStore, and only be = returned after all non-null tuples are processed.

My comment was = trying to say that if there are a lot of null join key tuples in outer = table, then hj_NullOuterTupleStore might use a lot of memory or swap = data to disk, which might lead to performance burden. So, I was thinking = we could keep the original logic for outer table, and return null join = key tuples immediately.



Based on this patch, if we = are doing a left join, and outer table is empty, then all tuples from = the inner table should be returned. In that case, we can skip building a = hash table, instead, we can put all inner table tuples into = hashtable.innerNullTupleStore. Building a tuplestore should be cheaper = than building a hash table, so this way makes a little bit more = performance improvement.

I think that would make the = logic completely unintelligible.  Also,
a totally-empty input = relation is not a common situation.  We try to
optimize such = cases when it's simple to do so, but we shouldn't let
that drive the = fundamental design.


I absolutely = agree we should not touch the fundamental design for the tiny = optimization, that=E2=80=99s why I mentioned =E2=80=9Cbased on this = patch=E2=80=9D.

With this patch, you have = introduced a change in = MultiExecPrivateHash():

= else if (node->keep_null_tuples)
= {
/* null join key, = but we must save tuple to be emitted later */
= if (node->null_tuple_store =3D=3D NULL)
= node->null_tuple_store =3D = ExecHashBuildNullTupleStore(hashtable);
= tuplestore_puttupleslot(node->null_tuple_store, = slot);
= }

We can simply = added a new flag to HashTable, say named skip_building_hash. Upon right = join (join to the hash side), and outer table is empty, set the flag to = true, then in the MultiExecPrivateHash(), if skip_building_hash is true, = directly put all tuples into node->null_tuple_store without building = a hash table.

Then in ExecHashJoinImpl(), after = "(void) MultiExecProcNode()" is called, if = hashtable->skip_building_hash is true, directly set = node->hj_JoinState =3D = HJ_FILL_INNER_NULL_TUPLES.

So, the tiny = optimization is totally based on this patch, it depends on the = HashTable.null_tuple_store (if you take this comment, then maybe rename = this variable) and the new state = HJ_FILL_INNER_NULL_TUPLES.

Best = regards,
=3D=3D
Chao Li (Evan)
--------------------
HighGo Software Co., = Ltd.
https://www.highgo.com/



= --Apple-Mail=_1F8A00E3-65F4-42DB-85EB-E5FEE9E5CD80--