Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uEB8x-0046Vh-Jk for pgsql-hackers@arkaria.postgresql.org; Sun, 11 May 2025 18:11:20 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uEB8w-00Amvk-Ma for pgsql-hackers@arkaria.postgresql.org; Sun, 11 May 2025 18:11:18 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uEB8w-00Amvb-Bn for pgsql-hackers@lists.postgresql.org; Sun, 11 May 2025 18:11:18 +0000 Received: from mail-oo1-xc2e.google.com ([2607:f8b0:4864:20::c2e]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uEB8t-001Nyz-2C for pgsql-hackers@lists.postgresql.org; Sun, 11 May 2025 18:11:18 +0000 Received: by mail-oo1-xc2e.google.com with SMTP id 006d021491bc7-607d5b7fbeaso417368eaf.3 for ; Sun, 11 May 2025 11:11:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1746987074; x=1747591874; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Q7fJqmRqgUYYTFVkY0HjHDyROnR08bRtu+ZEx2Nf6Ec=; b=KLhNucWuLbFxk4Oh8UhjabLHKV9C/SUhNJjB48f/QMzd8+wU2ZjjvY2SKmjyId8SWX +f6m7Z1k9WENqBRMkuCFYSd9ZiRrYcgNrqC+CIzvBrAEVY1Gl0UgQkabU+rPB2wadJ9g Ig93W0EVtaCbA2wTBuKl+anlzgQMkt3elkcHh8qnQlJAckXM2D6mJPRL9198ld8hJLVS 3MvxwVNW6YAffSdYjn+cfKmL9LPzc+uTJjtuhZnjTbMxqStLkAY9RTSue8aVkbG14FZi ioGWAxHxCXcvm0Pj2PrkH3J30vsIu7WJ+xpP+W6CSxzjMLuEY2v5UYaUgb6sYMLuLPNy yhTQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1746987074; x=1747591874; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Q7fJqmRqgUYYTFVkY0HjHDyROnR08bRtu+ZEx2Nf6Ec=; b=w1rqVMuMrPERBq31c0W+XbcBX4nSvnRoINuxu6cGtJykmJL7859qZlKeGjs7uZXtmB 6EVoAdi6rpcmzSUHzNQgYhCjne6D5Vl2gM3rVudhfGRfzE2a3kNRcP0DtOIC2fQN/Ebk P0Zc2m8JChokwRJ/r/a4QzBXaGowGkXfQnD1VLFug1OHhSyjfdRNV90gMsji7FMspsUN rvQzFkYb7VqOX1taQwrI+aUtYtW55ct3ixuD/o4AY5KsM4pCWRfvwj46SxewS5MT371b QJPyX+9vCRMaroLwMJLDbotfjRk2c1PMB3UPv9rAqPRBNfEOxxkhYRIDK2A6AmzXBDZi QE/Q== X-Forwarded-Encrypted: i=1; AJvYcCXcJZWMWvOHag/QG+N6TPR9stujkYDf7bMckoDNuzA5OATQyBq/yUDZmoBhvtuFAOnkxpCkSmYm6anXLo+J@lists.postgresql.org X-Gm-Message-State: AOJu0YxQelSUUCfiZlRWsWIkk3ykGz0uTkxWDcI2LoeyzjTcjgTpLnVF 7jJ4eG5YeHD0g+yNr4nV1Dj5l54JW2yGxItVfoyTX8OXclE6gTyEdqSVUCx+tt01kya3tochOa1 1bX6LuDMVaLwGxPn2fxPGgHQ7xWBVR4/5+HyDNg== X-Gm-Gg: ASbGncvrLWd0FxGiRMxbbShAVxFwT0T3Z1kBdhYoj/w+NAz+Jfv7cRtT21HbNWInHR6 mSs0+BB0Kn/fk2hsXDgDKah275qNj6nr5eqJaJPIustUL/YKpjX/XTrmXVFYd+Opf/YIoNHgX+o kWHlf+QxUtX9QNld77FDsQb8So2ZKCH2U5 X-Google-Smtp-Source: AGHT+IG0gGiySsWkv/a2laz3Jxj0Ej3qoKUw+AVWT4G7/T6PScwQmeJiOKLtp22awt+pm7mvpWnKuyzh6CNTrkq4dqY= X-Received: by 2002:a4a:c885:0:b0:606:3a5d:c7ad with SMTP id 006d021491bc7-6083fda9768mr1735999eaf.0.1746987074050; Sun, 11 May 2025 11:11:14 -0700 (PDT) MIME-Version: 1.0 References: <3061845.1746486714@sss.pgh.pa.us> In-Reply-To: From: Thomas Munro Date: Sun, 11 May 2025 14:10:38 -0400 X-Gm-Features: AX0GCFujUZpJIm-JJjbClerYav7v7l7XXrpRUUjP_MW7GLRdgnw7ZjEc1ZubpOI Message-ID: Subject: Re: Improve hash join's handling of tuples with null join keys To: Tomas Vondra Cc: Tom Lane , pgsql-hackers@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, May 6, 2025 at 12:12=E2=80=AFPM Tomas Vondra wrot= e: > On 5/6/25 01:11, Tom Lane wrote: > > The attached patch is a response to the discussion at [1], where > > it emerged that lots of rows with null join keys can send a hash > > join into too-many-batches hell, if they are on the outer side > > of the join so that they must be null-extended not just discarded. > > This isn't really surprising given that such rows will certainly > > end up in the same hash bucket, and no amount of splitting can > > reduce the size of that bucket. (I'm a bit surprised that the > > growEnabled heuristic didn't kick in, but it seems it didn't, > > at least not up to several million batches.) Good idea. I haven't reviewed it properly, but one observation is that trapping the null-keys tuples in per-worker tuple stores creates unfairness. That could be fixed by using a SharedTuplestore instead, but unfortunately SharedTuplestore always spills to disk at the moment, so maybe I should think about how to give it some memory for small sets like regular Tuplestore. Will look more closely after Montreal. > I don't think that's too surprising - growEnabled depends on all tuples > getting into the same batch during a split. But even if there are many > duplicate values, real-world data sets often have a couple more tuples > that just happen to fall into that bucket too. And then during the split > some get into one batch and some get into another. Yeah. > My personal experience is that the growEnabled heuristics is overly > sensitive, and probably does not trigger very often. It can also get set > to "true" to early, but that's (much) harder to hit. > > I have suggested to make growEnabled less strict in [2], i.e. to > calculate the threshold as percentage of the batch, and not disable > growth permanently. But it was orthogonal to what that thread did. +1, I also wrote a thread with a draft patch like that at some point, which I'll also try to dig up after Montreal. > But more importantly, wasn't the issue discussed in [1] about parallel > hash joins? I got quite confused while reading the thread ... I'm asking > because growEnabled is checked only in ExecHashIncreaseNumBatches, not > in ExecParallelHashIncreaseNumBatches. So AFAICS the parallel hash joins > don't use growEnabled at all, no? There is an equivalent mechanism, but it's slightly more complicated as it requires consensus (see code near PHJ_GROW_BATCHES_DECIDE). It's possible that it's not working quite as well as it should. It's definitely less deterministic in some edge cases since tuples are packed into chunks differently so the memory used can vary slightly run-to-run, but the tuple count should be stable. I've made a note to review that logic again too. Note also that v16 is the first that could put NULLs in a shared memory hash table (11c2d6fdf5 enabled Parallel Hash Right|Full Join), while non-P HJ has had that for a long time, but it also couldn't be used in a parallel query, so I guess it's possible that this stuff is coming up now because it wasn't often picked for problems that would generate interesting numbers of NULLs likely to exceed limits given available plans in older releases. See also related bug fix in 98c7c7152, spotted soon after this plan type escaped into the field. While thinking about that, I wanted to note that we have more things to improve in PHRJ: (1) Parallelism of unmatched scan: a short but not entirely satisfying patch was already shared on the PHRJ thread but not committed with the main feature. I already had some inklings of how to do much better which I recently described in a bit more detail on the PBHS thread in vapourware form, where parallel fairness came up again. "La perfection est le mortel ennemi du bien" or whatever it is they say in the language of Montreal, but really the easy patch for unmatched scan parallelism wasn't actually bon enough, because it was non-deterministic how many processes could participate due to deadlock avoidance arcana, creating run-to-run variation that I'd expect Tom=C3=A1= =C5=A1 to find empirically and reject in one of his benchmarking expeditions :-). (2) Bogus asymmetries in estimations/planning: I wrote some analysis of why we don't use PHRJ as much as we could/should near Richard Guo's work on anti/semi joins which went in around the same time. My idea there is to try to debogify the parallel degree logic more generally, it's just that PHRJ brought key aspects of it into relief for me, ie bogosity of the rule-based "driving table" concept. I'll try to write these projects up on the wiki, instead of in random threads :-) In other words if you just use local Tuplestores as you showed it would actually be an improvement in fairness over the status quo due to (1) not being solved yet... but it will be solved, hence mentioning it in this context.