public inbox for [email protected]
13+ messages / 3 participants
* Re: Proposal: Adding compression of temporary files
@ 2025-01-04 23:43 Filip Janus <[email protected]>
2025-04-22 07:17 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Filip Janus @ 2025-01-04 23:43 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
I apologize for multiple messages, but I found a small bug in the previous
version.
-Filip-
On Sat, Jan 4, 2025 at 23:40, Filip Janus <[email protected]> wrote:
> Even though I started with lz4, I have also added pglz support and
> improved the memory management based on the review provided.
>
>
>
> -Filip-
>
>
> On Thu, Nov 28, 2024 at 12:32, Filip Janus <[email protected]> wrote:
>
>>
>> I've added a regression test for lz4 compression if the server is
>> compiled with the "--with-lz4" option.
>>
>> -Filip-
>>
>>
>> On Sun, Nov 24, 2024 at 15:53, Filip Janus <[email protected]> wrote:
>>
>>>
>>>
>>> -Filip-
>>>
>>>
>>> On Wed, Nov 20, 2024 at 1:35, Tomas Vondra <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> On 11/18/24 22:58, Filip Janus wrote:
>>>> > ...
>>>> > Hi all,
>>>> > PostgreSQL supports data compression nowadays, but compression of
>>>> > temporary files has not been implemented yet. Huge queries can
>>>> > produce a significant amount of temporary data that needs to be
>>>> > stored on disk and causes many expensive I/O operations.
>>>> > I am attaching a proposed patch that enables temporary file
>>>> > compression for hash joins for now. Initially, I've chosen the LZ4
>>>> > compression algorithm. It would probably have made more sense to
>>>> > start with pglz, but I realized that too late.
>>>> >
>>>>
>>>> Thanks for the idea & patch. I agree this might be quite useful for
>>>> workloads generating a lot of temporary files for stuff like sorts etc.
>>>> I think it will be interesting to think about the trade offs, i.e. how
>>>> to pick the compression level - at some point the compression ratio
>>>> stops improving while paying more and more CPU time. Not sure what the
>>>> right choice is, so using default seems fine.
>>>>
>>>> I agree it'd be better to start with pglz, and only then add lz4 etc.
>>>> Firstly, pglz is simply the built-in compression, supported everywhere.
>>>> And it's also simpler to implement, I think.
>>>>
>>>> > # Future possible improvements
>>>> > Reducing the number of memory allocations within the dumping and
>>>> > loading of the buffer. I have two ideas for solving this problem.
>>>> > I would either add a buffer into struct BufFile or provide the
>>>> > buffer as an argument from the caller. For sequential execution,
>>>> > I would prefer the second option.
>>>> >
>>>>
>>>> Yes, this would be good. Doing a palloc+pfree for each compression is
>>>> going to be expensive, especially because these buffers are going to be
>>>> large - likely larger than 8kB. Which means it's not cached in the
>>>> memory context, etc.
>>>>
>>>> Adding it to the BufFile is not going to fly, because that doubles the
>>>> amount of memory per file. And we already have major issues with hash
>>>> joins consuming massive amounts of memory. But at the same time the
>>>> buffer is only needed during compression, and there's only one at a
>>>> time. So I agree with passing a single buffer as an argument.
>>>>
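A minimal sketch of the "single buffer passed by the caller" idea discussed above, written in plain C rather than against the real BufFile API. The names ScratchBuf, scratch_reserve and frame_block are hypothetical, realloc() stands in for the memory-context allocator, and a plain memcpy() stands in for the actual compressor; only the allocation pattern is the point here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scratch buffer owned by the caller and shared by all files. */
typedef struct ScratchBuf
{
    char   *data;       /* allocated memory, or NULL before first use */
    size_t  capacity;   /* bytes currently allocated */
    int     nallocs;    /* how many times we had to (re)allocate */
} ScratchBuf;

/* Make sure the scratch buffer can hold at least 'needed' bytes. */
char *
scratch_reserve(ScratchBuf *buf, size_t needed)
{
    if (buf->capacity < needed)
    {
        char *p = realloc(buf->data, needed);

        if (p == NULL)
            return NULL;
        buf->data = p;
        buf->capacity = needed;
        buf->nallocs++;
    }
    return buf->data;
}

/*
 * Frame one block as [int payload_size][payload] inside the scratch buffer.
 * memcpy() is a stand-in for the real compressor (an assumption of this
 * sketch). Returns the total framed size, or 0 on allocation failure.
 */
size_t
frame_block(ScratchBuf *buf, const char *src, int len)
{
    char *dst;

    assert(len >= 0);
    dst = scratch_reserve(buf, sizeof(int) + (size_t) len);
    if (dst == NULL)
        return 0;
    memcpy(dst, &len, sizeof(int));
    memcpy(dst + sizeof(int), src, (size_t) len);
    return sizeof(int) + (size_t) len;
}
```

With this shape, dumping many same-sized blocks allocates once instead of once per block, and the per-file struct does not grow.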
>>>> > # Future plan/open questions
>>>> > In the future, I would like to add support for pglz and zstd.
>>>> > Further, I plan to extend temporary file compression to sorting,
>>>> > GiST index creation, etc.
>>>> >
>>>> > Experimenting with the stream mode of the compression algorithms.
>>>> > The compression ratio of LZ4 in block mode seems satisfactory; the
>>>> > stream mode could produce a better ratio, but it would consume more
>>>> > memory because the LZ4 stream compression context has to be stored.
>>>> >
>>>>
>>>> One thing I realized is that this only enables temp file compression for
>>>> a single place - hash join spill files. AFAIK this is because compressed
>>>> files don't support random access, and the other places might need that.
>>>>
>>>> Is that correct? The patch does not explain this anywhere. If that's
>>>> correct, the patch probably should mention this in a comment for the
>>>> 'compress' argument added to BufFileCreateTemp(), so that it's clear
>>>> when it's legal to set compress=true.
>>>>
>>>
>>> I will add the description there.
>>>
>>>
>>>> Which other places might compress temp files? Surely hash joins are not
>>>> the only place that could benefit from this, right?
>>>>
>>>
>>> Yes, you are definitely right. I chose hash joins as a POC because
>>> they perform no seeks other than seeks to the beginning of the buffer.
>>> I have focused on hash joins, but there are definitely other places
>>> where the compression could be used. I want to add support for other
>>> places in the future.
>>>
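To illustrate why seeks other than to the start are a problem: with the on-disk layout used by the patch (an int size followed by the compressed bytes), the offset of block k depends on the compressed sizes of all earlier blocks. The following is a minimal sketch over an in-memory byte array; the helper find_block_offset is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Walk a sequence of [int size][payload] blocks stored in 'file' and return
 * the byte offset of the header of block number 'target' (0-based), or -1 if
 * the data ends or is truncated before that block.
 */
long
find_block_offset(const char *file, size_t file_len, int target)
{
    size_t offset = 0;

    assert(file != NULL);

    for (int blockno = 0;; blockno++)
    {
        int payload_size;

        if (file_len - offset < sizeof(int))
            return -1;          /* no complete header left */
        if (blockno == target)
            return (long) offset;

        memcpy(&payload_size, file + offset, sizeof(int));
        if (payload_size < 0 ||
            (size_t) payload_size > file_len - offset - sizeof(int))
            return -1;          /* corrupt or truncated block */

        /* The only way forward is to skip the header plus the payload. */
        offset += sizeof(int) + (size_t) payload_size;
    }
}
```

Reaching block k costs k header reads, whereas an uncompressed BufFile can compute the position arithmetically. Supporting random access would need something like a block index, which this sketch does not attempt.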
>>>
>>>> Another thing is testing. If I run regression tests, it won't use
>>>> compression at all, because the GUC has "none" by default, right? But we
>>>> need some testing, so how would we do that? One option would be to add a
>>>> regression test that explicitly sets the GUC and does a hash join, but
>>>> that won't work with lz4 (because that may not be enabled).
>>>
>>>
>>> Right, it's "none" by default. In my opinion we should test every
>>> supported compression method, so I will try to add an environment
>>> variable as you recommended.
>>>
>>>
>>>>
>>>> Another option might be to add a PG_TEST_xxx environment variable that
>>>> determines compression to use. Something like PG_TEST_USE_UNIX_SOCKETS.
>>>> But perhaps there's a simpler way.
>>>>
>>>> > # Benchmark
>>>> > I prepared three different databases to check expectations. Each
>>>> > dataset is described below. My testing demonstrates that my patch
>>>> > improves the execution time of huge hash joins. Also, my
>>>> > implementation should not negatively affect the performance of
>>>> > smaller queries. The amount of memory needed for temporary files
>>>> > was reduced in every execution without a significant impact on
>>>> > execution time.
>>>> >
>>>> > ## Dataset A:
>>>> > Tables:
>>>> > table_a(bigint id, text data_text, integer data_number) - 10000000 rows
>>>> > table_b(bigint id, integer ref_id, numeric data_value, bytea data_blob) - 10000000 rows
>>>> > Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;
>>>> >
>>>> > The tables contain highly compressible data.
>>>> > The query showed a reduction in temporary file usage from ~20GB to
>>>> > ~3GB; this reduction also cut the execution time of the query by
>>>> > about 10s.
>>>> >
>>>> >
>>>> > ## Dataset B:
>>>> > Tables:
>>>> > table_a(integer id, text data_blob) - 1110000 rows
>>>> > table_b(integer id, text data_blob) - 10000000 rows
>>>> > Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;
>>>> >
>>>> > The tables contain less compressible data. data_blob was generated
>>>> > by a pseudo-random generator.
>>>> > In this case, the data reduction was only ~50%. Also, the execution
>>>> > time was reduced only slightly with compression enabled.
>>>> >
>>>> > The second scenario demonstrates no overhead when compression is
>>>> > enabled and work_mem is increased to avoid temp file usage.
>>>> >
>>>> > ## Dataset C:
>>>> > Tables:
>>>> > customers(integer,text,text,text,text)
>>>> > order_items(integer,integer,integer,integer,numeric(10,2))
>>>> > orders(integer,integer,timestamp,numeric(10,2))
>>>> > products(integer,text,text,numeric(10,2),integer)
>>>> >
>>>> > Query: SELECT p.product_id, p.name, p.price,
>>>> > SUM(oi.quantity) AS total_quantity, AVG(oi.price) AS avg_item_price
>>>> > FROM eshop.products p
>>>> > JOIN eshop.order_items oi ON p.product_id = oi.product_id
>>>> > JOIN eshop.orders o ON oi.order_id = o.order_id
>>>> > WHERE o.order_date > '2020-01-01' AND p.price > 50
>>>> > GROUP BY p.product_id, p.name, p.price
>>>> > HAVING SUM(oi.quantity) > 1000
>>>> > ORDER BY total_quantity DESC LIMIT 100;
>>>> >
>>>> > This scenario should demonstrate a more realistic usage of the
>>>> > database. Enabled compression slightly reduced the temporary memory
>>>> > usage, but the execution time wasn't affected by compression.
>>>> >
>>>> >
>>>> > +---------+-------------+------------+----------------+----------+
>>>> > | Dataset | Compression | temp_bytes | Execution time | work_mem |
>>>> > +---------+-------------+------------+----------------+----------+
>>>> > | A       | Yes         | 3.09 GiB   | 22s586ms       | 4MB      |
>>>> > |         | No          | 21.89 GiB  | 35s            | 4MB      |
>>>> > +---------+-------------+------------+----------------+----------+
>>>> > | B       | Yes         | 333 MB     | 1815.545 ms    | 4MB      |
>>>> > |         | No          | 146 MB     | 1500.460 ms    | 4MB      |
>>>> > |         | Yes         | 0 MB       | 3262.305 ms    | 80MB     |
>>>> > |         | No          | 0 MB       | 3174.725 ms    | 80MB     |
>>>> > +---------+-------------+------------+----------------+----------+
>>>> > | C       | Yes         | 40 MB      | 1011.020 ms    | 1MB      |
>>>> > |         | No          | 53 MB      | 1034.142 ms    | 1MB      |
>>>> > +---------+-------------+------------+----------------+----------+
>>>> >
>>>> >
>>>>
>>>> Thanks. I'll try to do some benchmarks on my own.
>>>>
>>>> Are these results from a single run, or an average of multiple runs?
>>>
>>>
>>> It is an average of multiple runs.
>>>
>>>> Do you maybe have a script to reproduce this, including the data
>>>> generation?
>>>
>>>
>>> I am attaching my SQL file for database preparation. I also did further
>>> testing on two other machines (see attachment huge_tables.rtf).
>>>
>>>>
>>>> Also, can you share some information about the machine used for this? I
>>>> expect the impact to strongly depend on memory pressure - if the temp
>>>> file fits into page cache (and stays there), it may not benefit from the
>>>> compression, right?
>>>>
>>>
>>> If it fits into the page cache due to compression, I would consider it
>>> a benefit of compression.
>>> I performed further testing on machines with different memory sizes.
>>> Both experiments showed that compression was beneficial for execution
>>> time. The reduction in execution time was more significant on the
>>> machine that had less memory available.
>>>
>>> Tests were performed on:
>>> MacBook Pro M3, 36GB - macOS
>>> Virtual machine, ARM64, 10GB / 6 CPUs - Fedora 39
>>>
>>>
>>>>
>>>> regards
>>>>
>>>> --
>>>> Tomas Vondra
>>>>
>>>>
Attachments:
[application/octet-stream] 0001-This-commit-adds-support-for-temporary-files-compres.patch (77.6K, 3-0001-This-commit-adds-support-for-temporary-files-compres.patch)
From fd4e42c830bdf8231ac7b6ae21326c38baacdc34 Mon Sep 17 00:00:00 2001
From: Filip <[email protected]>
Date: Thu, 24 Oct 2024 12:15:10 +0200
Subject: [PATCH 1/3] This commit adds support for temporary files compression,
it can be used only for hashjoins now.
It also adds GUC parameter temp_file_compression that enables this functionality.
For now, it supports only the lz4 algorithm. In the future,
pglz and zstd support could also be added.
---
src/Makefile.global.in | 1 +
src/backend/access/gist/gistbuildbuffers.c | 2 +-
src/backend/backup/backup_manifest.c | 2 +-
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 176 ++-
src/backend/utils/misc/guc_tables.c | 23 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/sort/logtape.c | 2 +-
src/backend/utils/sort/tuplestore.c | 2 +-
src/include/storage/buffile.h | 13 +-
src/test/regress/GNUmakefile | 4 +
src/test/regress/expected/join_hash_lz4.out | 1166 +++++++++++++++++
src/test/regress/expected/jsonb_jsonpath.out | 2 +-
src/test/regress/sql/join_hash_lz4.sql | 626 +++++++++
14 files changed, 2006 insertions(+), 16 deletions(-)
create mode 100644 src/test/regress/expected/join_hash_lz4.out
create mode 100644 src/test/regress/sql/join_hash_lz4.sql
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 42f50b4976..06e701fe9c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -193,6 +193,7 @@ with_ldap = @with_ldap@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
+with_lz4 = @with_lz4@
with_system_tzdata = @with_system_tzdata@
with_uuid = @with_uuid@
with_zlib = @with_zlib@
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4c2301da00..9b3b00142a 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -54,7 +54,7 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
* Create a temporary file to hold buffer pages that are swapped out of
* memory.
*/
- gfbb->pfile = BufFileCreateTemp(false);
+ gfbb->pfile = BufFileCreateTemp(false, false);
gfbb->nFileBlocks = 0;
/* Initialize free page management. */
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index a2e2f86332..f8a3e1f0f4 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -65,7 +65,7 @@ InitializeBackupManifest(backup_manifest_info *manifest,
manifest->buffile = NULL;
else
{
- manifest->buffile = BufFileCreateTemp(false);
+ manifest->buffile = BufFileCreateTemp(false, false);
manifest->manifest_ctx = pg_cryptohash_create(PG_SHA256);
if (pg_cryptohash_init(manifest->manifest_ctx) < 0)
elog(ERROR, "failed to initialize checksum of backup manifest: %s",
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2f7170604d..1b5c6448ef 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false);
+ file = BufFileCreateTemp(false, true);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index a27f51f622..818ef39d5c 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -54,6 +54,16 @@
#include "storage/fd.h"
#include "utils/resowner.h"
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#define NO_LZ4_SUPPORT() \
+ ereport(ERROR, \
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \
+ errmsg("compression method lz4 not supported"), \
+ errdetail("This functionality requires the server to be built with lz4 support.")))
+
/*
* We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
* The reason is that we'd like large BufFiles to be spread across multiple
@@ -62,6 +72,8 @@
#define MAX_PHYSICAL_FILESIZE 0x40000000
#define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ)
+int temp_file_compression = TEMP_NONE_COMPRESSION;
+
/*
* This data structure represents a buffered file that consists of one or
* more physical files (each accessed through a virtual file descriptor
@@ -95,7 +107,7 @@ struct BufFile
off_t curOffset; /* offset part of current pos */
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
-
+ bool compress; /* whether file compression is in use */
/*
* XXX Should ideally us PGIOAlignedBlock, but might need a way to avoid
* wasting per-file alignment padding when some users create many files.
@@ -127,6 +139,7 @@ makeBufFileCommon(int nfiles)
file->curOffset = 0;
file->pos = 0;
file->nbytes = 0;
+ file->compress = false;
return file;
}
@@ -188,9 +201,17 @@ extendBufFile(BufFile *file)
* Note: if interXact is true, the caller had better be calling us in a
* memory context, and with a resource owner, that will survive across
* transaction boundaries.
+ *
+ * If compress is true the temporary files will be compressed before
+ * writing on disk.
+ *
+ * Note: The compression does not support random access. Only the hash joins
+ * use it for now. The seek operation other than seek to the beginning of the
+ * buffile will corrupt temporary data offsets.
+ *
*/
BufFile *
-BufFileCreateTemp(bool interXact)
+BufFileCreateTemp(bool interXact, bool compress)
{
BufFile *file;
File pfile;
@@ -212,6 +233,15 @@ BufFileCreateTemp(bool interXact)
file = makeBufFile(pfile);
file->isInterXact = interXact;
+ if (temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+#ifdef USE_LZ4
+ file->compress = compress;
+#else
+ NO_LZ4_SUPPORT();
+#endif
+ }
+
return file;
}
@@ -275,6 +305,7 @@ BufFileCreateFileSet(FileSet *fileset, const char *name)
file->files[0] = MakeNewFileSetSegment(file, 0);
file->readOnly = false;
+
return file;
}
@@ -455,13 +486,72 @@ BufFileLoadBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
/*
- * Read whatever we can get, up to a full bufferload.
+ * Load data as it is stored in the temporary file
*/
- file->nbytes = FileRead(thisfile,
+ if (!file->compress)
+ {
+
+ /*
+ * Read whatever we can get, up to a full bufferload.
+ */
+ file->nbytes = FileRead(thisfile,
file->buffer.data,
sizeof(file->buffer),
file->curOffset,
WAIT_EVENT_BUFFILE_READ);
+ /*
+ * Read and decompress data from the temporary file
+ * The first reading loads size of the compressed block
+ * Second reading loads compressed data
+ */
+ } else {
+ int nread;
+ int nbytes;
+
+ nread = FileRead(thisfile,
+ &nbytes,
+ sizeof(nbytes),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ /* if not EOF let's continue */
+ if (nread > 0)
+ {
+ /*
+ * A long-lived buffer would make sense to limit the number of
+ * memory allocations
+ */
+ char * buff;
+
+ /*
+ * Read compressed data, curOffset differs with pos
+ * It reads less data than it returns to caller
+ * So the curOffset must be advanced here based on compressed size
+ */
+ file->curOffset+=sizeof(nbytes);
+
+ buff = palloc(nbytes);
+
+ nread = FileRead(thisfile,
+ buff,
+ nbytes,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+
+#ifdef USE_LZ4
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data,nbytes,sizeof(file->buffer));
+ file->curOffset += nread;
+#endif
+
+ if (file->nbytes < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compressed lz4 data is corrupt")));
+ pfree(buff);
+ }
+
+ }
+
if (file->nbytes < 0)
{
file->nbytes = 0;
@@ -494,9 +584,56 @@ static void
BufFileDumpBuffer(BufFile *file)
{
int wpos = 0;
- int bytestowrite;
+ int bytestowrite = 0;
File thisfile;
+
+ /* Save nbytes value because the size changes due to compression */
+ int nbytesOriginal = file->nbytes;
+
+ bool compression = false;
+
+ char * DataToWrite;
+ DataToWrite = file->buffer.data;
+
+ /*
+ * Prepare compressed data to write
+ * size of compressed block needs to be added at the beginning of the
+ * compressed data
+ */
+
+
+ if (file->compress) {
+ int cBufferSize = 0;
+ char * cData;
+ int cSize = 0;
+#ifdef USE_LZ4
+ cBufferSize = LZ4_compressBound(file->nbytes);
+#endif
+ /*
+ * A long-lived buffer would make sense to limit the number of
+ * memory allocations
+ */
+ compression = true;
+ cData = palloc(cBufferSize + sizeof(int));
+#ifdef USE_LZ4
+ /*
+ * Using stream compression would lead to the slight improvement in
+ * compression ratio
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(int),file->nbytes, cBufferSize);
+#endif
+
+ /* Write size of compressed block in front of compressed data
+ * It's used to determine amount of data to read within
+ * decompression process
+ */
+ memcpy(cData,&cSize,sizeof(int));
+ file->nbytes=cSize + sizeof(int);
+ DataToWrite = cData;
+ }
+
/*
* Unlike BufFileLoadBuffer, we must dump the whole buffer even if it
* crosses a component-file boundary; so we need a loop.
@@ -535,7 +672,7 @@ BufFileDumpBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
- file->buffer.data + wpos,
+ DataToWrite + wpos,
bytestowrite,
file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
@@ -564,7 +701,19 @@ BufFileDumpBuffer(BufFile *file)
* logical file position, ie, original value + pos, in case that is less
* (as could happen due to a small backwards seek in a dirty buffer!)
*/
- file->curOffset -= (file->nbytes - file->pos);
+
+
+ if (!file->compress)
+ file->curOffset -= (file->nbytes - file->pos);
+ else
+ if (nbytesOriginal - file->pos != 0)
+ /* curOffset must be corrected also if compression is
+ * enabled, nbytes was changed by compression but we
+ * have to use the original value of nbytes
+ */
+ file->curOffset-=bytestowrite;
+
+
if (file->curOffset < 0) /* handle possible segment crossing */
{
file->curFile--;
@@ -577,6 +726,9 @@ BufFileDumpBuffer(BufFile *file)
*/
file->pos = 0;
file->nbytes = 0;
+
+ if (compression)
+ pfree(DataToWrite);
}
/*
@@ -602,8 +754,14 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
{
if (file->pos >= file->nbytes)
{
- /* Try to load more data into buffer. */
- file->curOffset += file->pos;
+ /* Try to load more data into buffer.
+ *
+ * curOffset is moved within BufFileLoadBuffer
+ * because stored data size differs from loaded/
+ * decompressed size
+ * */
+ if (!file->compress)
+ file->curOffset += file->pos;
file->pos = 0;
file->nbytes = 0;
BufFileLoadBuffer(file);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 686309db58..3821caf763 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -73,6 +73,7 @@
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
+#include "storage/buffile.h"
#include "storage/large_object.h"
#include "storage/pg_shmem.h"
#include "storage/predicate.h"
@@ -454,6 +455,17 @@ static const struct config_enum_entry default_toast_compression_options[] = {
#endif
{NULL, 0, false}
};
+/*
+ * pglz and zstd support should be added as future enhancement
+ *
+ */
+static const struct config_enum_entry temp_file_compression_options[] = {
+ {"no", TEMP_NONE_COMPRESSION, false},
+#ifdef USE_LZ4
+ {"lz4", TEMP_LZ4_COMPRESSION, false},
+#endif
+ {NULL, 0, false}
+};
static const struct config_enum_entry wal_compression_options[] = {
{"pglz", WAL_COMPRESSION_PGLZ, false},
@@ -4856,6 +4868,17 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"temp_file_compression", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the compression method for temporary files."),
+ NULL
+ },
+ &temp_file_compression,
+ TEMP_NONE_COMPRESSION,
+ temp_file_compression_options,
+ NULL, NULL, NULL
+ },
+
{
{"default_transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the transaction isolation level of each new transaction."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a..e9c0b36352 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -177,6 +177,7 @@
#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
# for NOTIFY / LISTEN queue
+#temp_file_compression = 'no' # enables temporary files compression
# - Kernel Resources -
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 44b30e86ad..af43b3ebb1 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -592,7 +592,7 @@ LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
lts->pfile = BufFileCreateFileSet(&fileset->fs, filename);
}
else
- lts->pfile = BufFileCreateTemp(false);
+ lts->pfile = BufFileCreateTemp(false, false);
return lts;
}
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index a720d70200..a952f0f4f5 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -860,7 +860,7 @@ tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
*/
oldcxt = MemoryContextSwitchTo(state->context->parent);
- state->myfile = BufFileCreateTemp(state->interXact);
+ state->myfile = BufFileCreateTemp(state->interXact, false);
MemoryContextSwitchTo(oldcxt);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 5f6d7c8e3f..486b552e31 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -32,11 +32,22 @@
typedef struct BufFile BufFile;
+typedef enum
+{
+ TEMP_NONE_COMPRESSION,
+#ifdef USE_LZ4
+ TEMP_LZ4_COMPRESSION
+#endif
+} TempCompression;
+
+extern PGDLLIMPORT int temp_file_compression;
+
+
/*
* prototypes for functions in buffile.c
*/
-extern BufFile *BufFileCreateTemp(bool interXact);
+extern BufFile *BufFileCreateTemp(bool interXact, bool compress);
extern void BufFileClose(BufFile *file);
extern pg_nodiscard size_t BufFileRead(BufFile *file, void *ptr, size_t size);
extern void BufFileReadExact(BufFile *file, void *ptr, size_t size);
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index 9003435aab..859eb79bd7 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -114,6 +114,10 @@ submake-contrib-spi: | submake-libpgport submake-generated-headers
REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 \
$(EXTRA_REGRESS_OPTS)
+ifeq ($(with_lz4),yes)
+override EXTRA_TESTS := join_hash_lz4 $(EXTRA_TESTS)
+endif
+
check: all
$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/expected/join_hash_lz4.out b/src/test/regress/expected/join_hash_lz4.out
new file mode 100644
index 0000000000..966a5cd8f5
--- /dev/null
+++ b/src/test/regress/expected/join_hash_lz4.out
@@ -0,0 +1,1166 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan 1)) AND ((SubPlan 2) = (SubPlan 3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan 1) = hjtest_1.id) AND ((SubPlan 3) = (SubPlan 2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/expected/jsonb_jsonpath.out b/src/test/regress/expected/jsonb_jsonpath.out
index acdf7e436f..b31b32490d 100644
--- a/src/test/regress/expected/jsonb_jsonpath.out
+++ b/src/test/regress/expected/jsonb_jsonpath.out
@@ -2687,7 +2687,7 @@ select jsonb_path_query('"12:34:56 +5:30"', '$.time_tz().string()');
select jsonb_path_query_tz('"12:34:56"', '$.time_tz().string()');
jsonb_path_query_tz
---------------------
- "12:34:56-07:00"
+ "12:34:56-08:00"
(1 row)
select jsonb_path_query('"12:34:56"', '$.time().string()');
diff --git a/src/test/regress/sql/join_hash_lz4.sql b/src/test/regress/sql/join_hash_lz4.sql
new file mode 100644
index 0000000000..1d19c1980e
--- /dev/null
+++ b/src/test/regress/sql/join_hash_lz4.sql
@@ -0,0 +1,626 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.39.5 (Apple Git-154)
[application/octet-stream] 0002-This-commit-enhance-temporary-file-compression.patch (8.2K, 4-0002-This-commit-enhance-temporary-file-compression.patch)
From 0087b33bb80b67ae4f7d4901b047da5e2c913d72 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Sun, 1 Dec 2024 17:43:27 +0100
Subject: [PATCH 2/3] This commit enhances temporary file compression. It
implements just one working buffer for compression and decompression to avoid
wasting memory. The buffer is allocated in the top memory context.
Also, it adds pglz support and restructures the code so that other
compression methods can be added easily.
---
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 111 +++++++++++++++++++---------
src/backend/utils/misc/guc_tables.c | 1 +
src/include/storage/buffile.h | 4 +-
4 files changed, 80 insertions(+), 38 deletions(-)
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1b5c6448ef..32cdb63ff8 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false, true);
+ file = BufFileCreateCompressTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 818ef39d5c..2b270211cc 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -53,7 +53,9 @@
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/resowner.h"
+#include "utils/memutils.h"
+#include "common/pg_lzcompress.h"
#ifdef USE_LZ4
#include <lz4.h>
#endif
@@ -108,6 +110,7 @@ struct BufFile
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
bool compress; /* State of usege file compression */
+ char *cBuffer;
/*
* XXX Should ideally us PGIOAlignedBlock, but might need a way to avoid
* wasting per-file alignment padding when some users create many files.
@@ -140,6 +143,7 @@ makeBufFileCommon(int nfiles)
file->pos = 0;
file->nbytes = 0;
file->compress = false;
+ file->cBuffer = NULL;
return file;
}
@@ -235,16 +239,45 @@ BufFileCreateTemp(bool interXact, bool compress)
if (temp_file_compression != TEMP_NONE_COMPRESSION)
{
-#ifdef USE_LZ4
file->compress = compress;
-#else
- NO_LZ4_SUPPORT();
-#endif
}
return file;
+
}
+/*
+ * Wrapper for BufFileCreateTemp.
+ * We want to limit the number of memory allocations for the compression buffer;
+ * one buffer is enough for all compression operations.
+ */
+BufFile *
+BufFileCreateCompressTemp(bool interXact){
+ static char * buff = NULL;
+ BufFile *tmpBufFile = BufFileCreateTemp(interXact, true);
+
+ if (buff == NULL && temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ int size = 0;
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ size = LZ4_compressBound(BLCKSZ)+sizeof(int);
+#endif
+ break;
+ case TEMP_PGLZ_COMPRESSION:
+ size = pglz_maximum_compressed_size(BLCKSZ, BLCKSZ)+sizeof(int);
+ break;
+ }
+ /*
+ * Persistent buffer for all temporary file compressions
+ */
+ buff = MemoryContextAlloc(TopMemoryContext, size);
+ }
+ tmpBufFile->cBuffer = buff;
+ return tmpBufFile;
+}
/*
* Build the name for a given segment of a given BufFile.
*/
@@ -516,12 +549,10 @@ BufFileLoadBuffer(BufFile *file)
/* if not EOF let's continue */
if (nread > 0)
{
- /*
- * A long life buffer would make sence to limit number of
- * memory allocations
- */
- char * buff;
+ /* A long-lived buffer limits the number of memory allocations */
+ char * buff = file->cBuffer;
+ Assert(file->cBuffer != NULL);
/*
* Read compressed data, curOffset differs with pos
* It reads less data than it returns to caller
@@ -529,25 +560,32 @@ BufFileLoadBuffer(BufFile *file)
*/
file->curOffset+=sizeof(nbytes);
- buff = palloc(nbytes);
-
nread = FileRead(thisfile,
buff,
nbytes,
file->curOffset,
WAIT_EVENT_BUFFILE_READ);
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
#ifdef USE_LZ4
- file->nbytes = LZ4_decompress_safe(buff,
- file->buffer.data,nbytes,sizeof(file->buffer));
- file->curOffset += nread;
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data,nbytes,sizeof(file->buffer));
#endif
+ break;
+
+ case TEMP_PGLZ_COMPRESSION:
+ file->nbytes = pglz_decompress(buff,nbytes,
+ file->buffer.data,sizeof(file->buffer),false);
+ break;
+ }
+ file->curOffset += nread;
if (file->nbytes < 0)
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("compressed lz4 data is corrupt")));
- pfree(buff);
}
}
@@ -591,8 +629,6 @@ BufFileDumpBuffer(BufFile *file)
/* Save nbytes value because the size changes due to compression */
int nbytesOriginal = file->nbytes;
- bool compression = false;
-
char * DataToWrite;
DataToWrite = file->buffer.data;
@@ -604,26 +640,33 @@ BufFileDumpBuffer(BufFile *file)
if (file->compress) {
- int cBufferSize = 0;
char * cData;
int cSize = 0;
+
+ Assert(file->cBuffer != NULL);
+ cData = file->cBuffer;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+ {
#ifdef USE_LZ4
- cBufferSize = LZ4_compressBound(file->nbytes);
-#endif
- /*
- * A long life buffer would make sence to limit number of
- * memory allocations
- */
- compression = true;
- cData = palloc(cBufferSize + sizeof(int));
-#ifdef USE_LZ4
- /*
- * Using stream compression would lead to the slight improvement in
- * compression ratio
- */
- cSize = LZ4_compress_default(file->buffer.data,
- cData + sizeof(int),file->nbytes, cBufferSize);
+ int cBufferSize = LZ4_compressBound(file->nbytes);
+ /*
+ * Using stream compression would lead to a slight improvement in
+ * compression ratio
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(int),file->nbytes, cBufferSize);
#endif
+ break;
+ }
+ case TEMP_PGLZ_COMPRESSION:
+ cSize = pglz_compress(file->buffer.data,file->nbytes,
+ cData + sizeof(int),PGLZ_strategy_always);
+ break;
+ }
+
/* Write size of compressed block in front of compressed data
* It's used to determine amount of data to read within
@@ -727,8 +770,6 @@ BufFileDumpBuffer(BufFile *file)
file->pos = 0;
file->nbytes = 0;
- if (compression)
- pfree(DataToWrite);
}
/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3821caf763..e4a98d1198 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -461,6 +461,7 @@ static const struct config_enum_entry default_toast_compression_options[] = {
*/
static const struct config_enum_entry temp_file_compression_options[] = {
{"no", TEMP_NONE_COMPRESSION, false},
+ {"pglz", TEMP_PGLZ_COMPRESSION, false},
#ifdef USE_LZ4
{"lz4", TEMP_LZ4_COMPRESSION, false},
#endif
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 486b552e31..b8ce164e4b 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -35,9 +35,8 @@ typedef struct BufFile BufFile;
typedef enum
{
TEMP_NONE_COMPRESSION,
-#ifdef USE_LZ4
+ TEMP_PGLZ_COMPRESSION,
TEMP_LZ4_COMPRESSION
-#endif
} TempCompression;
extern PGDLLIMPORT int temp_file_compression;
@@ -47,6 +46,7 @@ extern PGDLLIMPORT int temp_file_compression;
* prototypes for functions in buffile.c
*/
+extern BufFile *BufFileCreateCompressTemp(bool interXact);
extern BufFile *BufFileCreateTemp(bool interXact, bool compress);
extern void BufFileClose(BufFile *file);
extern pg_nodiscard size_t BufFileRead(BufFile *file, void *ptr, size_t size);
--
2.39.5 (Apple Git-154)
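For readers following the buffile.c changes in patch 2/3: each dumped buffer is written as an int holding the compressed size, followed by the compressed bytes, and the loader reads the size first to know how much to fetch. The sketch below illustrates only that framing plus the single shared scratch buffer. It is not the patch's code. The names rle_compress, rle_decompress, write_block, read_block, BLOCK_SIZE and RLE_BOUND are hypothetical, a toy run-length coder stands in for LZ4/pglz, and a plain byte array stands in for the temporary file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8192            /* hypothetical stand-in for BLCKSZ */
#define RLE_BOUND(n) (2 * (n))     /* worst case of the toy coder below */

/* One scratch buffer reused for every block: length prefix + worst-case payload. */
static unsigned char scratch[sizeof(int) + RLE_BOUND(BLOCK_SIZE)];

/* Toy run-length encoder (assumption: stands in for LZ4/pglz). Returns bytes written. */
static int
rle_compress(const unsigned char *src, int srclen, unsigned char *dst)
{
    int out = 0;
    int i = 0;

    while (i < srclen)
    {
        unsigned char value = src[i];
        int run = 1;

        while (i + run < srclen && src[i + run] == value && run < 255)
            run++;
        dst[out++] = (unsigned char) run;
        dst[out++] = value;
        i += run;
    }
    return out;
}

/* Inverse of rle_compress. Returns bytes produced, or -1 on malformed input. */
static int
rle_decompress(const unsigned char *src, int srclen, unsigned char *dst, int dstcap)
{
    int out = 0;
    int i;

    if (srclen % 2 != 0)
        return -1;
    for (i = 0; i < srclen; i += 2)
    {
        int run = src[i];

        if (run == 0 || out + run > dstcap)
            return -1;
        memset(dst + out, src[i + 1], (size_t) run);
        out += run;
    }
    return out;
}

/* Append one framed block, [int csize][csize bytes], at *off. Returns bytes appended. */
static int
write_block(unsigned char *file, size_t *off, const unsigned char *data, int len)
{
    int csize = rle_compress(data, len, scratch + sizeof(int));

    memcpy(scratch, &csize, sizeof(int));       /* size goes in front of the data */
    memcpy(file + *off, scratch, sizeof(int) + (size_t) csize);
    *off += sizeof(int) + (size_t) csize;
    return (int) sizeof(int) + csize;
}

/* Read one framed block at *off. Returns the decompressed length, or -1 on error. */
static int
read_block(const unsigned char *file, size_t filelen, size_t *off,
           unsigned char *out, int outcap)
{
    int csize;
    int n;

    if (*off + sizeof(int) > filelen)
        return -1;
    memcpy(&csize, file + *off, sizeof(int));
    if (csize < 0 || (size_t) csize > filelen - *off - sizeof(int))
        return -1;
    n = rle_decompress(file + *off + sizeof(int), csize, out, outcap);
    if (n < 0)
        return -1;
    *off += sizeof(int) + (size_t) csize;       /* file offset advances by the stored size */
    return n;
}
```

Note that the read offset advances by the compressed size while the caller receives a full block of data, which is the same asymmetry the patch comment describes for curOffset versus pos. The scratch buffer here is sized for the worst case of the one coder in use. A buffer shared across several compression methods would presumably need to cover the largest bound among them.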
[application/octet-stream] 0003-Add-test-for-pglz-compression-of-temporary-files.patch (63.3K, 5-0003-Add-test-for-pglz-compression-of-temporary-files.patch)
From 765e5eebb666972f5f3f67b7542a1ce34bbcd5cc Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 26 Dec 2024 18:39:07 +0100
Subject: [PATCH 3/3] Add test for pglz compression of temporary files
---
src/test/regress/expected/join_hash_pglz.out | 1166 ++++++++++++++++++
src/test/regress/parallel_schedule | 4 +-
src/test/regress/sql/join_hash_pglz.sql | 626 ++++++++++
3 files changed, 1795 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/join_hash_pglz.out
create mode 100644 src/test/regress/sql/join_hash_pglz.sql
diff --git a/src/test/regress/expected/join_hash_pglz.out b/src/test/regress/expected/join_hash_pglz.out
new file mode 100644
index 0000000000..99c67f982a
--- /dev/null
+++ b/src/test/regress/expected/join_hash_pglz.out
@@ -0,0 +1,1166 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan 1)) AND ((SubPlan 2) = (SubPlan 3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan 1) = hjtest_1.id) AND ((SubPlan 3) = (SubPlan 2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4f38104ba0..6e1ed70e87 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -15,7 +15,6 @@ test: test_setup
# The first group of parallel tests
# ----------
test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
-
# ----------
# The second group of parallel tests
# multirangetypes depends on rangetypes
@@ -136,3 +135,6 @@ test: fast_default
# run tablespace test at the end because it drops the tablespace created during
# setup that other tests may use.
test: tablespace
+
+# this test is equivalent to join_hash test just the compression is enabled
+test: join_hash_pglz
diff --git a/src/test/regress/sql/join_hash_pglz.sql b/src/test/regress/sql/join_hash_pglz.sql
new file mode 100644
index 0000000000..2686afab27
--- /dev/null
+++ b/src/test/regress/sql/join_hash_pglz.sql
@@ -0,0 +1,626 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.39.5 (Apple Git-154)
* Re: Proposal: Adding compression of temporary files
2025-01-04 23:43 Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
@ 2025-04-22 07:17 ` Filip Janus <[email protected]>
2025-04-25 21:54 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Filip Janus @ 2025-04-22 07:17 UTC (permalink / raw)
To: Dmitry Dolgov <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Alexander Korotkov <[email protected]>; pgsql-hackers
Since the patch was prepared months ago, it needs to be rebased.
-Filip-
On Sun, Apr 13, 2025 at 21:53, Dmitry Dolgov <[email protected]> wrote:
> > On Fri, Mar 28, 2025 at 09:23:13AM GMT, Filip Janus wrote:
> > > + else
> > > + if (nbytesOriginal - file->pos != 0)
> > > + /* curOffset must be corrected also if compression is
> > > + * enabled, nbytes was changed by compression but we
> > > + * have to use the original value of nbytes
> > > + */
> > > + file->curOffset-=bytestowrite;
> > >
> > > It's not something introduced by the compression patch - the first part
> > > is what we used to do before. But I find it a bit confusing - isn't it
> > > mixing the "logical file position" correction we did before with the
> > > adjustment possibly needed due to compression?
> > >
> > > In fact, isn't it going to fail if the code gets multiple loops in
> > >
> > > while (wpos < file->nbytes)
> > > {
> > > ...
> > > }
> > >
> > > because bytestowrite will be the value from the last loop? I haven't
> > > tried, but I guess writing wide tuples (more than 8k) might fail.
> > >
> >
> > I will definitely test it with tuples larger than 8K.
> >
> > Maybe I don't understand it correctly, but the adjustment is performed
> > when file->nbytes and file->pos differ. So it must also be kept when
> > we are working with compressed data. The problem is that the data
> > stored compressed on disk has a different size than the incoming
> > uncompressed data, so what should the correction value be?
> > By debugging, I realized that the correction should correspond to the
> > size of bytestowrite from the last iteration of the loop.
>
> I agree, this looks strange. If the idea is to set curOffset to its
> original value + pos, and the original value was advanced multiple times
> by bytestowrite, it seems incorrect to adjust it by bytestowrite only
> once. From what I see, the current tests do not exercise a case where
> the while loop runs multiple iterations, so it looks fine.
>
> At the same time, maybe I'm missing something, but what exactly should
> such a test for 8k tuples and multiple loops in the while block look like?
> E.g. when I force a hash join on a table with a single wide text column,
> the minimal tuple that is written to the temporary file still has a
> rather small length, I assume due to toasting. Is there some other
> way to achieve that?
>
>
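The arithmetic behind the concern quoted above can be shown with a small sketch in C. It is not the BufFile code: SketchFile, sketch_dump, rewind_last_only and rewind_all are hypothetical names, and the chunk sizes are made up. It only models an offset that is advanced once per loop iteration and then corrected either by the last iteration's bytestowrite or by the sum of all iterations. It does not model file->pos or compression.

```c
#include <stddef.h>

/* Hypothetical stand-in for the offset bookkeeping; this is not BufFile. */
typedef struct SketchFile
{
    long curOffset;     /* physical position, advanced on every write */
} SketchFile;

/*
 * Mimic a dump loop that issues one write per chunk.  Returns the size of
 * the last chunk written and stores the sum of all chunks in *total.
 */
static long
sketch_dump(SketchFile *file, const long *chunks, size_t nchunks, long *total)
{
    long bytestowrite = 0;

    *total = 0;
    for (size_t i = 0; i < nchunks; i++)
    {
        bytestowrite = chunks[i];
        file->curOffset += bytestowrite;    /* advanced in every iteration */
        *total += bytestowrite;
    }
    return bytestowrite;                    /* value from the last iteration */
}

/* Offset left behind when only the last iteration's write is subtracted. */
static long
rewind_last_only(long start, const long *chunks, size_t nchunks)
{
    SketchFile file = {start};
    long total;
    long last = sketch_dump(&file, chunks, nchunks, &total);

    return file.curOffset - last;
}

/* Offset left behind when every iteration's write is subtracted. */
static long
rewind_all(long start, const long *chunks, size_t nchunks)
{
    SketchFile file = {start};
    long total;

    (void) sketch_dump(&file, chunks, nchunks, &total);
    return file.curOffset - total;
}
```

With a single chunk, both helpers return the starting offset, which matches the observation that the current tests look fine. With chunks of 100, 80 and 60 bytes starting at offset 1000, rewind_last_only leaves the offset at 1180 while rewind_all returns 1000. Which target is right for the real code depends on details this sketch leaves out.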
Attachments:
[application/octet-stream] 0002-Add-test-for-temporary-files-compression-this-commit.patch (48.5K, 3-0002-Add-test-for-temporary-files-compression-this-commit.patch)
From 4196f056c6d5e8eeac23a064475d52414b45ca23 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Wed, 16 Apr 2025 12:08:28 +0200
Subject: [PATCH 2/2] Add test for temporary files compression, this commit
adds tests for lz4 and pglz.
---
src/test/regress/GNUmakefile | 4 +
src/test/regress/parallel_schedule | 4 +-
src/test/regress/sql/join_hash_lz4.sql | 626 ++++++++++++++++++++++++
src/test/regress/sql/join_hash_pglz.sql | 626 ++++++++++++++++++++++++
4 files changed, 1259 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/sql/join_hash_lz4.sql
create mode 100644 src/test/regress/sql/join_hash_pglz.sql
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index ef2bddf42ca..00757a44ca6 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -94,6 +94,10 @@ installdirs-tests: installdirs
REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 \
$(EXTRA_REGRESS_OPTS)
+ifeq ($(with_lz4),yes)
+override EXTRA_TESTS := join_hash_lz4 $(EXTRA_TESTS)
+endif
+
check: all
$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0f38caa0d24..7701e57fad3 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -15,7 +15,6 @@ test: test_setup
# The first group of parallel tests
# ----------
test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
-
# ----------
# The second group of parallel tests
# multirangetypes depends on rangetypes
@@ -136,3 +135,6 @@ test: fast_default
# run tablespace test at the end because it drops the tablespace created during
# setup that other tests may use.
test: tablespace
+
+# this test is equivalent to join_hash test just the compression is enabled
+test: join_hash_pglz
diff --git a/src/test/regress/sql/join_hash_lz4.sql b/src/test/regress/sql/join_hash_lz4.sql
new file mode 100644
index 00000000000..1d19c1980e1
--- /dev/null
+++ b/src/test/regress/sql/join_hash_lz4.sql
@@ -0,0 +1,626 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase the number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where no record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
diff --git a/src/test/regress/sql/join_hash_pglz.sql b/src/test/regress/sql/join_hash_pglz.sql
new file mode 100644
index 00000000000..2686afab272
--- /dev/null
+++ b/src/test/regress/sql/join_hash_pglz.sql
@@ -0,0 +1,626 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase the number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.39.5 (Apple Git-154)
[application/octet-stream] 0001-This-commit-adds-support-for-temporary-files-compres.patch (15.7K, 4-0001-This-commit-adds-support-for-temporary-files-compres.patch)
download | inline diff:
From a6a465cf060e9e0b25ee3389bb57657be260f66b Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Wed, 16 Apr 2025 12:03:03 +0200
Subject: [PATCH 1/2] This commit adds support for temporary files compression,
it can be used only for hashjoins now.
It also adds GUC parameter temp_file_compression that enables this functionality.
For now, it supports just the lz4 and pglz algorithms. In the future,
zstd support could also be implemented.
It uses a single working buffer for compression and decompression to avoid
wasting memory. The buffer is allocated in the top memory context.
---
src/Makefile.global.in | 1 +
src/backend/access/gist/gistbuildbuffers.c | 2 +-
src/backend/backup/backup_manifest.c | 2 +-
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 219 +++++++++++++++++-
src/backend/utils/misc/guc_tables.c | 24 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/sort/logtape.c | 2 +-
src/backend/utils/sort/tuplestore.c | 2 +-
src/include/storage/buffile.h | 13 +-
10 files changed, 252 insertions(+), 16 deletions(-)
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 6722fbdf365..6ff67bda17c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -201,6 +201,7 @@ with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
+with_lz4 = @with_lz4@
with_system_tzdata = @with_system_tzdata@
with_uuid = @with_uuid@
with_zlib = @with_zlib@
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 0707254d18e..9cc371f47fe 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -54,7 +54,7 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
* Create a temporary file to hold buffer pages that are swapped out of
* memory.
*/
- gfbb->pfile = BufFileCreateTemp(false);
+ gfbb->pfile = BufFileCreateTemp(false, false);
gfbb->nFileBlocks = 0;
/* Initialize free page management. */
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index 22e2be37c95..c9f7daa1497 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -65,7 +65,7 @@ InitializeBackupManifest(backup_manifest_info *manifest,
manifest->buffile = NULL;
else
{
- manifest->buffile = BufFileCreateTemp(false);
+ manifest->buffile = BufFileCreateTemp(false, false);
manifest->manifest_ctx = pg_cryptohash_create(PG_SHA256);
if (pg_cryptohash_init(manifest->manifest_ctx) < 0)
elog(ERROR, "failed to initialize checksum of backup manifest: %s",
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5661ad76830..384265ca74a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false);
+ file = BufFileCreateCompressTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..10da6308004 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -53,6 +53,18 @@
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/resowner.h"
+#include "utils/memutils.h"
+
+#include "common/pg_lzcompress.h"
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#define NO_LZ4_SUPPORT() \
+ ereport(ERROR, \
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \
+ errmsg("compression method lz4 not supported"), \
+ errdetail("This functionality requires the server to be built with lz4 support.")))
/*
* We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
@@ -62,6 +74,8 @@
#define MAX_PHYSICAL_FILESIZE 0x40000000
#define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ)
+int temp_file_compression = TEMP_NONE_COMPRESSION;
+
/*
* This data structure represents a buffered file that consists of one or
* more physical files (each accessed through a virtual file descriptor
@@ -95,7 +109,8 @@ struct BufFile
off_t curOffset; /* offset part of current pos */
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
-
+ bool compress; /* State of usage of file compression */
+ char *cBuffer;
/*
* XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
* wasting per-file alignment padding when some users create many files.
@@ -127,6 +142,8 @@ makeBufFileCommon(int nfiles)
file->curOffset = 0;
file->pos = 0;
file->nbytes = 0;
+ file->compress = false;
+ file->cBuffer = NULL;
return file;
}
@@ -188,9 +205,17 @@ extendBufFile(BufFile *file)
* Note: if interXact is true, the caller had better be calling us in a
* memory context, and with a resource owner, that will survive across
* transaction boundaries.
+ *
+ * If compress is true, the temporary files will be compressed before
+ * being written to disk.
+ *
+ * Note: Compressed files do not support random access. Only hash joins
+ * use compression for now. Any seek other than a seek to the beginning of
+ * the buffile will corrupt the temporary data offsets.
+ *
*/
BufFile *
-BufFileCreateTemp(bool interXact)
+BufFileCreateTemp(bool interXact, bool compress)
{
BufFile *file;
File pfile;
@@ -212,9 +237,47 @@ BufFileCreateTemp(bool interXact)
file = makeBufFile(pfile);
file->isInterXact = interXact;
+ if (temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ file->compress = compress;
+ }
+
return file;
-}
+}
+/*
+ * Wrapper for BufFileCreateTemp.
+ * We want to limit the number of memory allocations for the compression buffer;
+ * a single buffer shared by all compression operations is enough.
+ */
+BufFile *
+BufFileCreateCompressTemp(bool interXact){
+ static char * buff = NULL;
+ BufFile *tmpBufFile = BufFileCreateTemp(interXact, true);
+
+ if (buff == NULL && temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ int size = 0;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ size = LZ4_compressBound(BLCKSZ)+sizeof(int);
+#endif
+ break;
+ case TEMP_PGLZ_COMPRESSION:
+ size = pglz_maximum_compressed_size(BLCKSZ, BLCKSZ)+sizeof(int);
+ break;
+ }
+ /*
+ * Persistent buffer for all temporary file compressions
+ */
+ buff = MemoryContextAlloc(TopMemoryContext, size);
+ }
+ tmpBufFile->cBuffer = buff;
+ return tmpBufFile;
+}
/*
* Build the name for a given segment of a given BufFile.
*/
@@ -275,6 +338,7 @@ BufFileCreateFileSet(FileSet *fileset, const char *name)
file->files[0] = MakeNewFileSetSegment(file, 0);
file->readOnly = false;
+
return file;
}
@@ -457,11 +521,75 @@ BufFileLoadBuffer(BufFile *file)
/*
* Read whatever we can get, up to a full bufferload.
*/
- file->nbytes = FileRead(thisfile,
+ if (!file->compress)
+ {
+
+ /*
+ * Read whatever we can get, up to a full bufferload.
+ */
+ file->nbytes = FileRead(thisfile,
file->buffer.data,
- sizeof(file->buffer.data),
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ /*
+ * Read and decompress data from the temporary file
+ * The first reading loads size of the compressed block
+ * Second reading loads compressed data
+ */
+ } else {
+ int nread;
+ int nbytes;
+
+ nread = FileRead(thisfile,
+ &nbytes,
+ sizeof(nbytes),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ /* if not EOF let's continue */
+ if (nread > 0)
+ {
+ /* A long-lived buffer limits the number of memory allocations */
+ char * buff = file->cBuffer;
+
+ Assert(file->cBuffer != NULL);
+ /*
+ * Read compressed data. curOffset no longer tracks pos here:
+ * fewer bytes are read from disk than are returned to the caller,
+ * so curOffset must be advanced here by the compressed size.
+ */
+ file->curOffset+=sizeof(nbytes);
+
+ nread = FileRead(thisfile,
+ buff,
+ nbytes,
file->curOffset,
WAIT_EVENT_BUFFILE_READ);
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data,nbytes,sizeof(file->buffer));
+#endif
+ break;
+
+ case TEMP_PGLZ_COMPRESSION:
+ file->nbytes = pglz_decompress(buff,nbytes,
+ file->buffer.data,sizeof(file->buffer),false);
+ break;
+ }
+ file->curOffset += nread;
+
+ if (file->nbytes < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compressed temporary file data is corrupt")));
+ }
+
+ }
+
if (file->nbytes < 0)
{
file->nbytes = 0;
@@ -494,9 +622,61 @@ static void
BufFileDumpBuffer(BufFile *file)
{
int wpos = 0;
- int bytestowrite;
+ int bytestowrite = 0;
File thisfile;
+
+ /* Save nbytes value because the size changes due to compression */
+ int nbytesOriginal = file->nbytes;
+
+ char * DataToWrite;
+ DataToWrite = file->buffer.data;
+
+ /*
+ * Prepare compressed data to write.
+ * The size of the compressed block needs to be added at the beginning of the
+ * compressed data.
+ */
+
+
+ if (file->compress) {
+ char * cData;
+ int cSize = 0;
+
+ Assert(file->cBuffer != NULL);
+ cData = file->cBuffer;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+ {
+#ifdef USE_LZ4
+ int cBufferSize = LZ4_compressBound(file->nbytes);
+ /*
+ * Using stream compression would lead to a slight improvement in
+ * compression ratio
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(int),file->nbytes, cBufferSize);
+#endif
+ break;
+ }
+ case TEMP_PGLZ_COMPRESSION:
+ cSize = pglz_compress(file->buffer.data,file->nbytes,
+ cData + sizeof(int),PGLZ_strategy_always);
+ break;
+ }
+
+
+ /* Write size of compressed block in front of compressed data
+ * It's used to determine amount of data to read within
+ * decompression process
+ */
+ memcpy(cData,&cSize,sizeof(int));
+ file->nbytes=cSize + sizeof(int);
+ DataToWrite = cData;
+ }
+
/*
* Unlike BufFileLoadBuffer, we must dump the whole buffer even if it
* crosses a component-file boundary; so we need a loop.
@@ -535,7 +715,7 @@ BufFileDumpBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
- file->buffer.data + wpos,
+ DataToWrite + wpos,
bytestowrite,
file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
@@ -564,7 +744,19 @@ BufFileDumpBuffer(BufFile *file)
* logical file position, ie, original value + pos, in case that is less
* (as could happen due to a small backwards seek in a dirty buffer!)
*/
- file->curOffset -= (file->nbytes - file->pos);
+
+
+ if (!file->compress)
+ file->curOffset -= (file->nbytes - file->pos);
+ else
+ if (nbytesOriginal - file->pos != 0)
+ /* curOffset must be corrected also if compression is
+ * enabled, nbytes was changed by compression but we
+ * have to use the original value of nbytes
+ */
+ file->curOffset-=bytestowrite;
+
+
if (file->curOffset < 0) /* handle possible segment crossing */
{
file->curFile--;
@@ -577,6 +769,7 @@ BufFileDumpBuffer(BufFile *file)
*/
file->pos = 0;
file->nbytes = 0;
+
}
/*
@@ -602,8 +795,14 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
{
if (file->pos >= file->nbytes)
{
- /* Try to load more data into buffer. */
- file->curOffset += file->pos;
+ /* Try to load more data into buffer.
+ *
+ * curOffset is moved within BufFileLoadBuffer
+ * because stored data size differs from loaded/
+ * decompressed size
+ * */
+ if (!file->compress)
+ file->curOffset += file->pos;
file->pos = 0;
file->nbytes = 0;
BufFileLoadBuffer(file);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..ae052640ac0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -78,6 +78,7 @@
#include "replication/syncrep.h"
#include "storage/aio.h"
#include "storage/bufmgr.h"
+#include "storage/buffile.h"
#include "storage/bufpage.h"
#include "storage/copydir.h"
#include "storage/io_worker.h"
@@ -463,6 +464,18 @@ static const struct config_enum_entry default_toast_compression_options[] = {
#endif
{NULL, 0, false}
};
+/*
+ * zstd support should be added as a future enhancement
+ *
+ */
+static const struct config_enum_entry temp_file_compression_options[] = {
+ {"no", TEMP_NONE_COMPRESSION, false},
+ {"pglz", TEMP_PGLZ_COMPRESSION, false},
+#ifdef USE_LZ4
+ {"lz4", TEMP_LZ4_COMPRESSION, false},
+#endif
+ {NULL, 0, false}
+};
static const struct config_enum_entry wal_compression_options[] = {
{"pglz", WAL_COMPRESSION_PGLZ, false},
@@ -5058,6 +5071,17 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"temp_file_compression", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the compression method used for temporary files."),
+ NULL
+ },
+ &temp_file_compression,
+ TEMP_NONE_COMPRESSION,
+ temp_file_compression_options,
+ NULL, NULL, NULL
+ },
+
{
{"default_transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the transaction isolation level of each new transaction."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34826d01380..77961a45d65 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -182,6 +182,7 @@
#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
# for NOTIFY / LISTEN queue
+#temp_file_compression = 'no' # enables temporary files compression
#file_copy_method = copy # the default is the first option
# copy
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index e529ceb8260..d862e22ef18 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -592,7 +592,7 @@ LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
lts->pfile = BufFileCreateFileSet(&fileset->fs, filename);
}
else
- lts->pfile = BufFileCreateTemp(false);
+ lts->pfile = BufFileCreateTemp(false, false);
return lts;
}
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..ef85924cd21 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -860,7 +860,7 @@ tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
*/
oldcxt = MemoryContextSwitchTo(state->context->parent);
- state->myfile = BufFileCreateTemp(state->interXact);
+ state->myfile = BufFileCreateTemp(state->interXact, false);
MemoryContextSwitchTo(oldcxt);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index a2f4821f240..931a211038b 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -32,11 +32,22 @@
typedef struct BufFile BufFile;
+typedef enum
+{
+ TEMP_NONE_COMPRESSION,
+ TEMP_PGLZ_COMPRESSION,
+ TEMP_LZ4_COMPRESSION
+} TempCompression;
+
+extern PGDLLIMPORT int temp_file_compression;
+
+
/*
* prototypes for functions in buffile.c
*/
-extern BufFile *BufFileCreateTemp(bool interXact);
+extern BufFile *BufFileCreateCompressTemp(bool interXact);
+extern BufFile *BufFileCreateTemp(bool interXact, bool compress);
extern void BufFileClose(BufFile *file);
pg_nodiscard extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
extern void BufFileReadExact(BufFile *file, void *ptr, size_t size);
--
2.39.5 (Apple Git-154)
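The on-disk layout that BufFileDumpBuffer and BufFileLoadBuffer agree on in the patch above is a sequence of blocks, each stored as an int holding the compressed size followed by that many compressed bytes. The following is only a minimal standalone sketch of that framing, not the patch's code: toy_compress, toy_decompress, frame_write and frame_read are hypothetical names, a trivial run-length encoder stands in for LZ4/pglz, and an in-memory array stands in for the temporary file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for LZ4_compress_default() / pglz_compress():
 * run-length encoding as (count, value) byte pairs.
 * Returns the encoded size, or -1 if dst is too small.
 */
static int
toy_compress(const unsigned char *src, int srclen, unsigned char *dst, int dstcap)
{
    int out = 0;

    for (int i = 0; i < srclen;)
    {
        int run = 1;

        while (i + run < srclen && run < 255 && src[i + run] == src[i])
            run++;
        if (out + 2 > dstcap)
            return -1;
        dst[out++] = (unsigned char) run;
        dst[out++] = src[i];
        i += run;
    }
    return out;
}

/* Inverse of toy_compress(); returns decoded size, or -1 on corrupt input. */
static int
toy_decompress(const unsigned char *src, int srclen, unsigned char *dst, int dstcap)
{
    int out = 0;

    if (srclen % 2 != 0)
        return -1;
    for (int i = 0; i < srclen; i += 2)
    {
        int run = src[i];

        if (run == 0 || out + run > dstcap)
            return -1;
        memset(dst + out, src[i + 1], (size_t) run);
        out += run;
    }
    return out;
}

/*
 * Append one block as [int compressed_size][compressed bytes] at *offset.
 * Returns the number of bytes stored in the "file", or -1 on lack of space.
 */
static int
frame_write(const unsigned char *data, int len,
            unsigned char *file, size_t filecap, size_t *offset)
{
    int csize;

    if (*offset + sizeof(int) > filecap)
        return -1;
    csize = toy_compress(data, len, file + *offset + sizeof(int),
                         (int) (filecap - *offset - sizeof(int)));
    if (csize < 0)
        return -1;
    memcpy(file + *offset, &csize, sizeof(int));
    *offset += sizeof(int) + (size_t) csize;
    return (int) sizeof(int) + csize;
}

/*
 * Read the block at *offset: first the size header, then the payload.
 * Returns the decompressed length, 0 at end of file, or -1 on corruption.
 */
static int
frame_read(const unsigned char *file, size_t filelen, size_t *offset,
           unsigned char *out, int outcap)
{
    int csize;
    int n;

    if (*offset + sizeof(int) > filelen)
        return 0;
    memcpy(&csize, file + *offset, sizeof(int));
    if (csize < 0 || *offset + sizeof(int) + (size_t) csize > filelen)
        return -1;
    n = toy_decompress(file + *offset + sizeof(int), csize, out, outcap);
    if (n < 0)
        return -1;
    *offset += sizeof(int) + (size_t) csize;
    return n;
}
```

Note that the read offset advances by the stored (compressed) size, while the caller receives the decompressed length. That mismatch is why the patch moves curOffset inside BufFileLoadBuffer rather than in BufFileReadCommon, and why only sequential reads are safe with this layout.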
^ permalink raw reply [nested|flat] 13+ messages in thread
* Re: Proposal: Adding compression of temporary files
2025-01-04 23:43 Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-22 07:17 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
@ 2025-04-25 21:54 ` Filip Janus <[email protected]>
2026-01-13 12:53 ` Re: Proposal: Adding compression of temporary files lakshmi <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Filip Janus @ 2025-04-25 21:54 UTC (permalink / raw)
To: Dmitry Dolgov <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Alexander Korotkov <[email protected]>; pgsql-hackers
The latest rebase.
-Filip-
On Tue, Apr 22, 2025 at 9:17, Filip Janus <[email protected]> wrote:
> Since the patch was prepared months ago, it needs to be rebased.
>
> -Filip-
>
>
> On Sun, Apr 13, 2025 at 21:53, Dmitry Dolgov <[email protected]> wrote:
>
>> > On Fri, Mar 28, 2025 at 09:23:13AM GMT, Filip Janus wrote:
>> > > + else
>> > > + if (nbytesOriginal - file->pos != 0)
>> > > + /* curOffset must be corrected also if compression is
>> > > + * enabled, nbytes was changed by compression but we
>> > > + * have to use the original value of nbytes
>> > > + */
>> > > + file->curOffset-=bytestowrite;
>> > >
>> > > It's not something introduced by the compression patch - the first part
>> > > is what we used to do before. But I find it a bit confusing - isn't it
>> > > mixing the correction of "logical file position" adjustment we did
>> > > before, and also the adjustment possibly needed due to compression?
>> > >
>> > > In fact, isn't it going to fail if the code gets multiple loops in
>> > >
>> > > while (wpos < file->nbytes)
>> > > {
>> > > ...
>> > > }
>> > >
>> > > because bytestowrite will be the value from the last loop? I haven't
>> > > tried, but I guess writing wide tuples (more than 8k) might fail.
>> > >
>> >
>> > I will definitely test it with larger tuples than 8K.
>> >
>> > Maybe I don't understand it correctly, but the adjustment is performed
>> > when file->nbytes and file->pos differ, so it must also be applied when
>> > we are working with compressed data. The problem is that the data
>> > stored compressed on disk has a different size than the incoming
>> > uncompressed data, so what should the correction value be?
>> > By debugging, I realized that the correction should correspond to the
>> > value of bytestowrite from the last iteration of the loop.
>>
>> I agree, this looks strange. If the idea is to set curOffset to its
>> original value + pos, and the original value was advanced multiple times
>> by bytestowrite, it seems incorrect to adjust it by bytestowrite only
>> once. From what I see, the current tests do not exercise a case where the
>> while loop runs for multiple iterations, so it looks fine.
>>
>> At the same time, maybe I'm missing something, but what exactly should a
>> test for 8k tuples and multiple iterations of the while block look like?
>> E.g. when I force a hash join on a table with a single wide text column,
>> the minimal tuple that is getting written to the temporary file still
>> has rather small length, I assume due to toasting. Is there some other
>> way to achieve that?
>>
>>
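The arithmetic behind the concern quoted above can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the buffile.c code: dump_in_chunks, rewind_by_last_chunk and rewind_by_total are hypothetical names, the offset is a single linear counter (segment switches via curFile are ignored), and max_chunk stands in for whatever makes the write loop take more than one iteration.

```c
#include <assert.h>

/* Result of a simulated BufFileDumpBuffer-style write loop. */
typedef struct
{
    long cur_offset;    /* offset after all chunks were written */
    int  last_chunk;    /* size of the final chunk (the last "bytestowrite") */
} DumpResult;

/*
 * Write nbytes starting at start_offset in chunks of at most max_chunk bytes,
 * advancing the offset after every chunk, as the while loop does.
 */
static DumpResult
dump_in_chunks(long start_offset, int nbytes, int max_chunk)
{
    DumpResult r = {start_offset, 0};
    int wpos = 0;

    assert(max_chunk > 0);
    while (wpos < nbytes)
    {
        int chunk = nbytes - wpos;

        if (chunk > max_chunk)
            chunk = max_chunk;
        r.cur_offset += chunk;
        wpos += chunk;
        r.last_chunk = chunk;
    }
    return r;
}

/* The correction under discussion: subtract only the last chunk. */
static long
rewind_by_last_chunk(DumpResult r)
{
    return r.cur_offset - r.last_chunk;
}

/* Alternative: subtract everything the loop added. */
static long
rewind_by_total(DumpResult r, int nbytes)
{
    return r.cur_offset - nbytes;
}
```

With a single iteration both corrections return the starting offset. With two or more iterations, subtracting only the last chunk leaves the offset ahead of the start by the sum of the earlier chunks, which is the case the current tests reportedly never reach.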
Attachments:
[application/octet-stream] 0002-Add-test-for-temporary-files-compression-this-commit.patch (126.1K, 3-0002-Add-test-for-temporary-files-compression-this-commit.patch)
download | inline diff:
From f5b83504ad924c6638ce4fbf2c26d7a058d76c45 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Wed, 16 Apr 2025 12:08:28 +0200
Subject: [PATCH 2/2] Add test for temporary files compression, this commit
adds tests for lz4 and pglz.
---
src/test/regress/GNUmakefile | 4 +
src/test/regress/expected/join_hash_lz4.out | 1166 ++++++++++++++++++
src/test/regress/expected/join_hash_pglz.out | 1166 ++++++++++++++++++
src/test/regress/parallel_schedule | 4 +-
src/test/regress/sql/join_hash_lz4.sql | 626 ++++++++++
src/test/regress/sql/join_hash_pglz.sql | 626 ++++++++++
6 files changed, 3591 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/join_hash_lz4.out
create mode 100644 src/test/regress/expected/join_hash_pglz.out
create mode 100644 src/test/regress/sql/join_hash_lz4.sql
create mode 100644 src/test/regress/sql/join_hash_pglz.sql
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index ef2bddf42ca..00757a44ca6 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -94,6 +94,10 @@ installdirs-tests: installdirs
REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 \
$(EXTRA_REGRESS_OPTS)
+ifeq ($(with_lz4),yes)
+override EXTRA_TESTS := join_hash_lz4 $(EXTRA_TESTS)
+endif
+
check: all
$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/expected/join_hash_lz4.out b/src/test/regress/expected/join_hash_lz4.out
new file mode 100644
index 00000000000..966a5cd8f55
--- /dev/null
+++ b/src/test/regress/expected/join_hash_lz4.out
@@ -0,0 +1,1166 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan 1)) AND ((SubPlan 2) = (SubPlan 3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan 1) = hjtest_1.id) AND ((SubPlan 3) = (SubPlan 2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/expected/join_hash_pglz.out b/src/test/regress/expected/join_hash_pglz.out
new file mode 100644
index 00000000000..99c67f982af
--- /dev/null
+++ b/src/test/regress/expected/join_hash_pglz.out
@@ -0,0 +1,1166 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan 1)) AND ((SubPlan 2) = (SubPlan 3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan 1) = hjtest_1.id) AND ((SubPlan 3) = (SubPlan 2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan 5) < 55)
+ SubPlan 5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan 4) < 50)
+ SubPlan 4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan 1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan 3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0f38caa0d24..7701e57fad3 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -15,7 +15,6 @@ test: test_setup
# The first group of parallel tests
# ----------
test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
-
# ----------
# The second group of parallel tests
# multirangetypes depends on rangetypes
@@ -136,3 +135,6 @@ test: fast_default
# run tablespace test at the end because it drops the tablespace created during
# setup that other tests may use.
test: tablespace
+
+# this test is equivalent to the join_hash test, except that temporary file compression (pglz) is enabled
+test: join_hash_pglz
diff --git a/src/test/regress/sql/join_hash_lz4.sql b/src/test/regress/sql/join_hash_lz4.sql
new file mode 100644
index 00000000000..1d19c1980e1
--- /dev/null
+++ b/src/test/regress/sql/join_hash_lz4.sql
@@ -0,0 +1,626 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
diff --git a/src/test/regress/sql/join_hash_pglz.sql b/src/test/regress/sql/join_hash_pglz.sql
new file mode 100644
index 00000000000..2686afab272
--- /dev/null
+++ b/src/test/regress/sql/join_hash_pglz.sql
@@ -0,0 +1,626 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where no record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.39.5 (Apple Git-154)
[application/octet-stream] 0001-This-commit-adds-support-for-temporary-files-compres.patch (15.7K, 4-0001-This-commit-adds-support-for-temporary-files-compres.patch)
From 70f85b902463af7fec83bc9721b73c3eb53465f1 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Wed, 16 Apr 2025 12:03:03 +0200
Subject: [PATCH 1/2] This commit adds support for temporary file compression;
it can currently be used only for hash joins.
It also adds the GUC parameter temp_file_compression, which enables this functionality.
For now, it supports only the lz4 and pglz algorithms. Support for zstd
could be added in the future.
It uses a single working buffer for both compression and decompression to avoid
wasting memory. The buffer is allocated in the top memory context.
---
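A minimal sketch of the on-disk framing this patch relies on: each buffer is
stored as an int holding the compressed size, followed by the compressed
bytes, and the reader advances its file offset by the stored size rather than
by the decompressed length. Everything in the sketch is an assumption made for
illustration: toy_compress, toy_decompress, write_block and read_block are
hypothetical names, a toy run-length encoder stands in for lz4/pglz, and a
plain byte array stands in for the temporary file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy run-length encoder: emits (count, byte) pairs. Stand-in for lz4/pglz. */
static int toy_compress(const char *src, int srclen, char *dst, int dstcap)
{
    int out = 0;
    int i = 0;

    while (i < srclen)
    {
        int run = 1;

        while (i + run < srclen && run < 255 && src[i + run] == src[i])
            run++;
        if (out + 2 > dstcap)
            return -1;          /* destination too small */
        dst[out++] = (char) run;
        dst[out++] = src[i];
        i += run;
    }
    return out;
}

/* Inverse of toy_compress; returns decompressed length or -1 if corrupt. */
static int toy_decompress(const char *src, int srclen, char *dst, int dstcap)
{
    int out = 0;
    int i;

    if (srclen % 2 != 0)
        return -1;
    for (i = 0; i < srclen; i += 2)
    {
        int run = (unsigned char) src[i];

        if (run == 0 || out + run > dstcap)
            return -1;
        memset(dst + out, src[i + 1], (size_t) run);
        out += run;
    }
    return out;
}

/*
 * Append one framed block at 'offset': [int compressed_size][compressed bytes].
 * The header is written in native byte order, as a memcpy of an int.
 * Returns the number of bytes appended, or -1 on failure.
 */
static int write_block(char *file, int filecap, int offset,
                       const char *data, int len)
{
    int hdr = (int) sizeof(int);
    int csize;

    if (offset + hdr > filecap)
        return -1;
    csize = toy_compress(data, len, file + offset + hdr,
                         filecap - offset - hdr);
    if (csize < 0)
        return -1;
    memcpy(file + offset, &csize, sizeof(int));
    return hdr + csize;
}

/*
 * Read one framed block at *offset and advance *offset by the stored
 * (compressed) size. Returns the decompressed length, 0 at end of file,
 * or -1 if the block is corrupt.
 */
static int read_block(const char *file, int filelen, int *offset,
                      char *out, int outcap)
{
    int hdr = (int) sizeof(int);
    int csize;
    int n;

    if (*offset >= filelen)
        return 0;               /* EOF */
    if (*offset + hdr > filelen)
        return -1;
    memcpy(&csize, file + *offset, sizeof(int));
    if (csize < 0 || *offset + hdr + csize > filelen)
        return -1;
    n = toy_decompress(file + *offset + hdr, csize, out, outcap);
    if (n < 0)
        return -1;
    *offset += hdr + csize;
    return n;
}
```

Because each block's length is only known after reading its header, a reader
can walk the blocks sequentially from offset 0 but cannot jump to an arbitrary
logical position, which is consistent with the restriction to rewinding only.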
src/Makefile.global.in | 1 +
src/backend/access/gist/gistbuildbuffers.c | 2 +-
src/backend/backup/backup_manifest.c | 2 +-
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 217 +++++++++++++++++-
src/backend/utils/misc/guc_tables.c | 24 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/sort/logtape.c | 2 +-
src/backend/utils/sort/tuplestore.c | 2 +-
src/include/storage/buffile.h | 13 +-
10 files changed, 251 insertions(+), 15 deletions(-)
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 6722fbdf365..6ff67bda17c 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -201,6 +201,7 @@ with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
+with_lz4 = @with_lz4@
with_system_tzdata = @with_system_tzdata@
with_uuid = @with_uuid@
with_zlib = @with_zlib@
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 0707254d18e..9cc371f47fe 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -54,7 +54,7 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
* Create a temporary file to hold buffer pages that are swapped out of
* memory.
*/
- gfbb->pfile = BufFileCreateTemp(false);
+ gfbb->pfile = BufFileCreateTemp(false, false);
gfbb->nFileBlocks = 0;
/* Initialize free page management. */
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index 22e2be37c95..c9f7daa1497 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -65,7 +65,7 @@ InitializeBackupManifest(backup_manifest_info *manifest,
manifest->buffile = NULL;
else
{
- manifest->buffile = BufFileCreateTemp(false);
+ manifest->buffile = BufFileCreateTemp(false, false);
manifest->manifest_ctx = pg_cryptohash_create(PG_SHA256);
if (pg_cryptohash_init(manifest->manifest_ctx) < 0)
elog(ERROR, "failed to initialize checksum of backup manifest: %s",
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 5661ad76830..384265ca74a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false);
+ file = BufFileCreateCompressTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..103f6550322 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -53,6 +53,18 @@
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/resowner.h"
+#include "utils/memutils.h"
+
+#include "common/pg_lzcompress.h"
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+#define NO_LZ4_SUPPORT() \
+ ereport(ERROR, \
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \
+ errmsg("compression method lz4 not supported"), \
+ errdetail("This functionality requires the server to be built with lz4 support.")))
/*
* We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
@@ -62,6 +74,8 @@
#define MAX_PHYSICAL_FILESIZE 0x40000000
#define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ)
+int temp_file_compression = TEMP_NONE_COMPRESSION;
+
/*
* This data structure represents a buffered file that consists of one or
* more physical files (each accessed through a virtual file descriptor
@@ -95,7 +109,8 @@ struct BufFile
off_t curOffset; /* offset part of current pos */
int pos; /* next read/write position in buffer */
int nbytes; /* total # of valid bytes in buffer */
-
+ bool compress; /* true if this file's data is compressed */
+ char *cBuffer;
/*
* XXX Should ideally use PGIOAlignedBlock, but might need a way to avoid
* wasting per-file alignment padding when some users create many files.
@@ -127,6 +142,8 @@ makeBufFileCommon(int nfiles)
file->curOffset = 0;
file->pos = 0;
file->nbytes = 0;
+ file->compress = false;
+ file->cBuffer = NULL;
return file;
}
@@ -188,9 +205,17 @@ extendBufFile(BufFile *file)
* Note: if interXact is true, the caller had better be calling us in a
* memory context, and with a resource owner, that will survive across
* transaction boundaries.
+ *
+ * If compress is true, the temporary file's data is compressed before
+ * it is written to disk.
+ *
+ * Note: Compressed files do not support random access. Only hash joins
+ * use them for now. Any seek other than a seek to the beginning of the
+ * BufFile will corrupt the temporary data offsets.
+ *
*/
BufFile *
-BufFileCreateTemp(bool interXact)
+BufFileCreateTemp(bool interXact, bool compress)
{
BufFile *file;
File pfile;
@@ -212,9 +237,47 @@ BufFileCreateTemp(bool interXact)
file = makeBufFile(pfile);
file->isInterXact = interXact;
+ if (temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ file->compress = compress;
+ }
+
return file;
+
}
+/*
+ * Wrapper for BufFileCreateTemp.
+ * We want to limit the number of memory allocations for the compression buffer;
+ * a single buffer is enough for all compression operations.
+ */
+BufFile *
+BufFileCreateCompressTemp(bool interXact){
+ static char * buff = NULL;
+ BufFile *tmpBufFile = BufFileCreateTemp(interXact, true);
+ if (buff == NULL && temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ int size = 0;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ size = LZ4_compressBound(BLCKSZ)+sizeof(int);
+#endif
+ break;
+ case TEMP_PGLZ_COMPRESSION:
+ size = pglz_maximum_compressed_size(BLCKSZ, BLCKSZ)+sizeof(int);
+ break;
+ }
+ /*
+ * Persistent buffer for all temporary file compressions
+ */
+ buff = MemoryContextAlloc(TopMemoryContext, size);
+ }
+ tmpBufFile->cBuffer = buff;
+ return tmpBufFile;
+}
/*
* Build the name for a given segment of a given BufFile.
*/
@@ -275,6 +338,7 @@ BufFileCreateFileSet(FileSet *fileset, const char *name)
file->files[0] = MakeNewFileSetSegment(file, 0);
file->readOnly = false;
+
return file;
}
@@ -457,11 +521,75 @@ BufFileLoadBuffer(BufFile *file)
/*
* Read whatever we can get, up to a full bufferload.
*/
- file->nbytes = FileRead(thisfile,
+ if (!file->compress)
+ {
+
+ /*
+ * Read whatever we can get, up to a full bufferload.
+ */
+ file->nbytes = FileRead(thisfile,
file->buffer.data,
- sizeof(file->buffer.data),
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ /*
+ * Read and decompress data from the temporary file.
+ * The first read loads the size of the compressed block;
+ * the second read loads the compressed data itself.
+ */
+ } else {
+ int nread;
+ int nbytes;
+
+ nread = FileRead(thisfile,
+ &nbytes,
+ sizeof(nbytes),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ /* if not EOF let's continue */
+ if (nread > 0)
+ {
+ /* A long-lived buffer limits the number of memory allocations */
+ char * buff = file->cBuffer;
+
+ Assert(file->cBuffer != NULL);
+ /*
+ * Read the compressed data. Fewer bytes are read from disk than are
+ * returned to the caller, so curOffset cannot be derived from pos;
+ * it must be advanced here by the compressed size.
+ */
+ file->curOffset+=sizeof(nbytes);
+
+ nread = FileRead(thisfile,
+ buff,
+ nbytes,
file->curOffset,
WAIT_EVENT_BUFFILE_READ);
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data,nbytes,sizeof(file->buffer));
+#endif
+ break;
+
+ case TEMP_PGLZ_COMPRESSION:
+ file->nbytes = pglz_decompress(buff,nbytes,
+ file->buffer.data,sizeof(file->buffer),false);
+ break;
+ }
+ file->curOffset += nread;
+
+ if (file->nbytes < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compressed temporary file data is corrupt")));
+ }
+
+ }
+
if (file->nbytes < 0)
{
file->nbytes = 0;
@@ -494,9 +622,61 @@ static void
BufFileDumpBuffer(BufFile *file)
{
int wpos = 0;
- int bytestowrite;
+ int bytestowrite = 0;
File thisfile;
+
+ /* Save nbytes value because the size changes due to compression */
+ int nbytesOriginal = file->nbytes;
+
+ char * DataToWrite;
+ DataToWrite = file->buffer.data;
+
+ /*
+ * Prepare the compressed data to write. The size of the compressed
+ * block must be stored at the beginning, in front of the compressed
+ * data.
+ */
+
+
+ if (file->compress) {
+ char * cData;
+ int cSize = 0;
+
+ Assert(file->cBuffer != NULL);
+ cData = file->cBuffer;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+ {
+#ifdef USE_LZ4
+ int cBufferSize = LZ4_compressBound(file->nbytes);
+ /*
+ * Using stream compression would slightly improve the
+ * compression ratio.
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(int),file->nbytes, cBufferSize);
+#endif
+ break;
+ }
+ case TEMP_PGLZ_COMPRESSION:
+ cSize = pglz_compress(file->buffer.data,file->nbytes,
+ cData + sizeof(int),PGLZ_strategy_always);
+ break;
+ }
+
+
+ /* Write the size of the compressed block in front of the data.
+ * It determines how much data to read during
+ * decompression.
+ */
+ memcpy(cData,&cSize,sizeof(int));
+ file->nbytes=cSize + sizeof(int);
+ DataToWrite = cData;
+ }
+
/*
* Unlike BufFileLoadBuffer, we must dump the whole buffer even if it
* crosses a component-file boundary; so we need a loop.
@@ -535,7 +715,7 @@ BufFileDumpBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
- file->buffer.data + wpos,
+ DataToWrite + wpos,
bytestowrite,
file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
@@ -564,7 +744,19 @@ BufFileDumpBuffer(BufFile *file)
* logical file position, ie, original value + pos, in case that is less
* (as could happen due to a small backwards seek in a dirty buffer!)
*/
- file->curOffset -= (file->nbytes - file->pos);
+
+
+ if (!file->compress)
+ file->curOffset -= (file->nbytes - file->pos);
+ else
+ if (nbytesOriginal - file->pos != 0)
+ /* curOffset must also be corrected when compression is
+ * enabled; nbytes was changed by compression, so we
+ * have to use the original value of nbytes.
+ */
+ file->curOffset-=bytestowrite;
+
+
if (file->curOffset < 0) /* handle possible segment crossing */
{
file->curFile--;
@@ -577,6 +769,7 @@ BufFileDumpBuffer(BufFile *file)
*/
file->pos = 0;
file->nbytes = 0;
+
}
/*
@@ -602,8 +795,14 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
{
if (file->pos >= file->nbytes)
{
- /* Try to load more data into buffer. */
- file->curOffset += file->pos;
+ /* Try to load more data into buffer.
+ *
+ * curOffset is moved within BufFileLoadBuffer
+ * because stored data size differs from loaded/
+ * decompressed size
+ * */
+ if (!file->compress)
+ file->curOffset += file->pos;
file->pos = 0;
file->nbytes = 0;
BufFileLoadBuffer(file);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..ae052640ac0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -78,6 +78,7 @@
#include "replication/syncrep.h"
#include "storage/aio.h"
#include "storage/bufmgr.h"
+#include "storage/buffile.h"
#include "storage/bufpage.h"
#include "storage/copydir.h"
#include "storage/io_worker.h"
@@ -463,6 +464,18 @@ static const struct config_enum_entry default_toast_compression_options[] = {
#endif
{NULL, 0, false}
};
+/*
+ * zstd support should be added as a future enhancement
+ *
+ */
+static const struct config_enum_entry temp_file_compression_options[] = {
+ {"no", TEMP_NONE_COMPRESSION, false},
+ {"pglz", TEMP_PGLZ_COMPRESSION, false},
+#ifdef USE_LZ4
+ {"lz4", TEMP_LZ4_COMPRESSION, false},
+#endif
+ {NULL, 0, false}
+};
static const struct config_enum_entry wal_compression_options[] = {
{"pglz", WAL_COMPRESSION_PGLZ, false},
@@ -5058,6 +5071,17 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"temp_file_compression", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the compression method for temporary files."),
+ NULL
+ },
+ &temp_file_compression,
+ TEMP_NONE_COMPRESSION,
+ temp_file_compression_options,
+ NULL, NULL, NULL
+ },
+
{
{"default_transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the transaction isolation level of each new transaction."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34826d01380..77961a45d65 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -182,6 +182,7 @@
#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
# for NOTIFY / LISTEN queue
+#temp_file_compression = 'no' # enables temporary file compression
#file_copy_method = copy # the default is the first option
# copy
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index e529ceb8260..d862e22ef18 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -592,7 +592,7 @@ LogicalTapeSetCreate(bool preallocate, SharedFileSet *fileset, int worker)
lts->pfile = BufFileCreateFileSet(&fileset->fs, filename);
}
else
- lts->pfile = BufFileCreateTemp(false);
+ lts->pfile = BufFileCreateTemp(false, false);
return lts;
}
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..ef85924cd21 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -860,7 +860,7 @@ tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
*/
oldcxt = MemoryContextSwitchTo(state->context->parent);
- state->myfile = BufFileCreateTemp(state->interXact);
+ state->myfile = BufFileCreateTemp(state->interXact, false);
MemoryContextSwitchTo(oldcxt);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index a2f4821f240..931a211038b 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -32,11 +32,22 @@
typedef struct BufFile BufFile;
+typedef enum
+{
+ TEMP_NONE_COMPRESSION,
+ TEMP_PGLZ_COMPRESSION,
+ TEMP_LZ4_COMPRESSION
+} TempCompression;
+
+extern PGDLLIMPORT int temp_file_compression;
+
+
/*
* prototypes for functions in buffile.c
*/
-extern BufFile *BufFileCreateTemp(bool interXact);
+extern BufFile *BufFileCreateCompressTemp(bool interXact);
+extern BufFile *BufFileCreateTemp(bool interXact, bool compress);
extern void BufFileClose(BufFile *file);
pg_nodiscard extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
extern void BufFileReadExact(BufFile *file, void *ptr, size_t size);
--
2.39.5 (Apple Git-154)
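The block framing that BufFileDumpBuffer and BufFileLoadBuffer rely on in the patch above (a size prefix followed by the compressed bytes, with curOffset advanced by the on-disk size rather than the decompressed size) can be sketched on its own. This is only a minimal illustration under stated assumptions: the payload is copied verbatim as a stand-in for LZ4/pglz output, and the names sketch_frame_block and sketch_unframe_block are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): append one block to dst as
 * [int32 length][payload]. Returns bytes written, or -1 if it does not fit.
 * The payload stands in for already-compressed data.
 */
int
sketch_frame_block(const char *payload, int32_t len, char *dst, size_t dstcap)
{
	if (len < 0 || dstcap < sizeof(int32_t) ||
		dstcap - sizeof(int32_t) < (size_t) len)
		return -1;
	memcpy(dst, &len, sizeof(int32_t));
	memcpy(dst + sizeof(int32_t), payload, (size_t) len);
	return (int) (sizeof(int32_t) + (size_t) len);
}

/*
 * Hypothetical helper: read the block that starts at *offset. On success the
 * payload is copied to out, *offset moves past header and payload (the
 * physical position), and the payload length is returned. Returns -1, and
 * leaves *offset untouched, if the prefix is missing, negative, larger than
 * out, or runs past the end of src.
 */
int32_t
sketch_unframe_block(const char *src, size_t srclen, size_t *offset,
					 char *out, size_t outcap)
{
	int32_t		len;
	size_t		pos = *offset;

	if (pos > srclen || srclen - pos < sizeof(int32_t))
		return -1;
	memcpy(&len, src + pos, sizeof(int32_t));
	pos += sizeof(int32_t);

	/* Validate the prefix before trusting it as a read size. */
	if (len < 0 || (size_t) len > outcap || (size_t) len > srclen - pos)
		return -1;
	memcpy(out, src + pos, (size_t) len);
	*offset = pos + (size_t) len;
	return len;
}
```

The point of the sketch is the separation between the physical offset (header plus stored bytes) and the number of bytes handed back to the caller, plus a sanity check on the prefix before it is used as a read length.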
* Re: Proposal: Adding compression of temporary files
2025-01-04 23:43 Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-22 07:17 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-25 21:54 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
@ 2026-01-13 12:53 ` lakshmi <[email protected]>
2026-01-13 13:34 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: lakshmi @ 2026-01-13 12:53 UTC (permalink / raw)
To: Filip Janus <[email protected]>; +Cc: pgsql-hackers
Hi all,
I tried to reproduce the temporary file compression issue by applying the
two patches shared in this thread to current PostgreSQL master.
Here is what I observed:
1) Patch 1: 0001-Add-transparent-compression-for-temporary-files.patch
The first patch fails to apply due to context mismatches.
The failures are in the following files:
src/backend/storage/file/buffile.c
src/backend/utils/misc/guc_tables.c
src/backend/utils/misc/postgresql.conf.sample
2) Patch 2: 0002-Add-regression-tests-for-temporary-file-compression.patch
The second patch applies successfully without any issues.
Does this mean that the implementation patch needs to be rebased or otherwise
adjusted for the current codebase? If so, what would be the recommended
way to proceed? Could you please suggest how I should apply the
implementation patch in this case?
Regards,
lakshmi
On Tue, Jan 13, 2026 at 5:01 PM Filip Janus <[email protected]> wrote:
> Rebase after changes introduced in guc_tables.c
>
> -Filip-
>
>
> On Tue, Aug 19, 2025 at 5:48 PM Filip Janus <[email protected]> wrote:
>
>> Fix overlooked compiler warnings
>>
>> -Filip-
>>
>>
>> On Mon, Aug 18, 2025 at 6:51 PM Filip Janus <[email protected]> wrote:
>>
>>> I rebased the proposal and fixed the bug causing those test failures.
>>>
>>> -Filip-
>>>
>>>
>>> On Tue, Jun 17, 2025 at 4:49 PM Andres Freund <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> On 2025-04-25 23:54:00 +0200, Filip Janus wrote:
>>>> > The latest rebase.
>>>>
>>>> This often seems to fail during tests:
>>>> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5382
>>>>
>>>> E.g.
>>>>
>>>> https://api.cirrus-ci.com/v1/artifact/task/4667337632120832/testrun/build-32/testrun/recovery/027_st...
>>>>
>>>> === dumping
>>>> /tmp/cirrus-ci-build/build-32/testrun/recovery/027_stream_regress/data/regression.diffs
>>>> ===
>>>> diff -U3
>>>> /tmp/cirrus-ci-build/src/test/regress/expected/join_hash_pglz.out
>>>> /tmp/cirrus-ci-build/build-32/testrun/recovery/027_stream_regress/data/results/join_hash_pglz.out
>>>> --- /tmp/cirrus-ci-build/src/test/regress/expected/join_hash_pglz.out
>>>> 2025-05-26 05:04:40.686524215 +0000
>>>> +++
>>>> /tmp/cirrus-ci-build/build-32/testrun/recovery/027_stream_regress/data/results/join_hash_pglz.out
>>>> 2025-05-26 05:15:00.534907680 +0000
>>>> @@ -594,11 +594,8 @@
>>>> select count(*) from join_foo
>>>> left join (select b1.id, b1.t from join_bar b1 join join_bar b2
>>>> using (id)) ss
>>>> on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
>>>> - count
>>>> --------
>>>> - 3
>>>> -(1 row)
>>>> -
>>>> +ERROR: could not read from temporary file: read only 8180 of 1572860
>>>> bytes
>>>> +CONTEXT: parallel worker
>>>> select final > 1 as multibatch
>>>> from hash_join_batches(
>>>> $$
>>>> @@ -606,11 +603,7 @@
>>>> left join (select b1.id, b1.t from join_bar b1 join join_bar b2
>>>> using (id)) ss
>>>> on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
>>>> $$);
>>>> - multibatch
>>>> -------------
>>>> - t
>>>> -(1 row)
>>>> -
>>>> +ERROR: current transaction is aborted, commands ignored until end of
>>>> transaction block
>>>> rollback to settings;
>>>> -- single-batch with rescan, parallel-oblivious
>>>> savepoint settings;
>>>>
>>>>
>>>> Greetings,
>>>>
>>>> Andres
>>>>
>>>>
>>>>
* Re: Proposal: Adding compression of temporary files
2025-01-04 23:43 Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-22 07:17 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-25 21:54 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2026-01-13 12:53 ` Re: Proposal: Adding compression of temporary files lakshmi <[email protected]>
@ 2026-01-13 13:34 ` Filip Janus <[email protected]>
2026-01-18 15:50 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Filip Janus @ 2026-01-13 13:34 UTC (permalink / raw)
To: lakshmi <[email protected]>; +Cc: pgsql-hackers
Hi,
Yes, it needs to be rebased. I am working on it. I will post it here soon.
-Filip-
* Re: Proposal: Adding compression of temporary files
2025-01-04 23:43 Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-22 07:17 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-25 21:54 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2026-01-13 12:53 ` Re: Proposal: Adding compression of temporary files lakshmi <[email protected]>
2026-01-13 13:34 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
@ 2026-01-18 15:50 ` Filip Janus <[email protected]>
2026-01-19 22:40 ` Re: Proposal: Adding compression of temporary files Zsolt Parragi <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Filip Janus @ 2026-01-18 15:50 UTC (permalink / raw)
To: lakshmi <[email protected]>; Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
Hi,
Thank you, Tomas, for the thorough and detailed review!
I'm posting an updated patch set incorporating the changes from your review.
Changes applied from review:
- Simplified BufFileCreateTemp interface
- Improved error handling in BufFileLoadBuffer/BufFileDumpBuffer
- Unified compression header format (CompressHeader struct)
- Added tuplestore integration (compression when EXEC_FLAG_BACKWARD is not
required)
- Various code cleanups and comment improvements
Additional change (not from review):
- Switched from a static shared buffer to per-file allocation. The shared
buffer provided a negligible performance benefit while keeping memory
allocated for the backend's lifetime.
Future work:
- Support for additional compression methods (gzip, zstd)
- Random access and seek operations with compression
-Filip-
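To make the header and per-file buffer items above more concrete, here is a rough sketch of how a block header and the matching scratch buffer size could fit together. Everything in it is an assumption for illustration: the thread does not show the actual CompressHeader layout, so SketchCompressHeader and its two fields are hypothetical, RAW_BLOCK_SIZE assumes the default 8192-byte block, and sketch_lz4_bound mirrors the LZ4_COMPRESSBOUND formula from lz4.h for inputs below LZ4's maximum input size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one buffer holds a single default-sized (8192-byte) block. */
#define RAW_BLOCK_SIZE 8192

/* Hypothetical layout; the real CompressHeader in the patch may differ. */
typedef struct SketchCompressHeader
{
	int32_t		compressed_len; /* payload bytes that follow the header */
	int32_t		raw_len;		/* bytes expected after decompression */
} SketchCompressHeader;

/* Worst-case LZ4 output for n input bytes (same formula as LZ4_COMPRESSBOUND). */
size_t
sketch_lz4_bound(size_t n)
{
	return n + n / 255 + 16;
}

/* Per-file scratch buffer: room for the header plus a worst-case payload. */
size_t
sketch_scratch_size(void)
{
	return sizeof(SketchCompressHeader) + sketch_lz4_bound(RAW_BLOCK_SIZE);
}

/* Reject headers that cannot describe a valid block before reading a payload. */
int
sketch_header_is_sane(const SketchCompressHeader *h)
{
	return h->raw_len > 0 &&
		h->raw_len <= RAW_BLOCK_SIZE &&
		h->compressed_len > 0 &&
		(size_t) h->compressed_len <= sketch_lz4_bound(RAW_BLOCK_SIZE);
}
```

With these assumptions the scratch buffer costs a little over 8 kB per compressed file, and an implausible length in a header is rejected before any read of that size is attempted.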
Attachments:
[application/octet-stream] 0002-Add-regression-tests-for-temporary-file-compression.patch (127.6K, 3-0002-Add-regression-tests-for-temporary-file-compression.patch)
From 77d5f06e3b1ddc50024f26df18f166bd0e98d4b6 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 31 Jul 2025 14:02:45 +0200
Subject: [PATCH 2/2] Add regression tests for temporary file compression
This commit adds comprehensive regression tests for the transparent
temporary file compression feature.
Test coverage:
- join_hash_lz4.sql: Tests hash join operations with LZ4 compression
- join_hash_pglz.sql: Tests hash join operations with PGLZ compression
- Both tests verify compression works correctly for various hash join scenarios
- Expected output files for validation
Test integration:
- LZ4 tests are conditionally enabled when PostgreSQL is built with --with-lz4
- PGLZ tests are always enabled as PGLZ is built-in
- Tests added to parallel regression test schedule
- GNUmakefile updated to include conditional LZ4 test execution
The tests ensure that compression/decompression works transparently
without affecting query results, while providing coverage for both
supported compression algorithms.
---
src/Makefile.global.in | 1 +
src/test/regress/GNUmakefile | 4 +
src/test/regress/expected/join_hash_lz4.out | 1167 ++++++++++++++++++
src/test/regress/expected/join_hash_pglz.out | 1167 ++++++++++++++++++
src/test/regress/parallel_schedule | 4 +-
src/test/regress/sql/join_hash_lz4.sql | 627 ++++++++++
src/test/regress/sql/join_hash_pglz.sql | 627 ++++++++++
7 files changed, 3596 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/join_hash_lz4.out
create mode 100644 src/test/regress/expected/join_hash_pglz.out
create mode 100644 src/test/regress/sql/join_hash_lz4.sql
create mode 100644 src/test/regress/sql/join_hash_pglz.sql
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 371cd7eba2c..16109643fd3 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -201,6 +201,7 @@ with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
+with_lz4 = @with_lz4@
with_system_tzdata = @with_system_tzdata@
with_uuid = @with_uuid@
with_zlib = @with_zlib@
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index a8ba19e5971..5c4e25bc07c 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -94,6 +94,10 @@ installdirs-tests: installdirs
REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 \
$(EXTRA_REGRESS_OPTS)
+ifeq ($(with_lz4),yes)
+override EXTRA_TESTS := $(EXTRA_TESTS) join_hash_lz4
+endif
+
check: all
$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/expected/join_hash_lz4.out b/src/test/regress/expected/join_hash_lz4.out
new file mode 100644
index 00000000000..78367bf3719
--- /dev/null
+++ b/src/test/regress/expected/join_hash_lz4.out
@@ -0,0 +1,1167 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan expr_1)) AND ((SubPlan expr_2) = (SubPlan expr_3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan expr_1) = hjtest_1.id) AND ((SubPlan expr_3) = (SubPlan expr_2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/expected/join_hash_pglz.out b/src/test/regress/expected/join_hash_pglz.out
new file mode 100644
index 00000000000..c6ae5e255d2
--- /dev/null
+++ b/src/test/regress/expected/join_hash_pglz.out
@@ -0,0 +1,1167 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan expr_1)) AND ((SubPlan expr_2) = (SubPlan expr_3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan expr_1) = hjtest_1.id) AND ((SubPlan expr_3) = (SubPlan expr_2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 021d57f66bb..eeb06407d11 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -15,7 +15,6 @@ test: test_setup
# The first group of parallel tests
# ----------
test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
-
# ----------
# The second group of parallel tests
# multirangetypes depends on rangetypes
@@ -140,3 +139,6 @@ test: fast_default
# run tablespace test at the end because it drops the tablespace created during
# setup that other tests may use.
test: tablespace
+
+# this test is equivalent to the join_hash test, but with temporary file compression enabled
+test: join_hash_pglz
diff --git a/src/test/regress/sql/join_hash_lz4.sql b/src/test/regress/sql/join_hash_lz4.sql
new file mode 100644
index 00000000000..7f79242c306
--- /dev/null
+++ b/src/test/regress/sql/join_hash_lz4.sql
@@ -0,0 +1,627 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase the number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where no record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
diff --git a/src/test/regress/sql/join_hash_pglz.sql b/src/test/regress/sql/join_hash_pglz.sql
new file mode 100644
index 00000000000..a7529ec3400
--- /dev/null
+++ b/src/test/regress/sql/join_hash_pglz.sql
@@ -0,0 +1,627 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase the number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.51.1
[application/octet-stream] 0001-Add-transparent-compression-for-temporary-files.patch (18.3K, 4-0001-Add-transparent-compression-for-temporary-files.patch)
From 05e5f5ad0e909cb031e0520de2f636b52e3280be Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 31 Jul 2025 14:02:16 +0200
Subject: [PATCH 1/2] Add transparent compression for temporary files
This commit implements transparent compression for temporary files in PostgreSQL,
specifically designed for hash join operations that spill to disk.
Features:
- Support for LZ4 and PGLZ compression algorithms
- GUC parameter 'temp_file_compression' to control compression
- Transparent compression/decompression in BufFile layer
- Shared compression buffer to minimize memory allocation
- Integration: Hash join uses BufFileCreateCompressTemp(), tuplestore
uses compression when EXEC_FLAG_BACKWARD is not required
Compression is applied automatically when temp_file_compression is enabled,
for temporary files created with BufFileCreateCompressTemp(). Currently only
hash joins and tuplestores without backward scans do so, with seeking limited
to rewinding to the start.
Configuration options:
- temp_file_compression = 'no' (default)
- temp_file_compression = 'pglz'
- temp_file_compression = 'lz4' (requires --with-lz4)
Fix GUC tables structure for compression support
---
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 352 +++++++++++++++++-
src/backend/utils/misc/guc_parameters.dat | 7 +
src/backend/utils/misc/guc_tables.c | 13 +
src/backend/utils/misc/postgresql.conf.sample | 5 +-
src/backend/utils/sort/tuplestore.c | 8 +-
src/include/storage/buffile.h | 10 +
7 files changed, 374 insertions(+), 23 deletions(-)
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 114620a8137..c2f1b2658e1 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false);
+ file = BufFileCreateCompressTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ddf3a410d6f..215074a3340 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -53,6 +53,17 @@
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/resowner.h"
+#include "utils/memutils.h"
+#include "common/pg_lzcompress.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
+/* Compression types */
+#define TEMP_NONE_COMPRESSION 0
+#define TEMP_PGLZ_COMPRESSION 1
+#define TEMP_LZ4_COMPRESSION 2
/*
* We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
@@ -62,6 +73,11 @@
#define MAX_PHYSICAL_FILESIZE 0x40000000
#define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ)
+/*
+ * Optional transparent compression of temporary files. Disaled by default.
+ */
+int temp_file_compression = TEMP_NONE_COMPRESSION;
+
/*
* This data structure represents a buffered file that consists of one or
* more physical files (each accessed through a virtual file descriptor
@@ -101,8 +117,29 @@ struct BufFile
* wasting per-file alignment padding when some users create many files.
*/
PGAlignedBlock buffer;
+
+ bool compress_tempfile; /* transparent compression mode */
+ int compress; /* enabled compression for the file */
+ char *cBuffer; /* compression buffer */
};
+/*
+ * Header written right before each chunk of data with compression enabled.
+ * The 'len' is the length of the data buffer written right after the header,
+ * and 'raw_len' is the length of uncompressed data. If the data ends up not
+ * being compressed (e.g. when pglz does not reach the compression ratio),
+ * the raw_len is set to -1 and the len is the raw (uncompressed) length.
+ *
+ * To make things simpler, we write these headers even for mathods that do
+ * not fail (or rather when they fail, it's a proper error). The space for
+ * an extra integer seems negligible.
+ */
+typedef struct CompressHeader
+{
+ int len; /* data length (compressed, excluding header) */
+ int raw_len; /* raw length (-1: not compressed) */
+} CompressHeader;
+
static BufFile *makeBufFileCommon(int nfiles);
static BufFile *makeBufFile(File firstfile);
static void extendBufFile(BufFile *file);
@@ -127,6 +164,9 @@ makeBufFileCommon(int nfiles)
file->curOffset = 0;
file->pos = 0;
file->nbytes = 0;
+ file->compress_tempfile = false;
+ file->compress = TEMP_NONE_COMPRESSION;
+ file->cBuffer = NULL;
return file;
}
@@ -215,6 +255,57 @@ BufFileCreateTemp(bool interXact)
return file;
}
+/*
+ * BufFileCreateCompressTemp
+ * Create a temporary file with transparent compression.
+ *
+ * The temporary files will use compression, depending on the current value of
+ * temp_file_compression GUC.
+ *
+ * Note: Compressed files do not support random access. A seek operation other
+ * than seek to the beginning of the buffile will corrupt data.
+ *
+ * Note: The compression algorithm is determined by temp_file_compression GUC.
+ * If set to "no" (TEMP_NONE_COMPRESSION), the file is not compressed.
+ *
+ */
+BufFile *
+BufFileCreateCompressTemp(bool interXact)
+{
+ BufFile *file = BufFileCreateTemp(interXact);
+
+ if (temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ int size = 0;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ size = LZ4_compressBound(BLCKSZ) + sizeof(CompressHeader);
+#endif
+ break;
+ case TEMP_PGLZ_COMPRESSION:
+ size = pglz_maximum_compressed_size(BLCKSZ, BLCKSZ) + sizeof(CompressHeader);
+ break;
+ default:
+ break;
+ }
+
+ if (size > 0)
+ {
+ file->compress = temp_file_compression;
+ file->cBuffer = palloc(size);
+ }
+ }
+
+ /* compression with buffer, or no compression and no buffer */
+ Assert((!file->compress && file->cBuffer == NULL) ||
+ (file->compress && file->cBuffer != NULL));
+
+ return file;
+}
+
/*
* Build the name for a given segment of a given BufFile.
*/
@@ -415,9 +506,15 @@ BufFileClose(BufFile *file)
/* flush any unwritten data */
BufFileFlush(file);
+
/* close and delete the underlying file(s) */
for (i = 0; i < file->numFiles; i++)
FileClose(file->files[i]);
+
+ /* release compression buffer if allocated */
+ if (file->cBuffer)
+ pfree(file->cBuffer);
+
/* release the buffer space */
pfree(file->files);
pfree(file);
@@ -454,21 +551,137 @@ BufFileLoadBuffer(BufFile *file)
else
INSTR_TIME_SET_ZERO(io_start);
- /*
- * Read whatever we can get, up to a full bufferload.
- */
- file->nbytes = FileRead(thisfile,
- file->buffer.data,
- sizeof(file->buffer.data),
- file->curOffset,
- WAIT_EVENT_BUFFILE_READ);
- if (file->nbytes < 0)
+ if (file->compress == TEMP_NONE_COMPRESSION)
{
- file->nbytes = 0;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not read file \"%s\": %m",
- FilePathName(thisfile))));
+ /*
+ * Read whatever we can get, up to a full bufferload.
+ */
+ file->nbytes = FileRead(thisfile,
+ file->buffer.data,
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (file->nbytes < 0)
+ {
+ file->nbytes = 0;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+ }
+ else
+ {
+ /*
+ * Read and decompress data from a temporary file. We first read the
+ * header with compressed/raw lengths, and then the compressed data.
+ */
+ int nread;
+ CompressHeader header;
+
+ nread = FileRead(thisfile,
+ &header,
+ sizeof(header),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+
+ /* did we read the length of the next buffer? */
+ if (nread == 0)
+ {
+ /* eof, nothing to do */
+ }
+ else if (nread != sizeof(header))
+ {
+ /* unexpected number of bytes, also covers (nread < 0) */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+ else
+ {
+ /* read length of compressed data, read (and decompress) data */
+ char *buff = file->cBuffer;
+
+ Assert(file->cBuffer != NULL);
+
+ /* advance past the length field */
+ file->curOffset += sizeof(header);
+
+ /*
+ * raw_len==-1 means the data was not compressed after all, which
+ * can happen e.g. for non-compressible data with pglz. In that
+ * case just copy the data in place. Otherwise do the
+ * decompression.
+ *
+ * XXX Maybe we should just do the FileRead first, and then either
+ * decompress or memcpy() for raw_len=-1. That'd be an extra
+ * memcpy, but it'd make the code simpler (this way we do the
+ * error checks twice, for each branch).
+ */
+ if (header.raw_len == -1)
+ {
+ nread = FileRead(thisfile,
+ file->buffer.data,
+ header.len,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (nread != header.len)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+
+ file->nbytes = nread;
+ file->curOffset += nread;
+ }
+ else
+ {
+ /*
+ * Read compressed data into the separate buffer, and then
+ * decompress into the target file buffer.
+ */
+ nread = FileRead(thisfile,
+ buff,
+ header.len,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (nread != header.len)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+
+ switch (file->compress)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data, header.len,
+ sizeof(file->buffer));
+#endif
+ break;
+
+ case TEMP_PGLZ_COMPRESSION:
+ file->nbytes = pglz_decompress(buff, header.len,
+ file->buffer.data, header.raw_len, false);
+ break;
+ }
+ file->curOffset += nread;
+
+ if (file->nbytes < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compressed data is corrupt")));
+
+ /* should have got the expected length */
+ Assert(file->nbytes == header.raw_len);
+ }
+ }
}
if (track_io_timing)
@@ -494,8 +707,93 @@ static void
BufFileDumpBuffer(BufFile *file)
{
int64 wpos = 0;
- int64 bytestowrite;
+ int64 bytestowrite = 0;
File thisfile;
+ char *DataToWrite = file->buffer.data;
+ int nbytesOriginal = file->nbytes;
+
+ /*
+ * Compress the data if requested for this temporary file (and if enabled
+ * by the temp_file_compression GUC).
+ *
+ * The compressed data is written to the one shared compression buffer.
+ * There's only a single compression operation at any given time, so one
+ * buffer is enough.
+ *
+ * Then we simply point the "DataToWrite" buffer at the compressed buffer.
+ */
+ if (file->compress != TEMP_NONE_COMPRESSION)
+ {
+ char *cData;
+ int cSize = 0;
+ CompressHeader header;
+
+ Assert(file->cBuffer != NULL);
+ cData = file->cBuffer;
+
+ /* initialize the header for compression */
+ header.len = -1;
+ header.raw_len = nbytesOriginal;
+
+ switch (file->compress)
+ {
+ case TEMP_LZ4_COMPRESSION:
+ {
+#ifdef USE_LZ4
+ int cBufferSize = LZ4_compressBound(file->nbytes);
+
+ /*
+ * XXX We might use lz4 stream compression here. Depending
+ * on the data, that might improve the compression ratio.
+ * The length is stored at the beginning, we'll fill it in
+ * at the end.
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(CompressHeader),
+ file->nbytes, cBufferSize);
+ if (cSize == 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compression failed, compressed size %d, original size %d",
+ cSize, nbytesOriginal)));
+ }
+#endif
+ break;
+ }
+ case TEMP_PGLZ_COMPRESSION:
+ cSize = pglz_compress(file->buffer.data, file->nbytes,
+ cData + sizeof(CompressHeader),
+ PGLZ_strategy_always);
+
+ /*
+ * pglz returns -1 for non-compressible data. In that case
+ * just copy the raw data into the output buffer.
+ */
+ if (cSize == -1)
+ {
+ memcpy(cData + sizeof(CompressHeader), file->buffer.data,
+ header.raw_len);
+
+ cSize = header.raw_len;
+ header.raw_len = -1;
+ }
+ break;
+ }
+
+ Assert(cSize != -1);
+ header.len = cSize;
+
+ /*
+ * Write the header with compressed length at the beginning of the
+ * buffer. We store both the compressed and raw lengths, and use
+ * raw_len=-1 when the data was not compressed after all.
+ */
+ memcpy(cData, &header, sizeof(CompressHeader));
+ file->nbytes = header.len + sizeof(CompressHeader);
+
+ DataToWrite = cData;
+ }
/*
* Unlike BufFileLoadBuffer, we must dump the whole buffer even if it
@@ -535,7 +833,7 @@ BufFileDumpBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
- file->buffer.data + wpos,
+ DataToWrite + wpos,
bytestowrite,
file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
@@ -564,7 +862,17 @@ BufFileDumpBuffer(BufFile *file)
* logical file position, ie, original value + pos, in case that is less
* (as could happen due to a small backwards seek in a dirty buffer!)
*/
- file->curOffset -= (file->nbytes - file->pos);
+ if (!file->compress)
+ file->curOffset -= (file->nbytes - file->pos);
+ else if (nbytesOriginal - file->pos != 0)
+ {
+ /*
+ * curOffset must be corrected also if compression is enabled, nbytes
+ * was changed by compression but we have to use the original value of
+ * nbytes
+ */
+ file->curOffset -= bytestowrite;
+ }
if (file->curOffset < 0) /* handle possible segment crossing */
{
file->curFile--;
@@ -602,8 +910,14 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
{
if (file->pos >= file->nbytes)
{
- /* Try to load more data into buffer. */
- file->curOffset += file->pos;
+ /*
+ * Try to load more data into buffer.
+ *
+ * curOffset is moved within BufFileLoadBuffer because stored data
+ * size differs from loaded/ decompressed size
+ */
+ if (!file->compress)
+ file->curOffset += file->pos;
file->pos = 0;
file->nbytes = 0;
BufFileLoadBuffer(file);
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7c60b125564..4a51c4bb132 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2933,6 +2933,13 @@
check_hook => 'check_temp_buffers',
},
+{ name => 'temp_file_compression', type => 'enum', context => 'PGC_USERSET', group => 'CLIENT_CONN_STATEMENT',
+ short_desc => 'Sets the default compression method for temporary files.',
+ variable => 'temp_file_compression',
+ boot_val => 'TEMP_NONE_COMPRESSION',
+ options => 'temp_file_compression_options',
+},
+
{ name => 'temp_file_limit', type => 'int', context => 'PGC_SUSET', group => 'RESOURCES_DISK',
short_desc => 'Limits the total size of all temporary files used by each process.',
long_desc => '-1 means no limit.',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 73ff6ad0a32..2887d8c9825 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -78,6 +78,7 @@
#include "replication/syncrep.h"
#include "storage/aio.h"
#include "storage/bufmgr.h"
+#include "storage/buffile.h"
#include "storage/bufpage.h"
#include "storage/copydir.h"
#include "storage/io_worker.h"
@@ -464,6 +465,18 @@ static const struct config_enum_entry default_toast_compression_options[] = {
{NULL, 0, false}
};
+/*
+ * pglz and zstd support should be added as future enhancement
+ */
+static const struct config_enum_entry temp_file_compression_options[] = {
+ {"no", TEMP_NONE_COMPRESSION, false},
+ {"pglz", TEMP_PGLZ_COMPRESSION, false},
+#ifdef USE_LZ4
+ {"lz4", TEMP_LZ4_COMPRESSION, false},
+#endif
+ {NULL, 0, false}
+};
+
static const struct config_enum_entry wal_compression_options[] = {
{"pglz", WAL_COMPRESSION_PGLZ, false},
#ifdef USE_LZ4
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..9f0fc996b6e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -180,8 +180,9 @@
#file_copy_method = copy # copy, clone (if supported by OS)
-#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
- # for NOTIFY / LISTEN queue
+#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
+ # for NOTIFY / LISTEN queue
+#temp_file_compression = 'no' # enables temporary files compression
# - Kernel Resources -
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index e57fa5bca64..76695e4c9e4 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -860,7 +860,13 @@ tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
*/
oldcxt = MemoryContextSwitchTo(state->context->parent);
- state->myfile = BufFileCreateTemp(state->interXact);
+ /*
+ * If requested random access, can't compress the temp file.
+ */
+ if ((state->eflags & EXEC_FLAG_BACKWARD) != 0)
+ state->myfile = BufFileCreateTemp(state->interXact);
+ else
+ state->myfile = BufFileCreateCompressTemp(state->interXact);
MemoryContextSwitchTo(oldcxt);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 6754f37836c..2e52ca74bde 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -32,11 +32,21 @@
typedef struct BufFile BufFile;
+typedef enum
+{
+ TEMP_NONE_COMPRESSION,
+ TEMP_PGLZ_COMPRESSION,
+ TEMP_LZ4_COMPRESSION
+} TempCompression;
+
+extern PGDLLIMPORT int temp_file_compression;
+
/*
* prototypes for functions in buffile.c
*/
extern BufFile *BufFileCreateTemp(bool interXact);
+extern BufFile *BufFileCreateCompressTemp(bool interXact);
extern void BufFileClose(BufFile *file);
pg_nodiscard extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
extern void BufFileReadExact(BufFile *file, void *ptr, size_t size);
--
2.51.1
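The chunk framing described in the CompressHeader comment of the patch above can be illustrated with a minimal standalone sketch. It is not the patch's code: SketchHeader, toy_compress, toy_decompress, frame_chunk and unframe_chunk are hypothetical names, and the toy compressor (which only handles a run of one repeated byte) stands in for pglz or lz4. What it shows is the convention that raw_len is set to -1 and len holds the raw length when a chunk could not be compressed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the patch's CompressHeader. */
typedef struct SketchHeader
{
    int len;      /* bytes stored after the header */
    int raw_len;  /* uncompressed length, or -1 if stored raw */
} SketchHeader;

/*
 * Toy compressor (an assumption for illustration only): compresses a run
 * of one repeated byte into 2 bytes, and returns -1 for anything else.
 */
static int
toy_compress(const char *src, int n, char *dst)
{
    if (n <= 2 || n > 255)
        return -1;
    for (int i = 1; i < n; i++)
        if (src[i] != src[0])
            return -1;
    dst[0] = src[0];
    dst[1] = (char) (unsigned char) n;
    return 2;
}

static int
toy_decompress(const char *src, int n, char *dst, int raw_len)
{
    if (n != 2 || (unsigned char) src[1] != raw_len)
        return -1;              /* corrupt chunk */
    memset(dst, src[0], raw_len);
    return raw_len;
}

/*
 * Write header + payload into out, which must hold at least
 * sizeof(SketchHeader) + raw_len bytes. Returns the bytes written.
 */
int
frame_chunk(const char *raw, int raw_len, char *out)
{
    SketchHeader h;
    int csize = toy_compress(raw, raw_len, out + sizeof(h));

    h.raw_len = raw_len;
    if (csize < 0)
    {
        /* not compressible: store the raw bytes and flag it */
        memcpy(out + sizeof(h), raw, raw_len);
        csize = raw_len;
        h.raw_len = -1;
    }
    h.len = csize;
    memcpy(out, &h, sizeof(h));
    return (int) sizeof(h) + csize;
}

/* Read one chunk back; returns raw bytes placed in dst, or -1 if corrupt. */
int
unframe_chunk(const char *in, char *dst)
{
    SketchHeader h;

    memcpy(&h, in, sizeof(h));
    if (h.raw_len == -1)
    {
        memcpy(dst, in + sizeof(h), h.len);
        return h.len;
    }
    return toy_decompress(in + sizeof(h), h.len, dst, h.raw_len);
}
```

Because each chunk's stored size differs from its logical size, a reader can only find chunk boundaries by walking the headers from the start, which is why the patch restricts seeking to a rewind.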
* Re: Proposal: Adding compression of temporary files
From: Zsolt Parragi @ 2026-01-19 22:40 UTC (permalink / raw)
To: Filip Janus <[email protected]>; +Cc: lakshmi <[email protected]>; Tomas Vondra <[email protected]>; pgsql-hackers
Hello!
I tried to review the code. It compiled, the test suite passed.
I noticed two typos:
buffile.c:77 - "Disaled"
buffile.c:133 - "mathods"
And a few other small findings:
buffile.h:35 and buffile.c:63 - same constants defined first as an
Enum and then as #defines - code builds properly without the defines.
buffile.c:121 - compress_tempfile is defined, set to false at :167,
but never used otherwise
guc_tables.c:470 - the comment says that pglz isn't supported yet, but
we have a value for it, and I see support for it in the code
buffile.c:659: (and at other places) if USE_LZ4 is undefined, the
codepath doesn't do anything. I think these ifdefs should follow how
other compression code works, such as wal compression where there's an
#else path with elog(ERROR, ...)
Similarly, maybe there should be an explicit TEMP_NONE_COMPRESSION
branch that does nothing, and the default branch should be an error?
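A minimal sketch of the switch shape suggested in the two points above, written as a standalone program rather than backend code. The names sketch_compress_buffer_size, SKETCH_NONE, SKETCH_PGLZ, SKETCH_LZ4 and SKETCH_BLCKSZ are hypothetical, the pglz bound is a placeholder, and a -1 return with a message on stderr stands in for elog(ERROR, ...). Only LZ4_compressBound() is a real library call, and it is compiled only when USE_LZ4 is defined.

```c
#include <assert.h>
#include <stdio.h>

#ifdef USE_LZ4
#include <lz4.h>
#endif

/* Hypothetical stand-ins for the patch's constants. */
enum
{
    SKETCH_NONE = 0,
    SKETCH_PGLZ = 1,
    SKETCH_LZ4 = 2
};

#define SKETCH_BLCKSZ 8192

/*
 * Compute the compression buffer size for a method. Returns 0 on success
 * and -1 when the method is unknown or not supported by this build.
 */
int
sketch_compress_buffer_size(int method, int *size)
{
    assert(size != NULL);

    switch (method)
    {
        case SKETCH_NONE:
            /* explicit no-op branch: no buffer needed */
            *size = 0;
            return 0;

        case SKETCH_PGLZ:
            /* placeholder bound; the patch uses pglz_maximum_compressed_size() */
            *size = SKETCH_BLCKSZ + 4;
            return 0;

        case SKETCH_LZ4:
#ifdef USE_LZ4
            *size = LZ4_compressBound(SKETCH_BLCKSZ);
            return 0;
#else
            /* stand-in for elog(ERROR, ...) in the backend */
            fprintf(stderr, "lz4 compression is not supported by this build\n");
            return -1;
#endif

        default:
            /* unknown value is an error rather than a silent no-op */
            fprintf(stderr, "unrecognized compression method: %d\n", method);
            return -1;
    }
}
```

With this shape, a build without lz4 fails loudly if the lz4 value is ever reached, instead of leaving the size at zero and continuing.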
buffile.c:265: If seek isn't supported/limited, shouldn't there be at
least an assertion about it in BufFileSeek? And tell isn't mentioned,
but it seems to me that tell also doesn't work properly.
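A rough sketch of what such a guard could look like, assuming a simplified stand-in for BufFile. SketchFile and sketch_seek are hypothetical names, the function follows the 0 / EOF return convention of BufFileSeek, and only SEEK_SET is modelled. A backend version might use Assert() or elog(ERROR, ...) instead of returning EOF. The tell side is not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the patch's BufFile. */
typedef struct SketchFile
{
    bool compressed;    /* true when chunks are stored compressed */
    int  cur_file;      /* current segment number */
    long cur_offset;    /* offset within the segment */
} SketchFile;

/*
 * Returns 0 on success and EOF on failure. For a compressed file only a
 * rewind to (segment 0, offset 0) is accepted; the position is left
 * untouched when the request is rejected.
 */
int
sketch_seek(SketchFile *file, int segno, long offset, int whence)
{
    assert(file != NULL);

    if (file->compressed &&
        !(whence == SEEK_SET && segno == 0 && offset == 0))
        return EOF;             /* random access would misread chunk headers */

    if (whence != SEEK_SET)
        return EOF;             /* other modes omitted from this sketch */

    file->cur_file = segno;
    file->cur_offset = offset;
    return 0;
}
```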
* Re: Proposal: Adding compression of temporary files
From: lakshmi @ 2026-01-20 10:51 UTC (permalink / raw)
To: Filip Janus <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers; [email protected]
Hi Filip,
I tested both patches on current master using git am -3. They apply
cleanly, build fine, and the temp_file_compression GUC works as expected.
Query results are unchanged.
For the hash join spill test, temp files were created as expected, but the logged
sizes were the same for no, lz4, and pglz, which seems consistent with fixed-size
fileset chunking. It might be helpful to briefly note this in the
documentation to avoid confusion.
Thanks for working on this.
Best regards,
lakshmi
* Re: Proposal: Adding compression of temporary files
From: lakshmi @ 2026-01-21 06:32 UTC (permalink / raw)
To: pgsql-hackers; +Cc: Filip Janus <[email protected]>; [email protected]
Hi all,
While testing the temp file compression patch, I noticed that the new
temp_file_compression GUC isn't documented yet. I put together a small docs
patch to add a short description and clarify that the effect of compression
depends on the workload (for example, hash join spills may not show a visible
size reduction due to fixed-size chunks).
The patch is attached. Happy to adjust the wording if needed.
Thanks,
lakshmi
Attachments:
[text/x-patch] doc-temp-file-compression-doc.patch (1.2K, 3-doc-temp-file-compression-doc.patch)
download | inline diff:
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0fad34da6eb..57a8af2a2fc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1955,6 +1955,26 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-temp-file-compression">
+ <term><varname>temp_file_compression</varname> (<type>enum</type>)</term>
+ <listitem>
+ <para>
+ Enables transparent compression of temporary files used by query execution.
+ Supported values are <literal>no</literal>, <literal>lz4</literal>, and
+ <literal>pglz</literal>.
+ </para>
+
+ <para>
+ The effectiveness of temporary file compression depends on the workload.
+ For example, temporary files created by hash join spills use fixed-size
+ chunks, so on-disk file sizes may not visibly shrink even when compression
+ is enabled. Statistics such as <literal>temp_bytes</literal> report logical
+ bytes written before compression.
+ </para>
+ </listitem>
+ </varlistentry>
+
+
<varlistentry id="guc-hash-mem-multiplier" xreflabel="hash_mem_multiplier">
<term><varname>hash_mem_multiplier</varname> (<type>floating point</type>)
<indexterm>
* Re: Proposal: Adding compression of temporary files
From: Filip Janus @ 2026-01-23 16:40 UTC (permalink / raw)
To: lakshmi <[email protected]>; +Cc: pgsql-hackers; [email protected]
Hi all,
Thanks for the feedback and the provided patch.
I've addressed your findings and proposals. Lakshmi's documentation patch
was incorporated.
-Filip-
Attachments:
[application/octet-stream] 0002-Add-regression-tests-for-temporary-file-compression.patch (127.6K, 3-0002-Add-regression-tests-for-temporary-file-compression.patch)
download | inline diff:
From a5c6b7b0f977ade9884d7350deb9fdf93df312ae Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 31 Jul 2025 14:02:45 +0200
Subject: [PATCH 2/2] Add regression tests for temporary file compression
This commit adds comprehensive regression tests for the transparent
temporary file compression feature.
Test coverage:
- join_hash_lz4.sql: Tests hash join operations with LZ4 compression
- join_hash_pglz.sql: Tests hash join operations with PGLZ compression
- Both tests verify compression works correctly for various hash join scenarios
- Expected output files for validation
Test integration:
- LZ4 tests are conditionally enabled when PostgreSQL is built with --with-lz4
- PGLZ tests are always enabled as PGLZ is built-in
- Tests added to parallel regression test schedule
- GNUmakefile updated to include conditional LZ4 test execution
The tests ensure that compression/decompression works transparently
without affecting query results, while providing coverage for both
supported compression algorithms.
---
src/Makefile.global.in | 1 +
src/test/regress/GNUmakefile | 4 +
src/test/regress/expected/join_hash_lz4.out | 1167 ++++++++++++++++++
src/test/regress/expected/join_hash_pglz.out | 1167 ++++++++++++++++++
src/test/regress/parallel_schedule | 4 +-
src/test/regress/sql/join_hash_lz4.sql | 627 ++++++++++
src/test/regress/sql/join_hash_pglz.sql | 627 ++++++++++
7 files changed, 3596 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/join_hash_lz4.out
create mode 100644 src/test/regress/expected/join_hash_pglz.out
create mode 100644 src/test/regress/sql/join_hash_lz4.sql
create mode 100644 src/test/regress/sql/join_hash_pglz.sql
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 947a2d79e29..87c950cca82 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -201,6 +201,7 @@ with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
+with_lz4 = @with_lz4@
with_system_tzdata = @with_system_tzdata@
with_uuid = @with_uuid@
with_zlib = @with_zlib@
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index a8ba19e5971..5c4e25bc07c 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -94,6 +94,10 @@ installdirs-tests: installdirs
REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 \
$(EXTRA_REGRESS_OPTS)
+ifeq ($(with_lz4),yes)
+override EXTRA_TESTS := $(EXTRA_TESTS) join_hash_lz4
+endif
+
check: all
$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/expected/join_hash_lz4.out b/src/test/regress/expected/join_hash_lz4.out
new file mode 100644
index 00000000000..78367bf3719
--- /dev/null
+++ b/src/test/regress/expected/join_hash_lz4.out
@@ -0,0 +1,1167 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan expr_1)) AND ((SubPlan expr_2) = (SubPlan expr_3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan expr_1) = hjtest_1.id) AND ((SubPlan expr_3) = (SubPlan expr_2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/expected/join_hash_pglz.out b/src/test/regress/expected/join_hash_pglz.out
new file mode 100644
index 00000000000..c6ae5e255d2
--- /dev/null
+++ b/src/test/regress/expected/join_hash_pglz.out
@@ -0,0 +1,1167 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan expr_1)) AND ((SubPlan expr_2) = (SubPlan expr_3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan expr_1) = hjtest_1.id) AND ((SubPlan expr_3) = (SubPlan expr_2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 021d57f66bb..eeb06407d11 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -15,7 +15,6 @@ test: test_setup
# The first group of parallel tests
# ----------
test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
-
# ----------
# The second group of parallel tests
# multirangetypes depends on rangetypes
@@ -140,3 +139,6 @@ test: fast_default
# run tablespace test at the end because it drops the tablespace created during
# setup that other tests may use.
test: tablespace
+
+# this test is equivalent to the join_hash test, but with temporary file compression enabled
+test: join_hash_pglz
diff --git a/src/test/regress/sql/join_hash_lz4.sql b/src/test/regress/sql/join_hash_lz4.sql
new file mode 100644
index 00000000000..7f79242c306
--- /dev/null
+++ b/src/test/regress/sql/join_hash_lz4.sql
@@ -0,0 +1,627 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
diff --git a/src/test/regress/sql/join_hash_pglz.sql b/src/test/regress/sql/join_hash_pglz.sql
new file mode 100644
index 00000000000..a7529ec3400
--- /dev/null
+++ b/src/test/regress/sql/join_hash_pglz.sql
@@ -0,0 +1,627 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage stays within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increase just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.39.5 (Apple Git-154)
[application/octet-stream] 0001-Add-transparent-compression-for-temporary-files.patch (21.0K, 4-0001-Add-transparent-compression-for-temporary-files.patch)
From 2e5e6bdc3d28e342f525fb0f1cf76719f7a3213b Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 31 Jul 2025 14:02:16 +0200
Subject: [PATCH 1/2] Add transparent compression for temporary files
This commit implements transparent compression for temporary files in PostgreSQL,
specifically designed for hash join operations that spill to disk.
Features:
- Support for LZ4 and PGLZ compression algorithms
- GUC parameter 'temp_file_compression' to control compression
- Transparent compression/decompression in BufFile layer
- Shared compression buffer to minimize memory allocation
- Integration: Hash join uses BufFileCreateCompressTemp(), tuplestore
uses compression when EXEC_FLAG_BACKWARD is not required
The compression is applied automatically when temp_file_compression is enabled,
with no changes required to calling code. Only hash joins use compression
currently, with seeking limited to rewinding to start.
Configuration options:
- temp_file_compression = 'no' (default)
- temp_file_compression = 'pglz'
- temp_file_compression = 'lz4' (requires --with-lz4)
This patch also adds documentation for guc-temp-file-compression (contributed by Lakshmi <[email protected]>).
---
doc/src/sgml/config.sgml | 20 +
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 374 +++++++++++++++++-
src/backend/utils/misc/guc_parameters.dat | 8 +
src/backend/utils/misc/guc_tables.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 5 +-
src/backend/utils/sort/tuplestore.c | 8 +-
src/include/storage/buffile.h | 10 +
8 files changed, 414 insertions(+), 23 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5560b95ee60..179e09a8604 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1955,6 +1955,26 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-temp-file-compression">
+ <term><varname>temp_file_compression</varname> (<type>enum</type>)</term>
+ <listitem>
+ <para>
+ Enables transparent compression of temporary files used by query execution.
+ Supported values are <literal>no</literal>, <literal>lz4</literal>, and
+ <literal>pglz</literal>.
+ </para>
+
+ <para>
+ The effectiveness of temporary file compression depends on the workload.
+ For example, temporary files created by hash join spills use fixed-size
+ chunks, so on-disk file sizes may not visibly shrink even when compression
+ is enabled. Statistics such as <literal>temp_bytes</literal> report logical
+ bytes written before compression.
+ </para>
+ </listitem>
+ </varlistentry>
+
+
<varlistentry id="guc-hash-mem-multiplier" xreflabel="hash_mem_multiplier">
<term><varname>hash_mem_multiplier</varname> (<type>floating point</type>)
<indexterm>
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 114620a8137..c2f1b2658e1 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false);
+ file = BufFileCreateCompressTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ddf3a410d6f..e6d2af0f815 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -53,6 +53,12 @@
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/resowner.h"
+#include "utils/memutils.h"
+#include "common/pg_lzcompress.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
/*
* We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
@@ -62,6 +68,11 @@
#define MAX_PHYSICAL_FILESIZE 0x40000000
#define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ)
+/*
+ * Optional transparent compression of temporary files. Disabled by default.
+ */
+int temp_file_compression = TEMP_NONE_COMPRESSION;
+
/*
* This data structure represents a buffered file that consists of one or
* more physical files (each accessed through a virtual file descriptor
@@ -101,8 +112,28 @@ struct BufFile
* wasting per-file alignment padding when some users create many files.
*/
PGAlignedBlock buffer;
+
+ int compress; /* enabled compression for the file */
+ char *cBuffer; /* compression buffer */
};
+/*
+ * Header written right before each chunk of data with compression enabled.
+ * The 'len' is the length of the data buffer written right after the header,
+ * and 'raw_len' is the length of uncompressed data. If the data ends up not
+ * being compressed (e.g. when pglz does not reach the compression ratio),
+ * the raw_len is set to -1 and the len is the raw (uncompressed) length.
+ *
+ * To make things simpler, we write these headers even for methods that do
+ * not fail (or rather when they fail, it's a proper error). The space for
+ * an extra integer seems negligible.
+ */
+typedef struct CompressHeader
+{
+ int len; /* data length (compressed, excluding header) */
+ int raw_len; /* raw length (-1: not compressed) */
+} CompressHeader;
+
static BufFile *makeBufFileCommon(int nfiles);
static BufFile *makeBufFile(File firstfile);
static void extendBufFile(BufFile *file);
@@ -127,6 +158,8 @@ makeBufFileCommon(int nfiles)
file->curOffset = 0;
file->pos = 0;
file->nbytes = 0;
+ file->compress = TEMP_NONE_COMPRESSION;
+ file->cBuffer = NULL;
return file;
}
@@ -215,6 +248,63 @@ BufFileCreateTemp(bool interXact)
return file;
}
+/*
+ * BufFileCreateCompressTemp
+ * Create a temporary file with transparent compression.
+ *
+ * The temporary files will use compression, depending on the current value of
+ * temp_file_compression GUC.
+ *
+ * Note: Compressed files do not support random access. A seek operation other
+ * than seek to the beginning of the buffile will corrupt data.
+ *
+ * Note: The compression algorithm is determined by temp_file_compression GUC.
+ * If set to "no" (TEMP_NONE_COMPRESSION), the file is not compressed.
+ *
+ */
+BufFile *
+BufFileCreateCompressTemp(bool interXact)
+{
+ BufFile *file = BufFileCreateTemp(interXact);
+
+ if (temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ int size = 0;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ size = LZ4_compressBound(BLCKSZ) + sizeof(CompressHeader);
+#else
+ elog(ERROR, "LZ4 is not supported by this build");
+#endif
+ break;
+ case TEMP_PGLZ_COMPRESSION:
+ size = pglz_maximum_compressed_size(BLCKSZ, BLCKSZ) + sizeof(CompressHeader);
+ break;
+ case TEMP_NONE_COMPRESSION:
+ /* no compression, nothing to do */
+ break;
+ default:
+ elog(ERROR, "unknown compression method: %d", temp_file_compression);
+ break;
+ }
+
+ if (size > 0)
+ {
+ file->compress = temp_file_compression;
+ file->cBuffer = palloc(size);
+ }
+ }
+
+ /* compression with buffer, or no compression and no buffer */
+ Assert((!file->compress && file->cBuffer == NULL) ||
+ (file->compress && file->cBuffer != NULL));
+
+ return file;
+}
+
/*
* Build the name for a given segment of a given BufFile.
*/
@@ -415,9 +505,15 @@ BufFileClose(BufFile *file)
/* flush any unwritten data */
BufFileFlush(file);
+
/* close and delete the underlying file(s) */
for (i = 0; i < file->numFiles; i++)
FileClose(file->files[i]);
+
+ /* release compression buffer if allocated */
+ if (file->cBuffer)
+ pfree(file->cBuffer);
+
/* release the buffer space */
pfree(file->files);
pfree(file);
@@ -454,21 +550,145 @@ BufFileLoadBuffer(BufFile *file)
else
INSTR_TIME_SET_ZERO(io_start);
- /*
- * Read whatever we can get, up to a full bufferload.
- */
- file->nbytes = FileRead(thisfile,
- file->buffer.data,
- sizeof(file->buffer.data),
- file->curOffset,
- WAIT_EVENT_BUFFILE_READ);
- if (file->nbytes < 0)
+ if (file->compress == TEMP_NONE_COMPRESSION)
{
- file->nbytes = 0;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not read file \"%s\": %m",
- FilePathName(thisfile))));
+ /*
+ * Read whatever we can get, up to a full bufferload.
+ */
+ file->nbytes = FileRead(thisfile,
+ file->buffer.data,
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (file->nbytes < 0)
+ {
+ file->nbytes = 0;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+ }
+ else
+ {
+ /*
+ * Read and decompress data from a temporary file. We first read the
+ * header with compressed/raw lengths, and then the compressed data.
+ */
+ int nread;
+ CompressHeader header;
+
+ nread = FileRead(thisfile,
+ &header,
+ sizeof(header),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+
+ /* did we read the length of the next buffer? */
+ if (nread == 0)
+ {
+ /* eof, nothing to do */
+ }
+ else if (nread != sizeof(header))
+ {
+ /* unexpected number of bytes, also covers (nread < 0) */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+ else
+ {
+ /* read length of compressed data, read (and decompress) data */
+ char *buff = file->cBuffer;
+
+ Assert(file->cBuffer != NULL);
+
+ /* advance past the length field */
+ file->curOffset += sizeof(header);
+
+ /*
+ * raw_len==-1 means the data was not compressed after all, which
+ * can happen e.g. for non-compressible data with pglz. In that
+ * case just copy the data in place. Otherwise do the
+ * decompression.
+ *
+ * XXX Maybe we should just do the FileRead first, and then either
+ * decompress or memcpy() for raw_len=-1. That'd be an extra
+ * memcpy, but it'd make the code simpler (this way we do the
+ * error checks twice, for each branch).
+ */
+ if (header.raw_len == -1)
+ {
+ nread = FileRead(thisfile,
+ file->buffer.data,
+ header.len,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (nread != header.len)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+
+ file->nbytes = nread;
+ file->curOffset += nread;
+ }
+ else
+ {
+ /*
+ * Read compressed data into the separate buffer, and then
+ * decompress into the target file buffer.
+ */
+ nread = FileRead(thisfile,
+ buff,
+ header.len,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (nread != header.len)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+
+ switch (file->compress)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data, header.len,
+ sizeof(file->buffer));
+#else
+ elog(ERROR, "LZ4 is not supported by this build");
+#endif
+ break;
+
+ case TEMP_PGLZ_COMPRESSION:
+ file->nbytes = pglz_decompress(buff, header.len,
+ file->buffer.data, header.raw_len, false);
+ break;
+ case TEMP_NONE_COMPRESSION:
+ /* no compression, nothing to do */
+ break;
+ default:
+ elog(ERROR, "unknown compression method: %d", file->compress);
+ break;
+ }
+ file->curOffset += nread;
+
+ if (file->nbytes < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compressed data is corrupt")));
+
+ /* should have got the expected length */
+ Assert(file->nbytes == header.raw_len);
+ }
+ }
}
if (track_io_timing)
@@ -494,8 +714,101 @@ static void
BufFileDumpBuffer(BufFile *file)
{
int64 wpos = 0;
- int64 bytestowrite;
+ int64 bytestowrite = 0;
File thisfile;
+ char *DataToWrite = file->buffer.data;
+ int nbytesOriginal = file->nbytes;
+
+ /*
+ * Compress the data if requested for this temporary file (and if enabled
+ * by the temp_file_compression GUC).
+ *
+ * The compressed data is written to the one shared compression buffer.
+ * There's only a single compression operation at any given time, so one
+ * buffer is enough.
+ *
+ * Then we simply point the "DataToWrite" buffer at the compressed buffer.
+ */
+ if (file->compress != TEMP_NONE_COMPRESSION)
+ {
+ char *cData;
+ int cSize = 0;
+ CompressHeader header;
+
+ Assert(file->cBuffer != NULL);
+ cData = file->cBuffer;
+
+ /* initialize the header for compression */
+ header.len = -1;
+ header.raw_len = nbytesOriginal;
+
+ switch (file->compress)
+ {
+ case TEMP_LZ4_COMPRESSION:
+ {
+#ifdef USE_LZ4
+ int cBufferSize = LZ4_compressBound(file->nbytes);
+
+ /*
+ * XXX We might use lz4 stream compression here. Depending
+ * on the data, that might improve the compression ratio.
+ * The length is stored at the beginning, we'll fill it in
+ * at the end.
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(CompressHeader),
+ file->nbytes, cBufferSize);
+ if (cSize == 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compression failed, compressed size %d, original size %d",
+ cSize, nbytesOriginal)));
+ }
+#else
+ elog(ERROR, "LZ4 is not supported by this build");
+#endif
+ break;
+ }
+ case TEMP_PGLZ_COMPRESSION:
+ cSize = pglz_compress(file->buffer.data, file->nbytes,
+ cData + sizeof(CompressHeader),
+ PGLZ_strategy_always);
+
+ /*
+ * pglz returns -1 for non-compressible data. In that case
+ * just copy the raw data into the output buffer.
+ */
+ if (cSize == -1)
+ {
+ memcpy(cData + sizeof(CompressHeader), file->buffer.data,
+ header.raw_len);
+
+ cSize = header.raw_len;
+ header.raw_len = -1;
+ }
+ break;
+ case TEMP_NONE_COMPRESSION:
+ /* no compression, nothing to do */
+ break;
+ default:
+ elog(ERROR, "unknown compression method: %d", file->compress);
+ break;
+ }
+
+ Assert(cSize != -1);
+ header.len = cSize;
+
+ /*
+ * Write the header with compressed length at the beginning of the
+ * buffer. We store both the compressed and raw lengths, and use
+ * raw_len=-1 when the data was not compressed after all.
+ */
+ memcpy(cData, &header, sizeof(CompressHeader));
+ file->nbytes = header.len + sizeof(CompressHeader);
+
+ DataToWrite = cData;
+ }
/*
* Unlike BufFileLoadBuffer, we must dump the whole buffer even if it
@@ -535,7 +848,7 @@ BufFileDumpBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
- file->buffer.data + wpos,
+ DataToWrite + wpos,
bytestowrite,
file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
@@ -564,7 +877,17 @@ BufFileDumpBuffer(BufFile *file)
* logical file position, ie, original value + pos, in case that is less
* (as could happen due to a small backwards seek in a dirty buffer!)
*/
- file->curOffset -= (file->nbytes - file->pos);
+ if (!file->compress)
+ file->curOffset -= (file->nbytes - file->pos);
+ else if (nbytesOriginal - file->pos != 0)
+ {
+ /*
+ * curOffset must be corrected also if compression is enabled, nbytes
+ * was changed by compression but we have to use the original value of
+ * nbytes
+ */
+ file->curOffset -= bytestowrite;
+ }
if (file->curOffset < 0) /* handle possible segment crossing */
{
file->curFile--;
@@ -602,8 +925,14 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
{
if (file->pos >= file->nbytes)
{
- /* Try to load more data into buffer. */
- file->curOffset += file->pos;
+ /*
+ * Try to load more data into buffer.
+ *
+ * curOffset is moved within BufFileLoadBuffer because stored data
+ * size differs from loaded/decompressed size
+ */
+ if (!file->compress)
+ file->curOffset += file->pos;
file->pos = 0;
file->nbytes = 0;
BufFileLoadBuffer(file);
@@ -742,6 +1071,10 @@ BufFileSeek(BufFile *file, int fileno, pgoff_t offset, int whence)
int newFile;
pgoff_t newOffset;
+ /* Compressed files only support seek to the beginning */
+ Assert(file->compress == TEMP_NONE_COMPRESSION ||
+ (whence == SEEK_SET && fileno == 0 && offset == 0));
+
switch (whence)
{
case SEEK_SET:
@@ -831,6 +1164,9 @@ BufFileSeek(BufFile *file, int fileno, pgoff_t offset, int whence)
void
BufFileTell(BufFile *file, int *fileno, pgoff_t *offset)
{
+ /* Tell doesn't work correctly for compressed files */
+ Assert(file->compress == TEMP_NONE_COMPRESSION);
+
*fileno = file->curFile;
*offset = file->curOffset + file->pos;
}
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..55bf207637f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2931,6 +2931,14 @@
check_hook => 'check_temp_buffers',
},
+{ name => 'temp_file_compression', type => 'enum', context => 'PGC_USERSET', group => 'CLIENT_CONN_STATEMENT',
+ short_desc => 'Sets the default compression method for temporary files.',
+ long_desc => 'Compression is applied transparently to temporary files used by hash joins. Note that the logged file sizes may appear the same regardless of compression method due to fixed-size chunking; actual disk space savings depend on data compressibility.',
+ variable => 'temp_file_compression',
+ boot_val => 'TEMP_NONE_COMPRESSION',
+ options => 'temp_file_compression_options',
+},
+
{ name => 'temp_file_limit', type => 'int', context => 'PGC_SUSET', group => 'RESOURCES_DISK',
short_desc => 'Limits the total size of all temporary files used by each process.',
long_desc => '-1 means no limit.',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 13c569d8790..945e00920cd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -78,6 +78,7 @@
#include "replication/syncrep.h"
#include "storage/aio.h"
#include "storage/bufmgr.h"
+#include "storage/buffile.h"
#include "storage/bufpage.h"
#include "storage/copydir.h"
#include "storage/io_worker.h"
@@ -464,6 +465,15 @@ static const struct config_enum_entry default_toast_compression_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry temp_file_compression_options[] = {
+ {"no", TEMP_NONE_COMPRESSION, false},
+ {"pglz", TEMP_PGLZ_COMPRESSION, false},
+#ifdef USE_LZ4
+ {"lz4", TEMP_LZ4_COMPRESSION, false},
+#endif
+ {NULL, 0, false}
+};
+
static const struct config_enum_entry wal_compression_options[] = {
{"pglz", WAL_COMPRESSION_PGLZ, false},
#ifdef USE_LZ4
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c4f92fcdac8..db23696d243 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -180,8 +180,9 @@
#file_copy_method = copy # copy, clone (if supported by OS)
-#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
- # for NOTIFY / LISTEN queue
+#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
+ # for NOTIFY / LISTEN queue
+#temp_file_compression = 'no' # enables temporary files compression
# - Kernel Resources -
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index e57fa5bca64..76695e4c9e4 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -860,7 +860,13 @@ tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
*/
oldcxt = MemoryContextSwitchTo(state->context->parent);
- state->myfile = BufFileCreateTemp(state->interXact);
+ /*
+ * If random access was requested, we can't compress the temp file.
+ */
+ if ((state->eflags & EXEC_FLAG_BACKWARD) != 0)
+ state->myfile = BufFileCreateTemp(state->interXact);
+ else
+ state->myfile = BufFileCreateCompressTemp(state->interXact);
MemoryContextSwitchTo(oldcxt);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 6754f37836c..2e52ca74bde 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -32,11 +32,21 @@
typedef struct BufFile BufFile;
+typedef enum
+{
+ TEMP_NONE_COMPRESSION,
+ TEMP_PGLZ_COMPRESSION,
+ TEMP_LZ4_COMPRESSION
+} TempCompression;
+
+extern PGDLLIMPORT int temp_file_compression;
+
/*
* prototypes for functions in buffile.c
*/
extern BufFile *BufFileCreateTemp(bool interXact);
+extern BufFile *BufFileCreateCompressTemp(bool interXact);
extern void BufFileClose(BufFile *file);
pg_nodiscard extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
extern void BufFileReadExact(BufFile *file, void *ptr, size_t size);
--
2.39.5 (Apple Git-154)
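The per-chunk layout that the patch above writes in BufFileDumpBuffer and reads back in BufFileLoadBuffer can be sketched outside PostgreSQL as follows. This is a minimal illustration, not the patch's code: ChunkHeader mirrors the patch's CompressHeader, but frame_chunk, unframe_chunk, and the toy_compress/toy_decompress pair are hypothetical names, and the toy codec (it only "compresses" a run of one repeated byte) stands in for pglz/LZ4 so the example has no external dependencies. Like pglz, the toy compressor returns -1 for incompressible input, which triggers the raw_len = -1 fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the patch's CompressHeader: two ints written before each chunk. */
typedef struct ChunkHeader
{
    int len;      /* bytes of payload that follow the header */
    int raw_len;  /* uncompressed length, or -1 if the payload is stored raw */
} ChunkHeader;

/* Hypothetical stand-in codec: handles only a run of one repeated byte. */
static int toy_compress(const char *src, int n, char *dst)
{
    if (n < 2)
        return -1;
    for (int i = 1; i < n; i++)
        if (src[i] != src[0])
            return -1;          /* "incompressible", like pglz returning -1 */
    dst[0] = src[0];
    return 1;
}

static int toy_decompress(const char *src, int len, char *dst, int raw_len)
{
    if (len != 1)
        return -1;
    memset(dst, src[0], (size_t) raw_len);
    return raw_len;
}

/*
 * Write header + payload into dst and return the total number of bytes.
 * dst must have room for sizeof(ChunkHeader) + n bytes.
 */
int frame_chunk(const char *src, int n, char *dst)
{
    ChunkHeader hdr;
    int csize = toy_compress(src, n, dst + sizeof(ChunkHeader));

    hdr.raw_len = n;
    if (csize < 0)
    {
        /* fall back to storing the raw bytes and flag that in the header */
        memcpy(dst + sizeof(ChunkHeader), src, (size_t) n);
        csize = n;
        hdr.raw_len = -1;
    }
    hdr.len = csize;
    memcpy(dst, &hdr, sizeof(hdr));
    return (int) sizeof(hdr) + csize;
}

/* Decode one chunk into dst; returns decoded bytes, or -1 if it looks corrupt. */
int unframe_chunk(const char *src, char *dst, int dstcap)
{
    ChunkHeader hdr;
    const char *payload = src + sizeof(ChunkHeader);

    memcpy(&hdr, src, sizeof(hdr));
    if (hdr.len < 0)
        return -1;
    if (hdr.raw_len == -1)
    {
        /* payload was stored uncompressed: copy it through */
        if (hdr.len > dstcap)
            return -1;
        memcpy(dst, payload, (size_t) hdr.len);
        return hdr.len;
    }
    if (hdr.raw_len < 0 || hdr.raw_len > dstcap)
        return -1;
    return toy_decompress(payload, hdr.len, dst, hdr.raw_len);
}
```

Because every chunk has a different on-disk size, a reader can only find chunk N by walking the headers from the start, which is consistent with the patch restricting seeks on compressed files to a rewind to offset 0.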
* Re: Proposal: Adding compression of temporary files
From: Filip Janus @ 2026-01-25 11:57 UTC (permalink / raw)
To: lakshmi <[email protected]>; +Cc: pgsql-hackers; [email protected]
Fixed spacing in the patch.
-Filip-
On Fri, Jan 23, 2026 at 17:40, Filip Janus <[email protected]> wrote:
> Hi all,
> Thanks for the feedback and the provided patch.
> I've addressed your findings and proposals. Lakshmi's documentation patch
> was incorporated.
>
> -Filip-
>
>
> On Wed, Jan 21, 2026 at 7:30, lakshmi <[email protected]> wrote:
>
>> Hi all,
>> While testing the temp file compression patch, I noticed that the new
>> temp_file_compression GUC isn't documented yet. I put together a small docs
>> patch to add a short description and clarify that the effect of compression
>> depends on the workload (for example, hash join spills may not show a visible
>> size reduction due to fixed-size chunks).
>>
>> The patch is attached. Happy to adjust the wording if needed.
>> Thanks,
>> lakshmi
>>
>> On Tue, Jan 20, 2026 at 4:21 PM lakshmi <[email protected]> wrote:
>>
>>> Hi Filip,
>>>
>>> I tested both patches on current master using git am -3. They apply
>>> cleanly, build fine, and the temp_file_compression GUC works as expected.
>>> Query results are unchanged.
>>>
>>> For the hash join spill test, temp files were created as expected, but the
>>> logged sizes were the same for no, lz4, and pglz, which seems consistent with
>>> fixed-size fileset chunking. It might be helpful to briefly note this in the
>>> documentation to avoid confusion.
>>>
>>> Thanks for working on this.
>>> Best regards,
>>> lakshmi
>>>
>>> On Tue, Jan 20, 2026 at 4:10 AM Zsolt Parragi <[email protected]>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I tried to review the code. It compiled, the test suite passed.
>>>>
>>>> I noticed two typos:
>>>>
>>>> buffile.c:77 - "Disaled"
>>>> buffile.c:133 - "mathods"
>>>>
>>>> And a few other small findings:
>>>>
>>>> buffile.h:35 and buffile.c:63 - same constants defined first as an
>>>> Enum and then as #defines - code builds properly without the defines.
>>>>
>>>> buffile.c:121 - compress_tempfile is defined, set to false at :167,
>>>> but never used otherwise
>>>>
>>>> guc_tables.c:470 - the comment says that pglz isn't supported yet, but
>>>> we have a value for it, and I see support for it in the code
>>>>
>>>> buffile.c:659: (and at other places) if USE_LZ4 is undefined, the
>>>> codepath doesn't do anything. I think these ifdefs should follow how
>>>> other compression code works, such as wal compression where there's an
>>>> #else path with elog(ERROR, ...)
>>>> Similarly, maybe there should be an explicit TEMP_NONE_COMPRESSION
>>>> branch that does nothing, and the default branch should be an error?
>>>>
>>>> buffile.c:265: If seek isn't supported/limited, shouldn't there be at
>>>> least an assertion about it in BufFileSeek? And tell isn't mentioned,
>>>> but it seems to me that tell also doesn't work properly.
>>>>
>>>
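The last two review points above (an #else error path when LZ4 is not compiled in, an explicit do-nothing branch for TEMP_NONE_COMPRESSION with an erroring default, and an assertion restricting seeks) can be sketched as standalone C. This is only an illustration under assumptions: the TempCompression enum values are taken from the patch, but MethodStatus, check_method, and seek_allowed are hypothetical names, and status codes stand in for elog(ERROR, ...) and Assert() so the example compiles without PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>              /* SEEK_SET, SEEK_CUR */

/* Same values as the enum the patch adds to buffile.h. */
typedef enum
{
    TEMP_NONE_COMPRESSION,
    TEMP_PGLZ_COMPRESSION,
    TEMP_LZ4_COMPRESSION
} TempCompression;

/* Hypothetical status codes standing in for elog(ERROR, ...) calls. */
typedef enum
{
    METHOD_OK,
    METHOD_NOT_BUILT,           /* "LZ4 is not supported by this build" */
    METHOD_UNKNOWN              /* "unknown compression method" */
} MethodStatus;

/* Every enum value gets an explicit branch; anything else is an error. */
MethodStatus check_method(int method)
{
    switch (method)
    {
        case TEMP_NONE_COMPRESSION:
            return METHOD_OK;   /* no compression, nothing to do */
        case TEMP_PGLZ_COMPRESSION:
            return METHOD_OK;
        case TEMP_LZ4_COMPRESSION:
#ifdef USE_LZ4
            return METHOD_OK;
#else
            return METHOD_NOT_BUILT;    /* never fall through silently */
#endif
        default:
            return METHOD_UNKNOWN;
    }
}

/*
 * The condition the review asks BufFileSeek to assert: uncompressed files
 * may seek anywhere, compressed files may only rewind to the very start.
 */
bool seek_allowed(int method, int file_no, long offset, int whence)
{
    return method == TEMP_NONE_COMPRESSION ||
        (whence == SEEK_SET && file_no == 0 && offset == 0);
}
```

Keeping the unsupported-build case distinct from the unknown-method case lets a caller report a misconfigured build differently from a corrupted or uninitialized method value.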
Attachments:
[application/octet-stream] 0001-Add-transparent-compression-for-temporary-files.patch (20.8K, 3-0001-Add-transparent-compression-for-temporary-files.patch)
From 98bc19517d188eca8411c85984a9e1fffb3591a9 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 31 Jul 2025 14:02:16 +0200
Subject: [PATCH 1/2] Add transparent compression for temporary files
This commit implements transparent compression for temporary files in PostgreSQL,
specifically designed for hash join operations that spill to disk.
Features:
- Support for LZ4 and PGLZ compression algorithms
- GUC parameter 'temp_file_compression' to control compression
- Transparent compression/decompression in BufFile layer
- Shared compression buffer to minimize memory allocation
- Integration: Hash join uses BufFileCreateCompressTemp(), tuplestore
uses compression when EXEC_FLAG_BACKWARD is not required
The compression is applied automatically when temp_file_compression is enabled,
with no changes required to calling code. Only hash joins use compression
currently, with seeking limited to rewinding to start.
Configuration options:
- temp_file_compression = 'no' (default)
- temp_file_compression = 'pglz'
- temp_file_compression = 'lz4' (requires --with-lz4)
This patch also adds documentation for guc-temp-file-compression (contributed by Lakshmi <[email protected]>).
---
doc/src/sgml/config.sgml | 20 +
src/backend/executor/nodeHashjoin.c | 2 +-
src/backend/storage/file/buffile.c | 374 +++++++++++++++++-
src/backend/utils/misc/guc_parameters.dat | 8 +
src/backend/utils/misc/guc_tables.c | 10 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/sort/tuplestore.c | 8 +-
src/include/storage/buffile.h | 10 +
8 files changed, 412 insertions(+), 21 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5560b95ee60..179e09a8604 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1955,6 +1955,26 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-temp-file-compression">
+ <term><varname>temp_file_compression</varname> (<type>enum</type>)</term>
+ <listitem>
+ <para>
+ Enables transparent compression of temporary files used by query execution.
+ Supported values are <literal>no</literal>, <literal>lz4</literal>, and
+ <literal>pglz</literal>.
+ </para>
+
+ <para>
+ The effectiveness of temporary file compression depends on the workload.
+ For example, temporary files created by hash join spills use fixed-size
+ chunks, so on-disk file sizes may not visibly shrink even when compression
+ is enabled. Statistics such as <literal>temp_bytes</literal> report logical
+ bytes written before compression.
+ </para>
+ </listitem>
+ </varlistentry>
+
+
<varlistentry id="guc-hash-mem-multiplier" xreflabel="hash_mem_multiplier">
<term><varname>hash_mem_multiplier</varname> (<type>floating point</type>)
<indexterm>
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 114620a8137..c2f1b2658e1 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1434,7 +1434,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
- file = BufFileCreateTemp(false);
+ file = BufFileCreateCompressTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index ddf3a410d6f..e6d2af0f815 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -53,6 +53,12 @@
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "utils/resowner.h"
+#include "utils/memutils.h"
+#include "common/pg_lzcompress.h"
+
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
/*
* We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
@@ -62,6 +68,11 @@
#define MAX_PHYSICAL_FILESIZE 0x40000000
#define BUFFILE_SEG_SIZE (MAX_PHYSICAL_FILESIZE / BLCKSZ)
+/*
+ * Optional transparent compression of temporary files. Disabled by default.
+ */
+int temp_file_compression = TEMP_NONE_COMPRESSION;
+
/*
* This data structure represents a buffered file that consists of one or
* more physical files (each accessed through a virtual file descriptor
@@ -101,8 +112,28 @@ struct BufFile
* wasting per-file alignment padding when some users create many files.
*/
PGAlignedBlock buffer;
+
+ int compress; /* enabled compression for the file */
+ char *cBuffer; /* compression buffer */
};
+/*
+ * Header written right before each chunk of data with compression enabled.
+ * The 'len' is the length of the data buffer written right after the header,
+ * and 'raw_len' is the length of uncompressed data. If the data ends up not
+ * being compressed (e.g. when pglz does not reach the compression ratio),
+ * the raw_len is set to -1 and the len is the raw (uncompressed) length.
+ *
+ * To make things simpler, we write these headers even for methods that do
+ * not fail (or rather when they fail, it's a proper error). The space for
+ * an extra integer seems negligible.
+ */
+typedef struct CompressHeader
+{
+ int len; /* data length (compressed, excluding header) */
+ int raw_len; /* raw length (-1: not compressed) */
+} CompressHeader;
+
static BufFile *makeBufFileCommon(int nfiles);
static BufFile *makeBufFile(File firstfile);
static void extendBufFile(BufFile *file);
@@ -127,6 +158,8 @@ makeBufFileCommon(int nfiles)
file->curOffset = 0;
file->pos = 0;
file->nbytes = 0;
+ file->compress = TEMP_NONE_COMPRESSION;
+ file->cBuffer = NULL;
return file;
}
@@ -215,6 +248,63 @@ BufFileCreateTemp(bool interXact)
return file;
}
+/*
+ * BufFileCreateCompressTemp
+ * Create a temporary file with transparent compression.
+ *
+ * The temporary files will use compression, depending on the current value of
+ * temp_file_compression GUC.
+ *
+ * Note: Compressed files do not support random access. A seek operation other
+ * than seek to the beginning of the buffile will corrupt data.
+ *
+ * Note: The compression algorithm is determined by temp_file_compression GUC.
+ * If set to "no" (TEMP_NONE_COMPRESSION), the file is not compressed.
+ *
+ */
+BufFile *
+BufFileCreateCompressTemp(bool interXact)
+{
+ BufFile *file = BufFileCreateTemp(interXact);
+
+ if (temp_file_compression != TEMP_NONE_COMPRESSION)
+ {
+ int size = 0;
+
+ switch (temp_file_compression)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ size = LZ4_compressBound(BLCKSZ) + sizeof(CompressHeader);
+#else
+ elog(ERROR, "LZ4 is not supported by this build");
+#endif
+ break;
+ case TEMP_PGLZ_COMPRESSION:
+ size = PGLZ_MAX_OUTPUT(BLCKSZ) + sizeof(CompressHeader);
+ break;
+ case TEMP_NONE_COMPRESSION:
+ /* no compression, nothing to do */
+ break;
+ default:
+ elog(ERROR, "unknown compression method: %d", temp_file_compression);
+ break;
+ }
+
+ if (size > 0)
+ {
+ file->compress = temp_file_compression;
+ file->cBuffer = palloc(size);
+ }
+ }
+
+ /* compression with buffer, or no compression and no buffer */
+ Assert((!file->compress && file->cBuffer == NULL) ||
+ (file->compress && file->cBuffer != NULL));
+
+ return file;
+}
+
/*
* Build the name for a given segment of a given BufFile.
*/
@@ -415,9 +505,15 @@ BufFileClose(BufFile *file)
/* flush any unwritten data */
BufFileFlush(file);
+
/* close and delete the underlying file(s) */
for (i = 0; i < file->numFiles; i++)
FileClose(file->files[i]);
+
+ /* release compression buffer if allocated */
+ if (file->cBuffer)
+ pfree(file->cBuffer);
+
/* release the buffer space */
pfree(file->files);
pfree(file);
@@ -454,21 +550,145 @@ BufFileLoadBuffer(BufFile *file)
else
INSTR_TIME_SET_ZERO(io_start);
- /*
- * Read whatever we can get, up to a full bufferload.
- */
- file->nbytes = FileRead(thisfile,
- file->buffer.data,
- sizeof(file->buffer.data),
- file->curOffset,
- WAIT_EVENT_BUFFILE_READ);
- if (file->nbytes < 0)
+ if (file->compress == TEMP_NONE_COMPRESSION)
{
- file->nbytes = 0;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not read file \"%s\": %m",
- FilePathName(thisfile))));
+ /*
+ * Read whatever we can get, up to a full bufferload.
+ */
+ file->nbytes = FileRead(thisfile,
+ file->buffer.data,
+ sizeof(file->buffer),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (file->nbytes < 0)
+ {
+ file->nbytes = 0;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+ }
+ else
+ {
+ /*
+ * Read and decompress data from a temporary file. We first read the
+ * header with compressed/raw lengths, and then the compressed data.
+ */
+ int nread;
+ CompressHeader header;
+
+ nread = FileRead(thisfile,
+ &header,
+ sizeof(header),
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+
+ /* did we read the length of the next buffer? */
+ if (nread == 0)
+ {
+ /* eof, nothing to do */
+ }
+ else if (nread != sizeof(header))
+ {
+ /* unexpected number of bytes, also covers (nread < 0) */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+ else
+ {
+ /* read length of compressed data, read (and decompress) data */
+ char *buff = file->cBuffer;
+
+ Assert(file->cBuffer != NULL);
+
+ /* advance past the length field */
+ file->curOffset += sizeof(header);
+
+ /*
+ * raw_len==-1 means the data was not compressed after all, which
+ * can happen e.g. for non-compressible data with pglz. In that
+ * case just copy the data in place. Otherwise do the
+ * decompression.
+ *
+ * XXX Maybe we should just do the FileRead first, and then either
+ * decompress or memcpy() for raw_len=-1. That'd be an extra
+ * memcpy, but it'd make the code simpler (this way we do the
+ * error checks twice, for each branch).
+ */
+ if (header.raw_len == -1)
+ {
+ nread = FileRead(thisfile,
+ file->buffer.data,
+ header.len,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (nread != header.len)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+
+ file->nbytes = nread;
+ file->curOffset += nread;
+ }
+ else
+ {
+ /*
+ * Read compressed data into the separate buffer, and then
+ * decompress into the target file buffer.
+ */
+ nread = FileRead(thisfile,
+ buff,
+ header.len,
+ file->curOffset,
+ WAIT_EVENT_BUFFILE_READ);
+ if (nread != header.len)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(thisfile))));
+ }
+
+ switch (file->compress)
+ {
+ case TEMP_LZ4_COMPRESSION:
+#ifdef USE_LZ4
+ file->nbytes = LZ4_decompress_safe(buff,
+ file->buffer.data, header.len,
+ sizeof(file->buffer));
+#else
+ elog(ERROR, "LZ4 is not supported by this build");
+#endif
+ break;
+
+ case TEMP_PGLZ_COMPRESSION:
+ file->nbytes = pglz_decompress(buff, header.len,
+ file->buffer.data, header.raw_len, false);
+ break;
+ case TEMP_NONE_COMPRESSION:
+ /* no compression, nothing to do */
+ break;
+ default:
+ elog(ERROR, "unknown compression method: %d", file->compress);
+ break;
+ }
+ file->curOffset += nread;
+
+ if (file->nbytes < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compressed data is corrupt")));
+
+ /* should have got the expected length */
+ Assert(file->nbytes == header.raw_len);
+ }
+ }
}
if (track_io_timing)
@@ -494,8 +714,101 @@ static void
BufFileDumpBuffer(BufFile *file)
{
int64 wpos = 0;
- int64 bytestowrite;
+ int64 bytestowrite = 0;
File thisfile;
+ char *DataToWrite = file->buffer.data;
+ int nbytesOriginal = file->nbytes;
+
+ /*
+ * Compress the data if requested for this temporary file (and if enabled
+ * by the temp_file_compression GUC).
+ *
+ * The compressed data is written to the one shared compression buffer.
+ * There's only a single compression operation at any given time, so one
+ * buffer is enough.
+ *
+ * Then we simply point the "DataToWrite" buffer at the compressed buffer.
+ */
+ if (file->compress != TEMP_NONE_COMPRESSION)
+ {
+ char *cData;
+ int cSize = 0;
+ CompressHeader header;
+
+ Assert(file->cBuffer != NULL);
+ cData = file->cBuffer;
+
+ /* initialize the header for compression */
+ header.len = -1;
+ header.raw_len = nbytesOriginal;
+
+ switch (file->compress)
+ {
+ case TEMP_LZ4_COMPRESSION:
+ {
+#ifdef USE_LZ4
+ int cBufferSize = LZ4_compressBound(file->nbytes);
+
+ /*
+ * XXX We might use lz4 stream compression here. Depending
+ * on the data, that might improve the compression ratio.
+ * The length is stored at the beginning, we'll fill it in
+ * at the end.
+ */
+ cSize = LZ4_compress_default(file->buffer.data,
+ cData + sizeof(CompressHeader),
+ file->nbytes, cBufferSize);
+ if (cSize == 0)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("compression failed, compressed size %d, original size %d",
+ cSize, nbytesOriginal)));
+ }
+#else
+ elog(ERROR, "LZ4 is not supported by this build");
+#endif
+ break;
+ }
+ case TEMP_PGLZ_COMPRESSION:
+ cSize = pglz_compress(file->buffer.data, file->nbytes,
+ cData + sizeof(CompressHeader),
+ PGLZ_strategy_always);
+
+ /*
+ * pglz returns -1 for non-compressible data. In that case
+ * just copy the raw data into the output buffer.
+ */
+ if (cSize == -1)
+ {
+ memcpy(cData + sizeof(CompressHeader), file->buffer.data,
+ header.raw_len);
+
+ cSize = header.raw_len;
+ header.raw_len = -1;
+ }
+ break;
+ case TEMP_NONE_COMPRESSION:
+ /* no compression, nothing to do */
+ break;
+ default:
+ elog(ERROR, "unknown compression method: %d", file->compress);
+ break;
+ }
+
+ Assert(cSize != -1);
+ header.len = cSize;
+
+ /*
+ * Write the header with compressed length at the beginning of the
+ * buffer. We store both the compressed and raw lengths, and use
+ * raw_len=-1 when the data was not compressed after all.
+ */
+ memcpy(cData, &header, sizeof(CompressHeader));
+ file->nbytes = header.len + sizeof(CompressHeader);
+
+ DataToWrite = cData;
+ }
/*
* Unlike BufFileLoadBuffer, we must dump the whole buffer even if it
@@ -535,7 +848,7 @@ BufFileDumpBuffer(BufFile *file)
INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
- file->buffer.data + wpos,
+ DataToWrite + wpos,
bytestowrite,
file->curOffset,
WAIT_EVENT_BUFFILE_WRITE);
@@ -564,7 +877,17 @@ BufFileDumpBuffer(BufFile *file)
* logical file position, ie, original value + pos, in case that is less
* (as could happen due to a small backwards seek in a dirty buffer!)
*/
- file->curOffset -= (file->nbytes - file->pos);
+ if (!file->compress)
+ file->curOffset -= (file->nbytes - file->pos);
+ else if (nbytesOriginal - file->pos != 0)
+ {
+ /*
+ * curOffset must also be corrected when compression is enabled.
+ * file->nbytes now holds the compressed size, so the check above
+ * uses the original (uncompressed) value saved in nbytesOriginal.
+ */
+ file->curOffset -= bytestowrite;
+ }
if (file->curOffset < 0) /* handle possible segment crossing */
{
file->curFile--;
@@ -602,8 +925,14 @@ BufFileReadCommon(BufFile *file, void *ptr, size_t size, bool exact, bool eofOK)
{
if (file->pos >= file->nbytes)
{
- /* Try to load more data into buffer. */
- file->curOffset += file->pos;
+ /*
+ * Try to load more data into buffer.
+ *
+ * For compressed files, curOffset is advanced within BufFileLoadBuffer,
+ * because the stored size differs from the loaded (decompressed) size.
+ */
+ if (!file->compress)
+ file->curOffset += file->pos;
file->pos = 0;
file->nbytes = 0;
BufFileLoadBuffer(file);
@@ -742,6 +1071,10 @@ BufFileSeek(BufFile *file, int fileno, pgoff_t offset, int whence)
int newFile;
pgoff_t newOffset;
+ /* Compressed files only support seek to the beginning */
+ Assert(file->compress == TEMP_NONE_COMPRESSION ||
+ (whence == SEEK_SET && fileno == 0 && offset == 0));
+
switch (whence)
{
case SEEK_SET:
@@ -831,6 +1164,9 @@ BufFileSeek(BufFile *file, int fileno, pgoff_t offset, int whence)
void
BufFileTell(BufFile *file, int *fileno, pgoff_t *offset)
{
+ /* Tell doesn't work correctly for compressed files */
+ Assert(file->compress == TEMP_NONE_COMPRESSION);
+
*fileno = file->curFile;
*offset = file->curOffset + file->pos;
}
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..55bf207637f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2931,6 +2931,14 @@
check_hook => 'check_temp_buffers',
},
+{ name => 'temp_file_compression', type => 'enum', context => 'PGC_USERSET', group => 'CLIENT_CONN_STATEMENT',
+ short_desc => 'Sets the default compression method for temporary files.',
+ long_desc => 'Compression is applied transparently to temporary files used by hash joins and by tuplestores that do not require backward scans. Note that the logged file sizes may appear the same regardless of compression method due to fixed-size chunking; actual disk space savings depend on data compressibility.',
+ variable => 'temp_file_compression',
+ boot_val => 'TEMP_NONE_COMPRESSION',
+ options => 'temp_file_compression_options',
+},
+
{ name => 'temp_file_limit', type => 'int', context => 'PGC_SUSET', group => 'RESOURCES_DISK',
short_desc => 'Limits the total size of all temporary files used by each process.',
long_desc => '-1 means no limit.',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 13c569d8790..945e00920cd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -78,6 +78,7 @@
#include "replication/syncrep.h"
#include "storage/aio.h"
#include "storage/bufmgr.h"
+#include "storage/buffile.h"
#include "storage/bufpage.h"
#include "storage/copydir.h"
#include "storage/io_worker.h"
@@ -464,6 +465,15 @@ static const struct config_enum_entry default_toast_compression_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry temp_file_compression_options[] = {
+ {"no", TEMP_NONE_COMPRESSION, false},
+ {"pglz", TEMP_PGLZ_COMPRESSION, false},
+#ifdef USE_LZ4
+ {"lz4", TEMP_LZ4_COMPRESSION, false},
+#endif
+ {NULL, 0, false}
+};
+
static const struct config_enum_entry wal_compression_options[] = {
{"pglz", WAL_COMPRESSION_PGLZ, false},
#ifdef USE_LZ4
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c4f92fcdac8..aab9af8a040 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -182,6 +182,7 @@
#max_notify_queue_pages = 1048576 # limits the number of SLRU pages allocated
# for NOTIFY / LISTEN queue
+#temp_file_compression = 'no' # enables temporary files compression
# - Kernel Resources -
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index e57fa5bca64..76695e4c9e4 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -860,7 +860,13 @@ tuplestore_puttuple_common(Tuplestorestate *state, void *tuple)
*/
oldcxt = MemoryContextSwitchTo(state->context->parent);
- state->myfile = BufFileCreateTemp(state->interXact);
+ /*
+ * If random access was requested, we can't compress the temp file.
+ */
+ if ((state->eflags & EXEC_FLAG_BACKWARD) != 0)
+ state->myfile = BufFileCreateTemp(state->interXact);
+ else
+ state->myfile = BufFileCreateCompressTemp(state->interXact);
MemoryContextSwitchTo(oldcxt);
diff --git a/src/include/storage/buffile.h b/src/include/storage/buffile.h
index 6754f37836c..2e52ca74bde 100644
--- a/src/include/storage/buffile.h
+++ b/src/include/storage/buffile.h
@@ -32,11 +32,21 @@
typedef struct BufFile BufFile;
+typedef enum
+{
+ TEMP_NONE_COMPRESSION,
+ TEMP_PGLZ_COMPRESSION,
+ TEMP_LZ4_COMPRESSION
+} TempCompression;
+
+extern PGDLLIMPORT int temp_file_compression;
+
/*
* prototypes for functions in buffile.c
*/
extern BufFile *BufFileCreateTemp(bool interXact);
+extern BufFile *BufFileCreateCompressTemp(bool interXact);
extern void BufFileClose(BufFile *file);
pg_nodiscard extern size_t BufFileRead(BufFile *file, void *ptr, size_t size);
extern void BufFileReadExact(BufFile *file, void *ptr, size_t size);
--
2.39.5 (Apple Git-154)
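For reference, the worst-case buffer arithmetic behind the cBuffer allocation in the patch above can be worked through as follows. This is a sketch under stated assumptions: the default BLCKSZ of 8192 and a 4-byte int. The helper names are hypothetical; the formulas reproduce the arithmetic of the LZ4_COMPRESSBOUND macro in lz4.h (for valid input sizes) and of PGLZ_MAX_OUTPUT in common/pg_lzcompress.h.

```c
#include <stddef.h>

/* Assumed default PostgreSQL block size; real builds may differ. */
#define SKETCH_BLCKSZ 8192

/* Same shape as the patch's CompressHeader: two ints. */
typedef struct SketchHeader
{
    int len;
    int raw_len;
} SketchHeader;

/* LZ4_COMPRESSBOUND(n) for valid n: n + n/255 + 16. */
static size_t
sketch_lz4_bound(size_t n)
{
    return n + n / 255 + 16;
}

/* PGLZ_MAX_OUTPUT(n): destination size pglz_compress() asks for. */
static size_t
sketch_pglz_max_output(size_t n)
{
    return n + 4;
}

/* Header plus worst-case payload, as one contiguous staging buffer. */
static size_t
sketch_chunk_buffer_size(size_t payload_bound)
{
    return sizeof(SketchHeader) + payload_bound;
}
```

With these assumptions, an 8192-byte block needs up to 8192 + 32 + 16 = 8240 payload bytes for LZ4 and 8192 + 4 = 8196 for pglz, so the staging buffers come to 8248 and 8204 bytes once the 8-byte header is included. It also shows that an incompressible block can occupy slightly more than BLCKSZ bytes on disk.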
[application/octet-stream] 0002-Add-regression-tests-for-temporary-file-compression.patch (127.6K, 4-0002-Add-regression-tests-for-temporary-file-compression.patch)
From b0c2737591fa48abd01965d6204d47efb9771436 Mon Sep 17 00:00:00 2001
From: Filip Janus <[email protected]>
Date: Thu, 31 Jul 2025 14:02:45 +0200
Subject: [PATCH 2/2] Add regression tests for temporary file compression
This commit adds comprehensive regression tests for the transparent
temporary file compression feature.

Test coverage:
- join_hash_lz4.sql: Tests hash join operations with LZ4 compression
- join_hash_pglz.sql: Tests hash join operations with PGLZ compression
- Both tests verify compression works correctly for various hash join scenarios
- Expected output files for validation

Test integration:
- LZ4 tests are conditionally enabled when PostgreSQL is built with --with-lz4
- PGLZ tests are always enabled as PGLZ is built-in
- Tests added to parallel regression test schedule
- GNUmakefile updated to include conditional LZ4 test execution

The tests ensure that compression/decompression works transparently
without affecting query results, while providing coverage for both
supported compression algorithms.
---
src/Makefile.global.in | 1 +
src/test/regress/GNUmakefile | 4 +
src/test/regress/expected/join_hash_lz4.out | 1167 ++++++++++++++++++
src/test/regress/expected/join_hash_pglz.out | 1167 ++++++++++++++++++
src/test/regress/parallel_schedule | 4 +-
src/test/regress/sql/join_hash_lz4.sql | 627 ++++++++++
src/test/regress/sql/join_hash_pglz.sql | 627 ++++++++++
7 files changed, 3596 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/join_hash_lz4.out
create mode 100644 src/test/regress/expected/join_hash_pglz.out
create mode 100644 src/test/regress/sql/join_hash_lz4.sql
create mode 100644 src/test/regress/sql/join_hash_pglz.sql
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 947a2d79e29..87c950cca82 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -201,6 +201,7 @@ with_liburing = @with_liburing@
with_libxml = @with_libxml@
with_libxslt = @with_libxslt@
with_llvm = @with_llvm@
+with_lz4 = @with_lz4@
with_system_tzdata = @with_system_tzdata@
with_uuid = @with_uuid@
with_zlib = @with_zlib@
diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile
index a8ba19e5971..5c4e25bc07c 100644
--- a/src/test/regress/GNUmakefile
+++ b/src/test/regress/GNUmakefile
@@ -94,6 +94,10 @@ installdirs-tests: installdirs
REGRESS_OPTS = --dlpath=. --max-concurrent-tests=20 \
$(EXTRA_REGRESS_OPTS)
+ifeq ($(with_lz4),yes)
+override EXTRA_TESTS := $(EXTRA_TESTS) join_hash_lz4
+endif
+
check: all
$(pg_regress_check) $(REGRESS_OPTS) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
diff --git a/src/test/regress/expected/join_hash_lz4.out b/src/test/regress/expected/join_hash_lz4.out
new file mode 100644
index 00000000000..78367bf3719
--- /dev/null
+++ b/src/test/regress/expected/join_hash_lz4.out
@@ -0,0 +1,1167 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan expr_1)) AND ((SubPlan expr_2) = (SubPlan expr_3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan expr_1) = hjtest_1.id) AND ((SubPlan expr_3) = (SubPlan expr_2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/expected/join_hash_pglz.out b/src/test/regress/expected/join_hash_pglz.out
new file mode 100644
index 00000000000..c6ae5e255d2
--- /dev/null
+++ b/src/test/regress/expected/join_hash_pglz.out
@@ -0,0 +1,1167 @@
+--
+-- exercises for the hash join code
+--
+begin;
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | f
+(1 row)
+
+rollback to settings;
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ t | f
+(1 row)
+
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ QUERY PLAN
+------------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(6 rows)
+
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+ QUERY PLAN
+---------------------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Partial Aggregate
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on bigger_than_it_looks s
+(9 rows)
+
+select count(*) from simple r join bigger_than_it_looks s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+ initially_multibatch | increased_batches
+----------------------+-------------------
+ f | t
+(1 row)
+
+rollback to settings;
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------
+ Aggregate
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(6 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+--------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 2
+(1 row)
+
+rollback to settings;
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+ QUERY PLAN
+-----------------------------------------------------------------
+ Aggregate
+ -> Gather
+ Workers Planned: 1
+ -> Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on extremely_skewed s
+(8 rows)
+
+select count(*) from simple r join extremely_skewed s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 4
+(1 row)
+
+rollback to settings;
+-- A couple of other hash join tests unrelated to work_mem management.
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+ original | final
+----------+-------
+ 1 | 1
+(1 row)
+
+rollback to settings;
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Hash
+ -> Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ QUERY PLAN
+------------------------------------------------------------------------------------
+ Aggregate
+ -> Nested Loop Left Join
+ Join Filter: ((join_foo.id < (b1.id + 1)) AND (join_foo.id > (b1.id - 1)))
+ -> Seq Scan on join_foo
+ -> Gather
+ Workers Planned: 2
+ -> Parallel Hash Join
+ Hash Cond: (b1.id = b2.id)
+ -> Parallel Seq Scan on join_bar b1
+ -> Parallel Hash
+ -> Parallel Seq Scan on join_bar b2
+(11 rows)
+
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+ count
+-------
+ 3
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+ multibatch
+------------
+ f
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Seq Scan on simple r
+ -> Hash
+ -> Seq Scan on simple s
+(6 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: (r.id = s.id)
+ -> Parallel Seq Scan on simple r
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple s
+(9 rows)
+
+select count(*) from simple r full outer join simple s using (id);
+ count
+-------
+ 20000
+(1 row)
+
+rollback to settings;
+-- A full outer join where every record is not matched.
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+----------------------------------------
+ Aggregate
+ -> Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Seq Scan on simple s
+ -> Hash
+ -> Seq Scan on simple r
+(6 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ QUERY PLAN
+-------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Full Join
+ Hash Cond: ((0 - s.id) = r.id)
+ -> Parallel Seq Scan on simple s
+ -> Parallel Hash
+ -> Parallel Seq Scan on simple r
+(9 rows)
+
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+ count
+-------
+ 40000
+(1 row)
+
+rollback to settings;
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize Aggregate
+ -> Gather
+ Workers Planned: 2
+ -> Partial Aggregate
+ -> Parallel Hash Left Join
+ Hash Cond: (wide.id = wide_1.id)
+ -> Parallel Seq Scan on wide
+ -> Parallel Hash
+ -> Parallel Seq Scan on wide wide_1
+(9 rows)
+
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+ length
+--------
+ 320000
+(1 row)
+
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+ multibatch
+------------
+ t
+(1 row)
+
+rollback to settings;
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ id | id
+----+----
+ 1 |
+ | 2
+(2 rows)
+
+ROLLBACK TO settings;
+rollback;
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: ((hjtest_1.id = (SubPlan expr_1)) AND ((SubPlan expr_2) = (SubPlan expr_3)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ -> Hash
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Hash Join
+ Output: hjtest_1.a, hjtest_2.a, (hjtest_1.tableoid)::regclass, (hjtest_2.tableoid)::regclass
+ Hash Cond: (((SubPlan expr_1) = hjtest_1.id) AND ((SubPlan expr_3) = (SubPlan expr_2)))
+ Join Filter: (hjtest_1.a <> hjtest_2.b)
+ -> Seq Scan on public.hjtest_2
+ Output: hjtest_2.a, hjtest_2.tableoid, hjtest_2.id, hjtest_2.c, hjtest_2.b
+ Filter: ((SubPlan expr_5) < 55)
+ SubPlan expr_5
+ -> Result
+ Output: (hjtest_2.c * 5)
+ -> Hash
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ -> Seq Scan on public.hjtest_1
+ Output: hjtest_1.a, hjtest_1.tableoid, hjtest_1.id, hjtest_1.b
+ Filter: ((SubPlan expr_4) < 50)
+ SubPlan expr_4
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_2
+ -> Result
+ Output: (hjtest_1.b * 5)
+ SubPlan expr_1
+ -> Result
+ Output: 1
+ One-Time Filter: (hjtest_2.id = 1)
+ SubPlan expr_3
+ -> Result
+ Output: (hjtest_2.c * 5)
+(28 rows)
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+ a1 | a2 | t1 | t2
+------+----+----------+----------
+ text | t | hjtest_1 | hjtest_2
+(1 row)
+
+ROLLBACK;
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ QUERY PLAN
+-----------------------------------------------------------
+ Nested Loop
+ -> Seq Scan on int8_tbl i8
+ -> Sort
+ Sort Key: t1.fivethous, i4.f1
+ -> Hash Join
+ Hash Cond: (t1.fivethous = (i4.f1 + i8.q2))
+ -> Seq Scan on tenk1 t1
+ -> Hash
+ -> Seq Scan on int4_tbl i4
+(9 rows)
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+ q2 | fivethous | f1
+-----+-----------+----
+ 456 | 456 | 0
+ 456 | 456 | 0
+ 123 | 123 | 0
+ 123 | 123 | 0
+(4 rows)
+
+rollback;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 021d57f66bb..eeb06407d11 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -15,7 +15,6 @@ test: test_setup
# The first group of parallel tests
# ----------
test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc
-
# ----------
# The second group of parallel tests
# multirangetypes depends on rangetypes
@@ -140,3 +139,6 @@ test: fast_default
# run tablespace test at the end because it drops the tablespace created during
# setup that other tests may use.
test: tablespace
+
+# this test is equivalent to the join_hash test, except that temp file compression is enabled
+test: join_hash_pglz
diff --git a/src/test/regress/sql/join_hash_lz4.sql b/src/test/regress/sql/join_hash_lz4.sql
new file mode 100644
index 00000000000..7f79242c306
--- /dev/null
+++ b/src/test/regress/sql/join_hash_lz4.sql
@@ -0,0 +1,627 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'lz4';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
diff --git a/src/test/regress/sql/join_hash_pglz.sql b/src/test/regress/sql/join_hash_pglz.sql
new file mode 100644
index 00000000000..a7529ec3400
--- /dev/null
+++ b/src/test/regress/sql/join_hash_pglz.sql
@@ -0,0 +1,627 @@
+--
+-- exercises for the hash join code
+--
+
+begin;
+
+set local min_parallel_table_scan_size = 0;
+set local parallel_setup_cost = 0;
+set local enable_hashjoin = on;
+set local temp_file_compression = 'pglz';
+
+-- Extract bucket and batch counts from an explain analyze plan. In
+-- general we can't make assertions about how many batches (or
+-- buckets) will be required because it can vary, but we can in some
+-- special cases and we can check for growth.
+create or replace function find_hash(node json)
+returns json language plpgsql
+as
+$$
+declare
+ x json;
+ child json;
+begin
+ if node->>'Node Type' = 'Hash' then
+ return node;
+ else
+ for child in select json_array_elements(node->'Plans')
+ loop
+ x := find_hash(child);
+ if x is not null then
+ return x;
+ end if;
+ end loop;
+ return null;
+ end if;
+end;
+$$;
+create or replace function hash_join_batches(query text)
+returns table (original int, final int) language plpgsql
+as
+$$
+declare
+ whole_plan json;
+ hash_node json;
+begin
+ for whole_plan in
+ execute 'explain (analyze, format ''json'') ' || query
+ loop
+ hash_node := find_hash(json_extract_path(whole_plan, '0', 'Plan'));
+ original := hash_node->>'Original Hash Batches';
+ final := hash_node->>'Hash Batches';
+ return next;
+ end loop;
+end;
+$$;
+
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+ select generate_series(1, 20000) AS id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+
+-- Make a relation whose size we will under-estimate. We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+ select generate_series(1, 20000) as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+analyze bigger_than_it_looks;
+update pg_class set reltuples = 1000 where relname = 'bigger_than_it_looks';
+
+-- Make a relation whose size we underestimate and that also has a
+-- kind of skew that breaks our batching scheme. We want stats to say
+-- 2 rows, but actually there are 20,000 rows with the same key.
+create table extremely_skewed (id int, t text);
+alter table extremely_skewed set (autovacuum_enabled = 'false');
+alter table extremely_skewed set (parallel_workers = 2);
+analyze extremely_skewed;
+insert into extremely_skewed
+ select 42 as id, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ from generate_series(1, 20000);
+update pg_class
+ set reltuples = 2, relpages = pg_relation_size('extremely_skewed') / 8192
+ where relname = 'extremely_skewed';
+
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);
+
+-- The "optimal" case: the hash table fits in memory; we plan for 1
+-- batch, we stick to that number, and peak memory usage stays within
+-- our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- The "good" case: batches required, but we plan the right number; we
+-- plan for some number of batches, and we stick to that number, and
+-- peak memory usage says within our work_mem budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join simple s using (id);
+select count(*) from simple r join simple s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+-- parallel full multi-batch hash join
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- The "bad" case: during execution we need to increase number of
+-- batches; in this case we plan for 1 batch, and increase at least a
+-- couple of times, and peak memory usage stays within our work_mem
+-- budget
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) FROM simple r JOIN bigger_than_it_looks s USING (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '192kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+select count(*) from simple r join bigger_than_it_looks s using (id);
+select original > 1 as initially_multibatch, final > original as increased_batches
+ from hash_join_batches(
+$$
+ select count(*) from simple r join bigger_than_it_looks s using (id);
+$$);
+rollback to settings;
+
+-- The "ugly" case: increasing the number of batches during execution
+-- doesn't help, so stop trying to fit in work_mem and hope for the
+-- best; in this case we plan for 1 batch, increases just once and
+-- then stop increasing because that didn't help at all, so we blow
+-- right through the work_mem budget and hope for the best...
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-oblivious hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = off;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- parallel with parallel-aware hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 1;
+set local work_mem = '128kB';
+set local hash_mem_multiplier = 1.0;
+set local enable_parallel_hash = on;
+explain (costs off)
+ select count(*) from simple r join extremely_skewed s using (id);
+select count(*) from simple r join extremely_skewed s using (id);
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join extremely_skewed s using (id);
+$$);
+rollback to settings;
+
+-- A couple of other hash join tests unrelated to work_mem management.
+
+-- Check that EXPLAIN ANALYZE has data even if the leader doesn't participate
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+set local work_mem = '4MB';
+set local hash_mem_multiplier = 1.0;
+set local parallel_leader_participation = off;
+select * from hash_join_batches(
+$$
+ select count(*) from simple r join simple s using (id);
+$$);
+rollback to settings;
+
+-- Exercise rescans. We'll turn off parallel_leader_participation so
+-- that we can check that instrumentation comes back correctly.
+
+create table join_foo as select generate_series(1, 3) as id, 'xxxxx'::text as t;
+alter table join_foo set (parallel_workers = 0);
+create table join_bar as select generate_series(1, 10000) as id, 'xxxxx'::text as t;
+alter table join_bar set (parallel_workers = 2);
+analyze join_foo, join_bar;
+
+-- multi-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-oblivious
+savepoint settings;
+set enable_parallel_hash = off;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- multi-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '64kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- single-batch with rescan, parallel-aware
+savepoint settings;
+set enable_parallel_hash = on;
+set parallel_leader_participation = off;
+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+set parallel_tuple_cost = 0;
+set max_parallel_workers_per_gather = 2;
+set enable_material = off;
+set enable_mergejoin = off;
+set work_mem = '4MB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select count(*) from join_foo
+ left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
+ on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
+$$);
+rollback to settings;
+
+-- A full outer join where every record is matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s using (id);
+select count(*) from simple r full outer join simple s using (id);
+rollback to settings;
+
+-- A full outer join where every record is not matched.
+
+-- non-parallel
+savepoint settings;
+set local max_parallel_workers_per_gather = 0;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism not possible with parallel-oblivious full hash join
+savepoint settings;
+set enable_parallel_hash = off;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+-- parallelism is possible with parallel-aware full hash join
+savepoint settings;
+set local max_parallel_workers_per_gather = 2;
+explain (costs off)
+ select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
+rollback to settings;
+
+
+-- exercise special code paths for huge tuples (note use of non-strict
+-- expression and left join required to get the detoasted tuple into
+-- the hash table)
+
+-- parallel with parallel-aware hash join (hits ExecParallelHashLoadTuple and
+-- sts_puttuple oversized tuple cases because it's multi-batch)
+savepoint settings;
+set max_parallel_workers_per_gather = 2;
+set enable_parallel_hash = on;
+set work_mem = '128kB';
+set hash_mem_multiplier = 1.0;
+explain (costs off)
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select length(max(s.t))
+from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+select final > 1 as multibatch
+ from hash_join_batches(
+$$
+ select length(max(s.t))
+ from wide left join (select id, coalesce(t, '') || '' as t from wide) s using (id);
+$$);
+rollback to settings;
+
+
+-- Hash join reuses the HOT status bit to indicate match status. This can only
+-- be guaranteed to produce correct results if all the hash join tuple match
+-- bits are reset before reuse. This is done upon loading them into the
+-- hashtable.
+SAVEPOINT settings;
+SET enable_parallel_hash = on;
+SET min_parallel_table_scan_size = 0;
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+CREATE TABLE hjtest_matchbits_t1(id int);
+CREATE TABLE hjtest_matchbits_t2(id int);
+INSERT INTO hjtest_matchbits_t1 VALUES (1);
+INSERT INTO hjtest_matchbits_t2 VALUES (2);
+-- Update should create a HOT tuple. If this status bit isn't cleared, we won't
+-- correctly emit the NULL-extended unmatching tuple in full hash join.
+UPDATE hjtest_matchbits_t2 set id = 2;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id
+ ORDER BY t1.id;
+-- Test serial full hash join.
+-- Resetting parallel_setup_cost should force a serial plan.
+-- Just to be safe, however, set enable_parallel_hash to off, as parallel full
+-- hash joins are only supported with shared hashtables.
+RESET parallel_setup_cost;
+SET enable_parallel_hash = off;
+SELECT * FROM hjtest_matchbits_t1 t1 FULL JOIN hjtest_matchbits_t2 t2 ON t1.id = t2.id;
+ROLLBACK TO settings;
+
+rollback;
+
+-- Verify that hash key expressions reference the correct
+-- nodes. Hashjoin's hashkeys need to reference its outer plan, Hash's
+-- need to reference Hash's outer plan (which is below HashJoin's
+-- inner plan). It's not trivial to verify that the references are
+-- correct (we don't display the hashkeys themselves), but if the
+-- hashkeys contain subplan references, those will be displayed. Force
+-- subplans to appear just about everywhere.
+--
+-- Bug report:
+-- https://www.postgresql.org/message-id/CAPpHfdvGVegF_TKKRiBrSmatJL2dR9uwFCuR%2BteQ_8tEXU8mxg%40mail.gmail.com
+--
+BEGIN;
+SET LOCAL enable_sort = OFF; -- avoid mergejoins
+SET LOCAL from_collapse_limit = 1; -- allows easy changing of join order
+
+CREATE TABLE hjtest_1 (a text, b int, id int, c bool);
+CREATE TABLE hjtest_2 (a bool, id int, b text, c int);
+
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 2, 1, false); -- matches
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 2, false); -- fails id join condition
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 20, 1, false); -- fails < 50
+INSERT INTO hjtest_1(a, b, id, c) VALUES ('text', 1, 1, false); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 2); -- matches
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 3, 'another', 7); -- fails id join condition
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 90); -- fails < 55
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'another', 3); -- fails (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+INSERT INTO hjtest_2(a, id, b, c) VALUES (true, 1, 'text', 1); -- fails hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_1, hjtest_2
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+EXPLAIN (COSTS OFF, VERBOSE)
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+SELECT hjtest_1.a a1, hjtest_2.a a2,hjtest_1.tableoid::regclass t1, hjtest_2.tableoid::regclass t2
+FROM hjtest_2, hjtest_1
+WHERE
+ hjtest_1.id = (SELECT 1 WHERE hjtest_2.id = 1)
+ AND (SELECT hjtest_1.b * 5) = (SELECT hjtest_2.c*5)
+ AND (SELECT hjtest_1.b * 5) < 50
+ AND (SELECT hjtest_2.c * 5) < 55
+ AND hjtest_1.a <> hjtest_2.b;
+
+ROLLBACK;
+
+-- Verify that we behave sanely when the inner hash keys contain parameters
+-- (that is, outer or lateral references). This situation has to defeat
+-- re-use of the inner hash table across rescans.
+begin;
+set local enable_hashjoin = on;
+
+explain (costs off)
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+select i8.q2, ss.* from
+int8_tbl i8,
+lateral (select t1.fivethous, i4.f1 from tenk1 t1 join int4_tbl i4
+ on t1.fivethous = i4.f1+i8.q2 order by 1,2) ss;
+
+rollback;
--
2.39.5 (Apple Git-154)
* Re: Proposal: Adding compression of temporary files
2025-01-04 23:43 Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-22 07:17 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2025-04-25 21:54 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2026-01-13 12:53 ` Re: Proposal: Adding compression of temporary files lakshmi <[email protected]>
2026-01-13 13:34 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2026-01-18 15:50 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2026-01-19 22:40 ` Re: Proposal: Adding compression of temporary files Zsolt Parragi <[email protected]>
2026-01-20 10:51 ` Re: Proposal: Adding compression of temporary files lakshmi <[email protected]>
2026-01-21 06:32 ` Re: Proposal: Adding compression of temporary files lakshmi <[email protected]>
2026-01-23 16:40 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
2026-01-25 11:57 ` Re: Proposal: Adding compression of temporary files Filip Janus <[email protected]>
@ 2026-02-03 17:07 ` lakshmi <[email protected]>
1 sibling, 0 replies; 13+ messages in thread
From: lakshmi @ 2026-02-03 17:07 UTC (permalink / raw)
To: Filip Janus <[email protected]>; +Cc: pgsql-hackers; [email protected]
Hi all,
I’ve applied the latest two patches on current 19devel and wanted to share
some testing results.
The patches apply cleanly, build and install without issues, and the server
starts normally. I verified that the new temp_file_compression GUC works as
expected and accepts the documented values (no, lz4, pglz), while invalid
values are correctly rejected.
For testing, I forced temp file usage by running parallel hash joins with a
small work_mem. I ran the same query with temp_file_compression set to no,
lz4, and pglz. In all cases, temp files were created and used (hash join
spilling to disk), query results were identical, and I did not see any
crashes or read/write errors.
The temp read/write counters were very similar across all three modes. This
seems expected for hash join spills, since they use fixed-size fileset
chunks, so compression doesn’t necessarily reduce the number of temp blocks
written. Execution time was comparable across modes, with no regressions
observed.
I also ran make check, which passed.
Based on this testing, everything looks good to me, and the observed
behavior matches the documentation clarification about workload-dependent
effects of temp file compression.
Thanks for working on this. I’m happy to test further or try additional
workloads if that would be useful.
Best regards,
Lakshmi
On Sun, Jan 25, 2026 at 5:27 PM Filip Janus <[email protected]> wrote:
> Fixed spacing in the patch.
>
> -Filip-
>
>
> On Fri, Jan 23, 2026 at 17:40, Filip Janus <[email protected]> wrote:
>
>> Hi all,
>> Thanks for the feedback and the provided patch.
>> I've addressed your findings and proposals. Lakshmi's documentation patch
>> was incorporated.
>>
>> -Filip-
>>
>>
> >> On Wed, Jan 21, 2026 at 7:30, lakshmi <[email protected]> wrote:
>>
> >>> Hi all,
> >>> While testing the temp file compression patch, I noticed that the new
> >>> temp_file_compression GUC isn't documented yet. I put together a small docs
> >>> patch to add a short description and clarify that the effect of compression
> >>> depends on the workload (for example, hash join spills may not show a visible
> >>> size reduction due to fixed-size chunks).
> >>>
> >>> The patch is attached. Happy to adjust the wording if needed.
> >>> Thanks,
> >>> lakshmi
>>>
>>> On Tue, Jan 20, 2026 at 4:21 PM lakshmi <[email protected]> wrote:
>>>
>>>> Hi Filip,
>>>>
> >>>> I tested both patches on current master using git am -3. They apply
> >>>> cleanly, build fine, and the temp_file_compression GUC works as expected.
> >>>> Query results are unchanged.
> >>>>
> >>>> For the hash join spill test, temp files were created as expected, but the
> >>>> logged sizes were the same for no, lz4, and pglz, which seems consistent with
> >>>> fixed-size fileset chunking. It might be helpful to briefly note this in the
> >>>> documentation to avoid confusion.
> >>>>
> >>>> Thanks for working on this.
> >>>> Best regards,
>>>> lakshmi
>>>>
>>>> On Tue, Jan 20, 2026 at 4:10 AM Zsolt Parragi <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I tried to review the code. It compiled, the test suite passed.
>>>>>
>>>>> I noticed two typos:
>>>>>
>>>>> buffile.c:77 - "Disaled"
>>>>> buffile.c:133 - "mathods"
>>>>>
>>>>> And a few other small findings:
>>>>>
>>>>> buffile.h:35 and buffile.c:63 - same constants defined first as an
>>>>> Enum and then as #defines - code builds properly without the defines.
>>>>>
>>>>> buffile.c:121 - compress_tempfile is defined, set to false at :167,
>>>>> but never used otherwise
>>>>>
>>>>> guc_tables.c:470 - the comment says that pglz isn't supported yet, but
>>>>> we have a value for it, and I see support for it in the code
>>>>>
>>>>> buffile.c:659: (and at other places) if USE_LZ4 is undefined, the
>>>>> codepath doesn't do anything. I think these ifdefs should follow how
>>>>> other compression code works, such as wal compression where there's an
>>>>> #else path with elog(ERROR, ...)
>>>>> Similarly, maybe there should be an explicit TEMP_NONE_COMPRESSION
>>>>> branch that does nothing, and the default branch should be an error?
>>>>>
>>>>> buffile.c:265: If seek isn't supported/limited, shouldn't there be at
>>>>> least an assertion about it in BufFileSeek? And tell isn't mentioned,
>>>>> but it seems to me that tell also doesn't work properly.
>>>>>
>>>>
* Re: Proposal: Adding compression of temporary files
From: Filip Janus @ 2026-05-25 07:23 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: lakshmi <[email protected]>; pgsql-hackers; [email protected]
Hi Tomas,
Thanks for the feedback.
> What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere.
> What am I missing?
It is a #define I introduced in the latest revision of the patch,
in src/backend/storage/file/buffile.c:
#define COMPRESS_BLCKSZ (4 * BLCKSZ) /* 32KB */
The version you benchmarked (from January) used BLCKSZ (8 KB) directly
as the compression unit -- each 8 KB buffer was compressed and written
separately. After your benchmark, I experimented with larger blocks
and found that compressing 32 KB at a time works noticeably better:
the algorithm gets more context per call (better entropy coding),
per-block framing overhead is amortized, and we make 4x fewer
compress/decompress calls.
The motivation was to improve both speed and compression ratio: with
8 KB blocks, the algorithm sees too little data per call to exploit
redundancy effectively, especially for wider rows where repetitive
patterns span more than one page. 32 KB gives substantially better
ratios with fewer calls, without excessive memory overhead (one
extra 32 KB buffer per open compressed BufFile).
For testing I recompiled with three values -- BLCKSZ (8 KB),
4*BLCKSZ (32 KB), and 8*BLCKSZ (64 KB). It is a compile-time
constant, not a GUC.
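
To illustrate the block-size effect described above, here is a minimal Python sketch. It is not the patch's code: zlib from the standard library stands in for lz4/pglz/zstd, the 8 KB BLCKSZ is the assumed default, and make_rows, compress_in_blocks and decompress_blocks are hypothetical helper names. It only shows that compressing each block independently needs 4x fewer calls at 32 KB and lets the compressor see more repeated data per call.

```python
import hashlib
import zlib

BLCKSZ = 8 * 1024  # assumption: default PostgreSQL block size of 8 KB


def make_rows(total_bytes: int, distinct: int = 100) -> bytes:
    """Build 'bigint + md5 text' style rows that repeat every `distinct` keys."""
    rows = []
    size = 0
    i = 0
    while size < total_bytes:
        key = i % distinct
        digest = hashlib.md5(str(key).encode()).hexdigest()
        row = f"{key:020d}{digest}\n".encode()
        rows.append(row)
        size += len(row)
        i += 1
    return b"".join(rows)[:total_bytes]


def compress_in_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Compress every block on its own, so no history is shared across blocks."""
    return [
        zlib.compress(data[off:off + block_size])
        for off in range(0, len(data), block_size)
    ]


def decompress_blocks(blocks: list[bytes]) -> bytes:
    """Reverse compress_in_blocks by decompressing and concatenating."""
    return b"".join(zlib.decompress(block) for block in blocks)


data = make_rows(1024 * 1024)  # 1 MiB of synthetic, repetitive rows
for factor in (1, 4, 8):
    blocks = compress_in_blocks(data, factor * BLCKSZ)
    total = sum(len(block) for block in blocks)
    print(f"{factor * BLCKSZ // 1024:>2} KB blocks: "
          f"{len(blocks):>3} calls, {total} compressed bytes")
```

The absolute numbers say nothing about lz4 or zstd on real temp files; only the direction (fewer calls, smaller total for repetitive data) carries over.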
> I'm still not quite sure what "compression block size" means here,
> and how did you change it.
Same answer -- sorry for not being clear. Your benchmark used the
original 8 KB block size from the January patch. My main results
used the updated patch with 32 KB blocks. The comparison in my first
email was not entirely apples-to-apples -- I should have noted that
more clearly.
That said, the block size accounts for only a modest part of the
difference (e.g. lz4 d=1000 w=8 on HDD: 58% with 8 KB vs 52% with
32 KB). The larger gains come from the storage and memory pressure
differences between our machines.
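
For readers converting between the two ways of quoting results, a small Python sketch follows. It assumes the percentages are wall-clock time relative to the uncompressed run; the function name speedup is hypothetical and not part of the benchmark script.

```python
def speedup(relative_time_pct: float) -> float:
    """Convert run time given as % of the uncompressed baseline to a speedup factor."""
    if relative_time_pct <= 0:
        raise ValueError("relative time must be positive")
    return 100.0 / relative_time_pct


# The lz4, d=1000, w=8, HDD figures quoted above.
for label, pct in [("8 KB blocks", 58), ("32 KB blocks", 52)]:
    print(f"{label}: {pct}% of baseline -> {speedup(pct):.2f}x")
```

Values above 100% come out below 1.0x, which is the slowdown case.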
> I wonder how much this depends on the particular data set (e.g. if
> we generate data differently, how much would it affect the results).
Good question. The d parameter already covers a range of data
redundancy (d=1 is least compressible, d=1000 is most), so the tables
show best and worst cases for the same schema. Real-world workloads
with wider rows, more NULLs, or variable-length fields would likely
compress differently -- I'd expect better ratios in many cases, since
the benchmark data is relatively compact (bigint + md5 text).
> What bothers me a little bit is that systems generally are not under
> such pressure 24/7, but only for some part of a day. But people will
> mostly set the GUC in the config file.
That is a fair point. temp_file_compression can be set at the session
level (SET temp_file_compression = 'lz4'), so an application could
enable it only for known-heavy queries. On our I/O-constrained
machines the worst case for lz4 was ~94% (within noise). Your
results on fast NVMe showed higher overhead -- up to ~135% for lz4
with w=1, where CPU cost dominates and there's no I/O to save. So
for systems with plenty of RAM and fast storage, per-session or
per-query activation may indeed be more appropriate than a global
setting.
No rush on further work -- happy to run more tests in the meantime
if anything comes to mind.
regards
-Filip-
On Tue, May 12, 2026 at 16:14, Tomas Vondra <[email protected]> wrote:
> On 5/11/26 09:09, Filip Janus wrote:
> >
> >
> > Hi Tomas,
> >
> > Thanks for the thorough benchmark and the script -- it was very helpful
> > as a starting point for my testing. I understand the results on
> > your machine were discouraging, and I appreciate the honest assessment.
> >
> > I ran a similar benchmark on different x86_64 hardware to see how the
> > results change under more I/O pressure. The short version: lz4 and
> > zstd show significant speedups once storage or page cache becomes a
> > bottleneck.
> >
>
> I'm glad you didn't just give up and decided to run some more tests.
>
> > Setup
> > -----
> >
> > I used your run-hashjoins.sh as a base, with the same parameters:
> > 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
> > between runs. I also added zstd to the compression methods tested,
> > and tested with a larger compression block size (32 KB instead of
> > the default 8 KB BLCKSZ).
> >
> > Two x86_64 machines:
> >
> > (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
> > rotational HDD (5 disks), io_uring, Fedora 43
> >
> > (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
> > RAM capped to 16 GB via systemd MemoryMax
> >
> > Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
> >
>
> What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am
> I missing?
>
> > Results
> > -------
> >
> > Below are the relative timings (% of uncompressed baseline), directly
> > comparable to your table. Values below 100% mean compression is faster.
> >
> > Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
> >
> >                    pglz                 lz4
> >  rows   rep     1     4     8       1     4     8
> > -------------------------------------------------
> >    10     1   661   688   300     144   148    86
> >    10  1000   460   472   234     119   119    58
> >   100     1   471   303   204     132   135   102
> >   100  1000   378   262   164     107    91    81
> >
> > Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
> >
> >                    pglz                 lz4                zstd
> >  rows   rep     1     4     8       1     4     8       1     4     8
> > ---------------------------------------------------------------------
> >   100     1   200   119    69      91    82    67      80    50    35
> >   100    10   204   101    70      91    64    66      83    44    39
> >   100   100   220   104    72      94    75    69      85    50    34
> >   100  1000   170    92    54      79    58    52      74    42    28
> >
> > Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
> >
> >                    pglz                 lz4                zstd
> >  rows   rep     1     4     8       1     4     8       1     4     8
> > ---------------------------------------------------------------------
> >   100     1   284   103    79      92    81    82      98    59    53
> >   100    10   262    99    77      92    80    85      96    57    50
> >   100   100   221    89    67      80    70    64      85    49    44
> >   100  1000   155    51    42      72    39    39      77    27    29
> >
> > Analysis
> > --------
> >
> > I think the key difference is page cache pressure. Your machine has
> > 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
> > cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
> > portion stays cached and synchronous I/O to storage is limited.
> >
> > On our machines, I/O is a real bottleneck:
> > - Machine A: rotational HDD with 8 concurrent streams
> > - Machine B: SATA SSD but only 16 GB RAM, so the page cache
> > cannot absorb 8 x 12 GB of temp data
> >
> > Under these conditions, reducing the bytes written translates
> > directly into wall-clock savings.
> >
>
> Seems like that. It's not a huge surprise that this matters more on
> systems with memory pressure and slower storage. I should have tested
> that on my machines too.
>
> I was going to question how common such systems are nowadays, when
> people can just spin a VM with plenty of RAM and SSDs. But given the
> current RAM shortage / costs, and relatively slow network storage (even
> if temporary files can use ephemeral disks), maybe it's not all that
> uncommon ...
>
> > Both your results and ours confirm that pglz is simply too slow for
> > this use case. Your benchmark shows 164-688% of baseline time; ours shows
> > 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD)
> > where pglz eventually wins, it never outperforms lz4 or zstd. I
> > would recommend against offering pglz for temp file compression
> > altogether -- it creates a trap for users who might try it expecting
> > reasonable performance.
> >
>
> Right.
>
> > lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
> > HDD) -- barely distinguishable from noise. Under I/O pressure it
> > delivers 39-52% of baseline time (2-2.5x speedup).
> >
> > zstd is the most compelling option: it achieves the best compression
> > ratios (down to 22% of original size on the SATA SSD) and the best
> > speedups (27-28% of baseline = 3.5x faster), with no regression
> > exceeding 98% on x86_64. I would recommend zstd as the primary
> > option to document, with lz4 as a lighter-weight alternative.
> >
>
> Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends
> on the particular data set (e.g. if we generate data differently, how
> much would it affect the results).
>
> > Compression block size
> > ----------------------
> >
> > I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
> > 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
> > on HDD:
> >
> > COMPRESS_BLCKSZ      time (% of uncompressed)   compressed bytes
> > ------------------------------------------------------------------
> > 8 KB (BLCKSZ)                  58%                  7.47 GB
> > 32 KB (4*BLCKSZ)               52%                  7.22 GB
> > 64 KB (8*BLCKSZ)               56%                  7.14 GB
> >
> > The 8K-to-32K improvement comes from fewer compress/decompress calls
> > (4x fewer), less per-block header overhead, and better compression
> > ratios. Going to 64K shows diminishing returns and slightly worse
> > timings, possibly due to increased cache pressure.
> >
>
> I'm still not quite sure what "compression block size" means here, and
> how did you change it.
>
> > Conclusion
> > ----------
> >
> > I think the data shows that the benefit of temporary file compression
> > depends heavily on the I/O characteristics of the system. On machines
> > with fast storage and ample page cache, compression is neutral -- it
> > means negligible overhead, which is a good outcome on its own. On
> > systems with real I/O pressure -- slower storage, limited RAM, or
> > concurrent workloads competing for page cache -- compression delivers
> > substantial speedups.
> >
>
> True.
>
> > The feature does not need to be enabled by default. Compression is
> > controlled by the temp_file_compression GUC, which defaults to "none".
> > That means there is no risk of regression for existing users. But for
> > administrators who know their systems are I/O-constrained -- spinning
> > disks, limited memory, heavy concurrent spilling -- having the option
> > to enable lz4 or zstd can make a real difference. The data above shows
> > up to 3.5x speedup in those scenarios, with no
> > downside when the setting is left at its default.
> >
> Yes, having it as opt-in for systems where it matters helps.
>
> What bothers me a little bit is that systems generally are not under
> such pressure 24/7, but only for some part of a day. But people will
> mostly set the GUC in the config file. I don't have a better solution to
> this, though.
>
>
> FYI I won't be able to do much work on this until ~June.
>
>
> regards
>
> --
> Tomas Vondra
>
>