Use log_newpage_range in HASH index build

public inbox for [email protected]  
help / color / mirror / Atom feed

Use log_newpage_range in HASH index build
6+ messages / 3 participants
[nested] [flat]

* Use log_newpage_range in HASH index build
@ 2025-10-23 19:21  Kirill Reshke <[email protected]>
  0 siblings, 2 replies; 6+ messages in thread

From: Kirill Reshke @ 2025-10-23 19:21 UTC (permalink / raw)
  To: pgsql-hackers

There exists an optimization to index creation process, when we omit
to write any WAL
for index build. It is currently supported in B Tree, GIN, GiST, spg indexes.
It works because we do not need to recover anything if index creation
fails, because if was not used by any query. So, the index can be
built on-disk, and then, just before making the index alive, we can
simply log all pages to WAL.

Hash index currently lacks this optimization.
PFA implementation.

During my testing, I checked the amount of WAL generated by index
build before and after patch applied. My script was something like:

select pg_current_wal_insert_lsn();

create index on t using hash (i);

select pg_current_wal_insert_lsn();

select pg_lsn_wal_diff(lsn1, lsn2);

Resulting numbers depend on index size, but I got 2.5-3.5 times less
WAL with this patch and 8 times less WAL with this patch +
wal_compression=on.
Index creation time, however, did not change much...

About implementation:
These are many types of record that can be generated during index build.
I know for sure these are possible (double-checked using pg_waldump):

SPLIT_COMPLETE
INSERT
SPLIT_ALLOCATE_PAGE
SPLIT_PAGE
ADD_OVFL_PAGE
SQUEEZE_PAGE
INIT_META_PAGE
INIT_BITMAP_PAGE


Looks like  SPLIT_COMPLETE and VACUUM_ONE_PAGE are never generated
during index build. I'm not sure about  MOVE_PAGE_CONTENTS.

So, implementation is simply pass isbuild flag everywhere something is
wal-logged. Looks like it is less invasive than alternatives.

-- 
Best regards,
Kirill Reshke


Attachments:

  [application/octet-stream] v2-0001-Use-log_newpage_range-in-HASH-index-build.patch (16.1K, 2-v2-0001-Use-log_newpage_range-in-HASH-index-build.patch)
  download | inline diff:
From f9f4ecc74d91dd5ea613ed461280a49811820458 Mon Sep 17 00:00:00 2001
From: reshke <[email protected]>
Date: Thu, 23 Oct 2025 17:53:30 +0000
Subject: [PATCH v2] Use log_newpage_range in HASH index build

---
 src/backend/access/hash/hash.c       | 25 ++++++++++++++--------
 src/backend/access/hash/hashinsert.c |  8 +++----
 src/backend/access/hash/hashovfl.c   | 19 +++++++++--------
 src/backend/access/hash/hashpage.c   | 32 ++++++++++++++++------------
 src/backend/access/hash/hashsort.c   |  4 +++-
 src/include/access/hash.h            | 14 ++++++------
 6 files changed, 58 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..4fff0634e6d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -141,7 +141,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM, true);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -190,6 +190,11 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		_h_spooldestroy(buildstate.spool);
 	}
 
+	if (RelationNeedsWAL(index))
+		log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+
 	/*
 	 * Return statistics
 	 */
@@ -207,7 +212,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_init(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM, false);
 }
 
 /*
@@ -241,7 +246,7 @@ hashbuildCallback(Relation index,
 		itup = index_form_tuple(RelationGetDescr(index),
 								index_values, index_isnull);
 		itup->t_tid = *tid;
-		_hash_doinsert(index, itup, buildstate->heapRel, false);
+		_hash_doinsert(index, itup, buildstate->heapRel, false, true);
 		pfree(itup);
 	}
 
@@ -275,7 +280,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 	itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
 	itup->t_tid = *ht_ctid;
 
-	_hash_doinsert(rel, itup, heapRel, false);
+	_hash_doinsert(rel, itup, heapRel, false, false);
 
 	pfree(itup);
 
@@ -556,7 +561,7 @@ loop_top:
 						  cachedmetap->hashm_highmask,
 						  cachedmetap->hashm_lowmask, &tuples_removed,
 						  &num_index_tuples, split_cleanup,
-						  callback, callback_state);
+						  callback, callback_state, false);
 
 		_hash_dropbuf(rel, bucket_buf);
 
@@ -692,7 +697,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 				  uint32 maxbucket, uint32 highmask, uint32 lowmask,
 				  double *tuples_removed, double *num_index_tuples,
 				  bool split_cleanup,
-				  IndexBulkDeleteCallback callback, void *callback_state)
+				  IndexBulkDeleteCallback callback, void *callback_state,
+				  bool isbuild)
 {
 	BlockNumber blkno;
 	Buffer		buf;
@@ -719,6 +725,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		bool		retain_pin = false;
 		bool		clear_dead_marking = false;
 
+		/* XXX: what if isbuild = true ? */
 		vacuum_delay_point(false);
 
 		page = BufferGetPage(buf);
@@ -817,7 +824,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
-			if (RelationNeedsWAL(rel))
+			if (RelationNeedsWAL(rel) && !isbuild)
 			{
 				xl_hash_delete xlrec;
 				XLogRecPtr	recptr;
@@ -903,7 +910,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		MarkBufferDirty(bucket_buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (RelationNeedsWAL(rel) && !isbuild)
 		{
 			XLogRecPtr	recptr;
 
@@ -924,7 +931,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	 */
 	if (bucket_dirty && IsBufferCleanupOK(bucket_buf))
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
-							bstrategy);
+							bstrategy, isbuild);
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 10de1580dc2..3692cebb0ac 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -35,7 +35,7 @@ static void _hash_vacuum_one_page(Relation rel, Relation hrel,
  * order.
  */
 void
-_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted)
+_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted, bool isbuild)
 {
 	Buffer		buf = InvalidBuffer;
 	Buffer		bucket_buf;
@@ -178,7 +178,7 @@ restart_insert:
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 			/* chain to a new overflow page */
-			buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf));
+			buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf), isbuild);
 			page = BufferGetPage(buf);
 
 			/* should fit now, given test above */
@@ -213,7 +213,7 @@ restart_insert:
 	MarkBufferDirty(metabuf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) && !isbuild)
 	{
 		xl_hash_insert xlrec;
 		XLogRecPtr	recptr;
@@ -249,7 +249,7 @@ restart_insert:
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
-		_hash_expandtable(rel, metabuf);
+		_hash_expandtable(rel, metabuf, isbuild);
 
 	/* Finally drop our pin on the metapage */
 	_hash_dropbuf(rel, metabuf);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 4f5fd3b2837..401447ada3e 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -109,7 +109,7 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  * pages might have been added to the bucket chain in between.
  */
 Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, bool isbuild)
 {
 	Buffer		ovflbuf;
 	Page		page;
@@ -379,7 +379,7 @@ found:
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) && !isbuild)
 	{
 		XLogRecPtr	recptr;
 		xl_hash_add_ovfl_page xlrec;
@@ -490,7 +490,7 @@ BlockNumber
 _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
 				   Size *tups_size, uint16 nitups,
-				   BufferAccessStrategy bstrategy)
+				   BufferAccessStrategy bstrategy, bool isbuild)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
@@ -575,7 +575,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	/* This operation needs to log multiple tuples, prepare WAL for that */
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) && !isbuild)
 		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
 
 	START_CRIT_SECTION();
@@ -642,7 +642,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	}
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) && !isbuild)
 	{
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
@@ -843,7 +843,8 @@ _hash_squeezebucket(Relation rel,
 					Bucket bucket,
 					BlockNumber bucket_blkno,
 					Buffer bucket_buf,
-					BufferAccessStrategy bstrategy)
+					BufferAccessStrategy bstrategy,
+					bool isbuild)
 {
 	BlockNumber wblkno;
 	BlockNumber rblkno;
@@ -965,7 +966,7 @@ readpage:
 					 * This operation needs to log multiple tuples, prepare
 					 * WAL for that.
 					 */
-					if (RelationNeedsWAL(rel))
+					if (RelationNeedsWAL(rel) && !isbuild)
 						XLogEnsureRecordSpace(0, 3 + nitups);
 
 					START_CRIT_SECTION();
@@ -984,7 +985,7 @@ readpage:
 					MarkBufferDirty(rbuf);
 
 					/* XLOG stuff */
-					if (RelationNeedsWAL(rel))
+					if (RelationNeedsWAL(rel) && !isbuild)
 					{
 						XLogRecPtr	recptr;
 						xl_hash_move_page_contents xlrec;
@@ -1094,7 +1095,7 @@ readpage:
 
 		/* free this overflow page (releases rbuf) */
 		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
-						   tups_size, nitups, bstrategy);
+						   tups_size, nitups, bstrategy, isbuild);
 
 		/* be tidy */
 		for (i = 0; i < nitups; i++)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index b8e5bd005e5..f5e310ab2b3 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -45,7 +45,7 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 							  Buffer nbuf,
 							  HTAB *htab,
 							  uint32 maxbucket,
-							  uint32 highmask, uint32 lowmask);
+							  uint32 highmask, uint32 lowmask, bool isbuild);
 static void log_split_page(Relation rel, Buffer buf);
 
 
@@ -324,7 +324,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum, bool skip_wal)
 {
 	Buffer		metabuf;
 	Buffer		buf;
@@ -349,7 +349,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * init fork.  Init forks for unlogged relations always need to be WAL
 	 * logged.
 	 */
-	use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM;
+	use_wal = !skip_wal && (RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM);
 
 	/*
 	 * Determine the target fill factor (in tuples per bucket) for this index.
@@ -611,7 +611,7 @@ _hash_pageinit(Page page, Size size)
  * The buffer is returned in the same state.
  */
 void
-_hash_expandtable(Relation rel, Buffer metabuf)
+_hash_expandtable(Relation rel, Buffer metabuf, bool isbuild)
 {
 	HashMetaPage metap;
 	Bucket		old_bucket;
@@ -759,7 +759,7 @@ restart_expand:
 
 		hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
 						  maxbucket, highmask, lowmask, NULL, NULL, true,
-						  NULL, NULL);
+						  NULL, NULL, isbuild);
 
 		_hash_dropbuf(rel, buf_oblkno);
 
@@ -897,7 +897,7 @@ restart_expand:
 	MarkBufferDirty(buf_nblkno);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) && !isbuild)
 	{
 		xl_hash_split_allocate_page xlrec;
 		XLogRecPtr	recptr;
@@ -948,7 +948,7 @@ restart_expand:
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
 					  buf_oblkno, buf_nblkno, NULL,
-					  maxbucket, highmask, lowmask);
+					  maxbucket, highmask, lowmask, isbuild);
 
 	/* all done, now release the pins on primary buckets. */
 	_hash_dropbuf(rel, buf_oblkno);
@@ -1079,7 +1079,8 @@ _hash_splitbucket(Relation rel,
 				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
-				  uint32 lowmask)
+				  uint32 lowmask,
+				  bool isbuild)
 {
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
@@ -1186,7 +1187,8 @@ _hash_splitbucket(Relation rel,
 					_hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(nbuf);
 					/* log the split operation before releasing the lock */
-					log_split_page(rel, nbuf);
+					if (RelationNeedsWAL(rel) && !isbuild)
+						log_split_page(rel, nbuf);
 
 					END_CRIT_SECTION();
 
@@ -1200,7 +1202,7 @@ _hash_splitbucket(Relation rel,
 					all_tups_size = 0;
 
 					/* chain to a new overflow page */
-					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf));
+					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf), isbuild);
 					npage = BufferGetPage(nbuf);
 					nopaque = HashPageGetOpaque(npage);
 				}
@@ -1237,7 +1239,9 @@ _hash_splitbucket(Relation rel,
 			_hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
 			MarkBufferDirty(nbuf);
 			/* log the split operation before releasing the lock */
-			log_split_page(rel, nbuf);
+
+			if (RelationNeedsWAL(rel) && !isbuild)
+				log_split_page(rel, nbuf);
 
 			END_CRIT_SECTION();
 
@@ -1294,7 +1298,7 @@ _hash_splitbucket(Relation rel,
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
 
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) && !isbuild)
 	{
 		XLogRecPtr	recptr;
 		xl_hash_split_complete xlrec;
@@ -1331,7 +1335,7 @@ _hash_splitbucket(Relation rel,
 		hashbucketcleanup(rel, obucket, bucket_obuf,
 						  BufferGetBlockNumber(bucket_obuf), NULL,
 						  maxbucket, highmask, lowmask, NULL, NULL, true,
-						  NULL, NULL);
+						  NULL, NULL, isbuild);
 	}
 	else
 	{
@@ -1455,7 +1459,7 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 
 	_hash_splitbucket(rel, metabuf, obucket,
 					  nbucket, obuf, bucket_nbuf, tidhtab,
-					  maxbucket, highmask, lowmask);
+					  maxbucket, highmask, lowmask, false);
 
 	_hash_dropbuf(rel, bucket_nbuf);
 	hash_destroy(tidhtab);
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 6e8c0e68a92..870396841de 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -61,6 +61,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 {
 	HSpool	   *hspool = (HSpool *) palloc0(sizeof(HSpool));
 
+	elog(WARNING, "spool init");
+
 	hspool->index = index;
 
 	/*
@@ -146,7 +148,7 @@ _h_indexbuild(HSpool *hspool, Relation heapRel)
 #endif
 
 		/* the tuples are sorted by hashkey, so pass 'sorted' as true */
-		_hash_doinsert(hspool->index, itup, heapRel, true);
+		_hash_doinsert(hspool->index, itup, heapRel, true, true);
 
 		/* allow insertion phase to be interrupted, and track progress */
 		CHECK_FOR_INTERRUPTS();
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 073ad29b19b..f4afbc78f38 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -394,7 +394,7 @@ extern StrategyNumber hashtranslatecmptype(CompareType cmptype, Oid opfamily);
 
 /* hashinsert.c */
 extern void _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel,
-						   bool sorted);
+						   bool sorted, bool isbuild);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 								   Size itemsize, IndexTuple itup,
 								   bool appendtup);
@@ -402,15 +402,15 @@ extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
 								OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, bool isbuild);
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 									  Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
-									  Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
+									  Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy, bool isbuild);
 extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 								Bucket bucket, BlockNumber bucket_blkno,
 								Buffer bucket_buf,
-								BufferAccessStrategy bstrategy);
+								BufferAccessStrategy bstrategy, bool isbuild);
 extern uint32 _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno);
 
 /* hashpage.c */
@@ -435,11 +435,11 @@ extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
 extern uint32 _hash_init(Relation rel, double num_tuples,
-						 ForkNumber forkNum);
+						 ForkNumber forkNum, bool skip_wal);
 extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
 								  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
-extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_expandtable(Relation rel, Buffer metabuf, bool isbuild);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
 							   Bucket obucket, uint32 maxbucket, uint32 highmask,
 							   uint32 lowmask);
@@ -485,6 +485,6 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
 							  uint32 maxbucket, uint32 highmask, uint32 lowmask,
 							  double *tuples_removed, double *num_index_tuples,
 							  bool split_cleanup,
-							  IndexBulkDeleteCallback callback, void *callback_state);
+							  IndexBulkDeleteCallback callback, void *callback_state, bool isbuild);
 
 #endif							/* HASH_H */
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Use log_newpage_range in HASH index build
@ 2025-12-23 11:53  lakshmi <[email protected]>
  parent: Kirill Reshke <[email protected]>
  1 sibling, 1 reply; 6+ messages in thread

From: lakshmi @ 2025-12-23 11:53 UTC (permalink / raw)
  To: Kirill Reshke <[email protected]>; +Cc: pgsql-hackers

Hi Kirill,

I tested your patch on the current master and confirmed the WAL reduction
during HASH index build.

While testing, I noticed a possible small follow-up improvement in HASH
overflow handling. Currently, any free overflow page may be reused, which
can scatter overflow chains and hurt cache locality. Reusing recently freed
overflow pages first could help, without changing WAL behavior or on-disk
format.

I would like to work on this as a follow-up enhancement and would welcome
any suggestions.

Best regards,
Lakshmi

On Tue, Dec 23, 2025 at 2:31 PM Kirill Reshke <[email protected]>
wrote:

> There exists an optimization to index creation process, when we omit
> to write any WAL
> for index build. It is currently supported in B Tree, GIN, GiST, spg
> indexes.
> It works because we do not need to recover anything if index creation
> fails, because if was not used by any query. So, the index can be
> built on-disk, and then, just before making the index alive, we can
> simply log all pages to WAL.
>
> Hash index currently lacks this optimization.
> PFA implementation.
>
> During my testing, I checked the amount of WAL generated by index
> build before and after patch applied. My script was something like:
>
> select pg_current_wal_insert_lsn();
>
> create index on t using hash (i);
>
> select pg_current_wal_insert_lsn();
>
> select pg_lsn_wal_diff(lsn1, lsn2);
>
> Resulting numbers depend on index size, but I got 2.5-3.5 times less
> WAL with this patch and 8 times less WAL with this patch +
> wal_compression=on.
> Index creation time, however, did not change much...
>
> About implementation:
> These are many types of record that can be generated during index build.
> I know for sure these are possible (double-checked using pg_waldump):
>
> SPLIT_COMPLETE
> INSERT
> SPLIT_ALLOCATE_PAGE
> SPLIT_PAGE
> ADD_OVFL_PAGE
> SQUEEZE_PAGE
> INIT_META_PAGE
> INIT_BITMAP_PAGE
>
>
> Looks like  SPLIT_COMPLETE and VACUUM_ONE_PAGE are never generated
> during index build. I'm not sure about  MOVE_PAGE_CONTENTS.
>
> So, implementation is simply pass isbuild flag everywhere something is
> wal-logged. Looks like it is less invasive than alternatives.
>
> --
> Best regards,
> Kirill Reshke
>


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Use log_newpage_range in HASH index build
@ 2026-01-05 08:50  lakshmi <[email protected]>
  parent: lakshmi <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: lakshmi @ 2026-01-05 08:50 UTC (permalink / raw)
  To: Kirill Reshke <[email protected]>; +Cc: pgsql-hackers

Hi Kirill,
Following up on my earlier note, I implemented the proposed HASH overflow
page reuse enhancement.Recently freed overflow pages are recorded in
_hash_freeovflpage( ),and _hash_addovflpage( ) now prefers reusing those
pages during allocation before falling back to the bitmap scan.

The change is backend-local and allocation-only,with no WAL or on-disk
format changes.I verified correctness using index build/drop and VACUUM
cycles,and confirmed WAL neutrality by comparing WAL generated during HASH
index builds with and without this change (no  observable difference beyond
normal noise).

The patch is attached for review.Feedback is welcome.

Best regards,
Lakshmi

On Tue, Dec 23, 2025 at 5:23 PM lakshmi <[email protected]> wrote:

> Hi Kirill,
>
> I tested your patch on the current master and confirmed the WAL reduction
> during HASH index build.
>
> While testing, I noticed a possible small follow-up improvement in HASH
> overflow handling. Currently, any free overflow page may be reused, which
> can scatter overflow chains and hurt cache locality. Reusing recently freed
> overflow pages first could help, without changing WAL behavior or on-disk
> format.
>
> I would like to work on this as a follow-up enhancement and would welcome
> any suggestions.
>
> Best regards,
> Lakshmi
>
> On Tue, Dec 23, 2025 at 2:31 PM Kirill Reshke <[email protected]>
> wrote:
>
>> There exists an optimization to index creation process, when we omit
>> to write any WAL
>> for index build. It is currently supported in B Tree, GIN, GiST, spg
>> indexes.
>> It works because we do not need to recover anything if index creation
>> fails, because if was not used by any query. So, the index can be
>> built on-disk, and then, just before making the index alive, we can
>> simply log all pages to WAL.
>>
>> Hash index currently lacks this optimization.
>> PFA implementation.
>>
>> During my testing, I checked the amount of WAL generated by index
>> build before and after patch applied. My script was something like:
>>
>> select pg_current_wal_insert_lsn();
>>
>> create index on t using hash (i);
>>
>> select pg_current_wal_insert_lsn();
>>
>> select pg_lsn_wal_diff(lsn1, lsn2);
>>
>> Resulting numbers depend on index size, but I got 2.5-3.5 times less
>> WAL with this patch and 8 times less WAL with this patch +
>> wal_compression=on.
>> Index creation time, however, did not change much...
>>
>> About implementation:
>> These are many types of record that can be generated during index build.
>> I know for sure these are possible (double-checked using pg_waldump):
>>
>> SPLIT_COMPLETE
>> INSERT
>> SPLIT_ALLOCATE_PAGE
>> SPLIT_PAGE
>> ADD_OVFL_PAGE
>> SQUEEZE_PAGE
>> INIT_META_PAGE
>> INIT_BITMAP_PAGE
>>
>>
>> Looks like  SPLIT_COMPLETE and VACUUM_ONE_PAGE are never generated
>> during index build. I'm not sure about  MOVE_PAGE_CONTENTS.
>>
>> So, implementation is simply pass isbuild flag everywhere something is
>> wal-logged. Looks like it is less invasive than alternatives.
>>
>> --
>> Best regards,
>> Kirill Reshke
>>
>


Attachments:

  [text/x-patch] 0001-hash-reuse-recently-freed-overflow-pages.patch (3.4K, 3-0001-hash-reuse-recently-freed-overflow-pages.patch)
  download | inline diff:
From 34d4d99baa5377a4dabb8cf9c4bcd0e9bb2c6376 Mon Sep 17 00:00:00 2001
From: Lakshmi <[email protected]>
Date: Mon, 5 Jan 2026 10:38:23 +0530
Subject: [PATCH] hash: reuse recently freed overflow pages

---
 src/backend/access/hash/hashovfl.c | 77 ++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3277da19840..c66f8adc056 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -22,6 +22,63 @@
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
+/*-------------------------------------------------------------------------
+ * Backend-local reuse cache for recently freed overflow pages.
+ * Best-effort hint only: no correctness or WAL dependency.
+ *-------------------------------------------------------------------------
+ */
+#define HASH_OVFL_REUSE_CACHE_SIZE 32
+
+typedef struct HashOvflReuseCache
+{
+	BlockNumber blocks[HASH_OVFL_REUSE_CACHE_SIZE];
+	int			head;
+	int			count;
+}			HashOvflReuseCache;
+
+static HashOvflReuseCache ovfl_reuse_cache;
+
+/* Record a freed overflow page */
+static void
+hash_record_freed_ovflpage(BlockNumber blkno)
+{
+	ovfl_reuse_cache.blocks[ovfl_reuse_cache.head] = blkno;
+	ovfl_reuse_cache.head =
+		(ovfl_reuse_cache.head + 1) % HASH_OVFL_REUSE_CACHE_SIZE;
+
+	if (ovfl_reuse_cache.count < HASH_OVFL_REUSE_CACHE_SIZE)
+		ovfl_reuse_cache.count++;
+}
+
+/* Try to reuse a recently freed overflow page */
+static bool
+hash_try_reuse_cached_ovflpage(Relation rel, Buffer metabuf,
+							   BlockNumber *reuse_blkno)
+{
+	HashMetaPage metap = HashPageGetMeta(BufferGetPage(metabuf));
+	int			i;
+
+	for (i = 0; i < ovfl_reuse_cache.count; i++)
+	{
+		int			idx =
+			(ovfl_reuse_cache.head - 1 - i + HASH_OVFL_REUSE_CACHE_SIZE)
+			% HASH_OVFL_REUSE_CACHE_SIZE;
+
+		BlockNumber blkno = ovfl_reuse_cache.blocks[idx];
+		uint32		bitno = _hash_ovflblkno_to_bitno(metap, blkno);
+
+		/* Bitmap is authoritative */
+		if (_hash_isbitset(metap->hashm_mapp, bitno))
+		{
+			_hash_clrbit(metap->hashm_mapp, bitno);
+			*reuse_blkno = blkno;
+			return true;
+		}
+	}
+
+	return false;
+}
+
 
 
 static uint32 _hash_firstfreebit(uint32 map);
@@ -132,6 +189,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, boo
 	uint32		i,
 				j;
 	bool		page_found = false;
+	BlockNumber reuse_blkno;
+
+
+
 
 	/*
 	 * Write-lock the tail page.  Here, we need to maintain locking order such
@@ -186,6 +247,16 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, boo
 	_hash_checkpage(rel, metabuf, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
+	/*
+	 * NEW: Try reuse cache before scanning bitmap
+	 */
+	if (hash_try_reuse_cached_ovflpage(rel, metabuf, &reuse_blkno))
+	{
+		ovflbuf = _hash_getinitbuf(rel, reuse_blkno);
+		page_found = true;
+		goto found;
+	}
+
 	/* start search at hashm_firstfree */
 	orig_firstfree = metap->hashm_firstfree;
 	first_page = orig_firstfree >> BMPG_SHIFT(metap);
@@ -633,6 +704,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	CLRBIT(freep, bitmapbit);
 	MarkBufferDirty(mapbuf);
 
+	/*
+	 * NEW: Remember this overflow page for reuse
+	 */
+
+	hash_record_freed_ovflpage(ovflblkno);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
-- 
2.39.5



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Use log_newpage_range in HASH index build
@ 2026-04-26 17:09  Robert Haas <[email protected]>
  parent: Kirill Reshke <[email protected]>
  1 sibling, 0 replies; 6+ messages in thread

From: Robert Haas @ 2026-04-26 17:09 UTC (permalink / raw)
  To: Kirill Reshke <[email protected]>; +Cc: pgsql-hackers

On Thu, Oct 23, 2025 at 3:21 PM Kirill Reshke <[email protected]> wrote:
> So, implementation is simply pass isbuild flag everywhere something is
> wal-logged. Looks like it is less invasive than alternatives.

I think that in order to be seriously considered, this patch will need
more than zero words of comments and more than a one-line commit
message.

The XXX comment would need addressing, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Use log_newpage_range in HASH index build
@ 2026-04-26 17:12  Robert Haas <[email protected]>
  parent: lakshmi <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Robert Haas @ 2026-04-26 17:12 UTC (permalink / raw)
  To: lakshmi <[email protected]>; +Cc: Kirill Reshke <[email protected]>; pgsql-hackers

On Mon, Jan 5, 2026 at 3:48 AM lakshmi <[email protected]> wrote:
> Following up on my earlier note, I implemented the proposed HASH overflow page reuse enhancement.Recently freed overflow pages are recorded in
> _hash_freeovflpage( ),and _hash_addovflpage( ) now prefers reusing those pages during allocation before falling back to the bitmap scan.
>
> The change is backend-local and allocation-only,with no WAL or on-disk format changes.I verified correctness using index build/drop and VACUUM cycles,and confirmed WAL neutrality by comparing WAL generated during HASH index builds with and without this change (no  observable difference beyond normal noise).
>
> The patch is attached for review.Feedback is welcome.

It is probably best to start a separate email thread specifically
about this patch, because it's doing something quite different from
the original patch. I suggest presenting some specific performance
results, as it's not very clear how much benefit this might have. It
might be good to test a favorable scenario and also an unfavorable
scenario, describing and showing results for each.

-- 
Robert Haas
EDB: http://www.enterprisedb.com





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Use log_newpage_range in HASH index build
@ 2026-04-27 17:10  lakshmi <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: lakshmi @ 2026-04-27 17:10 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Kirill Reshke <[email protected]>; pgsql-hackers

Hi,

> It is probably best to start a separate email thread specifically
> about this patch, because it's doing something quite different from
> the original patch. I suggest presenting some specific performance
> results, as it's not very clear how much benefit this might have. It
> might be good to test a favorable scenario and also an unfavorable
> scenario, describing and showing results for each.
>
>


> Thanks for the suggestion. I’ll start a separate thread for this patch
> with performance results.
>
> Regards,
> Lakshmi
>


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-04-27 17:10 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-10-23 19:21 Use log_newpage_range in HASH index build Kirill Reshke <[email protected]>
2025-12-23 11:53 ` lakshmi <[email protected]>
2026-01-05 08:50   ` lakshmi <[email protected]>
2026-04-26 17:12     ` Robert Haas <[email protected]>
2026-04-27 17:10       ` lakshmi <[email protected]>
2026-04-26 17:09 ` Robert Haas <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox