public inbox for [email protected]help / color / mirror / Atom feed
Use log_newpage_range in HASH index build 6+ messages / 3 participants [nested] [flat]
* Use log_newpage_range in HASH index build @ 2025-10-23 19:21 Kirill Reshke <[email protected]> 0 siblings, 2 replies; 6+ messages in thread From: Kirill Reshke @ 2025-10-23 19:21 UTC (permalink / raw) To: pgsql-hackers There exists an optimization to index creation process, when we omit to write any WAL for index build. It is currently supported in B Tree, GIN, GiST, spg indexes. It works because we do not need to recover anything if index creation fails, because if was not used by any query. So, the index can be built on-disk, and then, just before making the index alive, we can simply log all pages to WAL. Hash index currently lacks this optimization. PFA implementation. During my testing, I checked the amount of WAL generated by index build before and after patch applied. My script was something like: select pg_current_wal_insert_lsn(); create index on t using hash (i); select pg_current_wal_insert_lsn(); select pg_lsn_wal_diff(lsn1, lsn2); Resulting numbers depend on index size, but I got 2.5-3.5 times less WAL with this patch and 8 times less WAL with this patch + wal_compression=on. Index creation time, however, did not change much... About implementation: These are many types of record that can be generated during index build. I know for sure these are possible (double-checked using pg_waldump): SPLIT_COMPLETE INSERT SPLIT_ALLOCATE_PAGE SPLIT_PAGE ADD_OVFL_PAGE SQUEEZE_PAGE INIT_META_PAGE INIT_BITMAP_PAGE Looks like SPLIT_COMPLETE and VACUUM_ONE_PAGE are never generated during index build. I'm not sure about MOVE_PAGE_CONTENTS. So, implementation is simply pass isbuild flag everywhere something is wal-logged. Looks like it is less invasive than alternatives. -- Best regards, Kirill Reshke Attachments: [application/octet-stream] v2-0001-Use-log_newpage_range-in-HASH-index-build.patch (16.1K, 2-v2-0001-Use-log_newpage_range-in-HASH-index-build.patch) download | inline diff: From f9f4ecc74d91dd5ea613ed461280a49811820458 Mon Sep 17 00:00:00 2001 From: reshke <[email protected]> Date: Thu, 23 Oct 2025 17:53:30 +0000 Subject: [PATCH v2] Use log_newpage_range in HASH index build --- src/backend/access/hash/hash.c | 25 ++++++++++++++-------- src/backend/access/hash/hashinsert.c | 8 +++---- src/backend/access/hash/hashovfl.c | 19 +++++++++-------- src/backend/access/hash/hashpage.c | 32 ++++++++++++++++------------ src/backend/access/hash/hashsort.c | 4 +++- src/include/access/hash.h | 14 ++++++------ 6 files changed, 58 insertions(+), 44 deletions(-) diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 53061c819fb..4fff0634e6d 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -141,7 +141,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo) estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac); /* Initialize the hash index metadata page and initial buckets */ - num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM); + num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM, true); /* * If we just insert the tuples into the index in scan order, then @@ -190,6 +190,11 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo) _h_spooldestroy(buildstate.spool); } + if (RelationNeedsWAL(index)) + log_newpage_range(index, MAIN_FORKNUM, + 0, RelationGetNumberOfBlocks(index), + true); + /* * Return statistics */ @@ -207,7 +212,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo) void hashbuildempty(Relation index) { - _hash_init(index, 0, INIT_FORKNUM); + _hash_init(index, 0, INIT_FORKNUM, false); } /* @@ -241,7 +246,7 @@ hashbuildCallback(Relation index, itup = index_form_tuple(RelationGetDescr(index), index_values, index_isnull); itup->t_tid = *tid; - _hash_doinsert(index, itup, buildstate->heapRel, false); + _hash_doinsert(index, itup, buildstate->heapRel, false, true); pfree(itup); } @@ -275,7 +280,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull, itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull); itup->t_tid = *ht_ctid; - _hash_doinsert(rel, itup, heapRel, false); + _hash_doinsert(rel, itup, heapRel, false, false); pfree(itup); @@ -556,7 +561,7 @@ loop_top: cachedmetap->hashm_highmask, cachedmetap->hashm_lowmask, &tuples_removed, &num_index_tuples, split_cleanup, - callback, callback_state); + callback, callback_state, false); _hash_dropbuf(rel, bucket_buf); @@ -692,7 +697,8 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, uint32 maxbucket, uint32 highmask, uint32 lowmask, double *tuples_removed, double *num_index_tuples, bool split_cleanup, - IndexBulkDeleteCallback callback, void *callback_state) + IndexBulkDeleteCallback callback, void *callback_state, + bool isbuild) { BlockNumber blkno; Buffer buf; @@ -719,6 +725,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, bool retain_pin = false; bool clear_dead_marking = false; + /* XXX: what if isbuild = true ? */ vacuum_delay_point(false); page = BufferGetPage(buf); @@ -817,7 +824,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { xl_hash_delete xlrec; XLogRecPtr recptr; @@ -903,7 +910,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, MarkBufferDirty(bucket_buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { XLogRecPtr recptr; @@ -924,7 +931,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, */ if (bucket_dirty && IsBufferCleanupOK(bucket_buf)) _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf, - bstrategy); + bstrategy, isbuild); else LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK); } diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 10de1580dc2..3692cebb0ac 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -35,7 +35,7 @@ static void _hash_vacuum_one_page(Relation rel, Relation hrel, * order. */ void -_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted) +_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, bool sorted, bool isbuild) { Buffer buf = InvalidBuffer; Buffer bucket_buf; @@ -178,7 +178,7 @@ restart_insert: LockBuffer(buf, BUFFER_LOCK_UNLOCK); /* chain to a new overflow page */ - buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf)); + buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf), isbuild); page = BufferGetPage(buf); /* should fit now, given test above */ @@ -213,7 +213,7 @@ restart_insert: MarkBufferDirty(metabuf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { xl_hash_insert xlrec; XLogRecPtr recptr; @@ -249,7 +249,7 @@ restart_insert: /* Attempt to split if a split is needed */ if (do_expand) - _hash_expandtable(rel, metabuf); + _hash_expandtable(rel, metabuf, isbuild); /* Finally drop our pin on the metapage */ _hash_dropbuf(rel, metabuf); diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index 4f5fd3b2837..401447ada3e 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -109,7 +109,7 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno) * pages might have been added to the bucket chain in between. */ Buffer -_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin) +_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, bool isbuild) { Buffer ovflbuf; Page page; @@ -379,7 +379,7 @@ found: MarkBufferDirty(buf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { XLogRecPtr recptr; xl_hash_add_ovfl_page xlrec; @@ -490,7 +490,7 @@ BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets, Size *tups_size, uint16 nitups, - BufferAccessStrategy bstrategy) + BufferAccessStrategy bstrategy, bool isbuild) { HashMetaPage metap; Buffer metabuf; @@ -575,7 +575,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE); /* This operation needs to log multiple tuples, prepare WAL for that */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups); START_CRIT_SECTION(); @@ -642,7 +642,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, } /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { xl_hash_squeeze_page xlrec; XLogRecPtr recptr; @@ -843,7 +843,8 @@ _hash_squeezebucket(Relation rel, Bucket bucket, BlockNumber bucket_blkno, Buffer bucket_buf, - BufferAccessStrategy bstrategy) + BufferAccessStrategy bstrategy, + bool isbuild) { BlockNumber wblkno; BlockNumber rblkno; @@ -965,7 +966,7 @@ readpage: * This operation needs to log multiple tuples, prepare * WAL for that. */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) XLogEnsureRecordSpace(0, 3 + nitups); START_CRIT_SECTION(); @@ -984,7 +985,7 @@ readpage: MarkBufferDirty(rbuf); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { XLogRecPtr recptr; xl_hash_move_page_contents xlrec; @@ -1094,7 +1095,7 @@ readpage: /* free this overflow page (releases rbuf) */ _hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets, - tups_size, nitups, bstrategy); + tups_size, nitups, bstrategy, isbuild); /* be tidy */ for (i = 0; i < nitups; i++) diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index b8e5bd005e5..f5e310ab2b3 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -45,7 +45,7 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf, Buffer nbuf, HTAB *htab, uint32 maxbucket, - uint32 highmask, uint32 lowmask); + uint32 highmask, uint32 lowmask, bool isbuild); static void log_split_page(Relation rel, Buffer buf); @@ -324,7 +324,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so) * multiple buffer locks is ignored. */ uint32 -_hash_init(Relation rel, double num_tuples, ForkNumber forkNum) +_hash_init(Relation rel, double num_tuples, ForkNumber forkNum, bool skip_wal) { Buffer metabuf; Buffer buf; @@ -349,7 +349,7 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum) * init fork. Init forks for unlogged relations always need to be WAL * logged. */ - use_wal = RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM; + use_wal = !skip_wal && (RelationNeedsWAL(rel) || forkNum == INIT_FORKNUM); /* * Determine the target fill factor (in tuples per bucket) for this index. @@ -611,7 +611,7 @@ _hash_pageinit(Page page, Size size) * The buffer is returned in the same state. */ void -_hash_expandtable(Relation rel, Buffer metabuf) +_hash_expandtable(Relation rel, Buffer metabuf, bool isbuild) { HashMetaPage metap; Bucket old_bucket; @@ -759,7 +759,7 @@ restart_expand: hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL, maxbucket, highmask, lowmask, NULL, NULL, true, - NULL, NULL); + NULL, NULL, isbuild); _hash_dropbuf(rel, buf_oblkno); @@ -897,7 +897,7 @@ restart_expand: MarkBufferDirty(buf_nblkno); /* XLOG stuff */ - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { xl_hash_split_allocate_page xlrec; XLogRecPtr recptr; @@ -948,7 +948,7 @@ restart_expand: _hash_splitbucket(rel, metabuf, old_bucket, new_bucket, buf_oblkno, buf_nblkno, NULL, - maxbucket, highmask, lowmask); + maxbucket, highmask, lowmask, isbuild); /* all done, now release the pins on primary buckets. */ _hash_dropbuf(rel, buf_oblkno); @@ -1079,7 +1079,8 @@ _hash_splitbucket(Relation rel, HTAB *htab, uint32 maxbucket, uint32 highmask, - uint32 lowmask) + uint32 lowmask, + bool isbuild) { Buffer bucket_obuf; Buffer bucket_nbuf; @@ -1186,7 +1187,8 @@ _hash_splitbucket(Relation rel, _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups); MarkBufferDirty(nbuf); /* log the split operation before releasing the lock */ - log_split_page(rel, nbuf); + if (RelationNeedsWAL(rel) && !isbuild) + log_split_page(rel, nbuf); END_CRIT_SECTION(); @@ -1200,7 +1202,7 @@ _hash_splitbucket(Relation rel, all_tups_size = 0; /* chain to a new overflow page */ - nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf)); + nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf), isbuild); npage = BufferGetPage(nbuf); nopaque = HashPageGetOpaque(npage); } @@ -1237,7 +1239,9 @@ _hash_splitbucket(Relation rel, _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups); MarkBufferDirty(nbuf); /* log the split operation before releasing the lock */ - log_split_page(rel, nbuf); + + if (RelationNeedsWAL(rel) && !isbuild) + log_split_page(rel, nbuf); END_CRIT_SECTION(); @@ -1294,7 +1298,7 @@ _hash_splitbucket(Relation rel, MarkBufferDirty(bucket_obuf); MarkBufferDirty(bucket_nbuf); - if (RelationNeedsWAL(rel)) + if (RelationNeedsWAL(rel) && !isbuild) { XLogRecPtr recptr; xl_hash_split_complete xlrec; @@ -1331,7 +1335,7 @@ _hash_splitbucket(Relation rel, hashbucketcleanup(rel, obucket, bucket_obuf, BufferGetBlockNumber(bucket_obuf), NULL, maxbucket, highmask, lowmask, NULL, NULL, true, - NULL, NULL); + NULL, NULL, isbuild); } else { @@ -1455,7 +1459,7 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket, _hash_splitbucket(rel, metabuf, obucket, nbucket, obuf, bucket_nbuf, tidhtab, - maxbucket, highmask, lowmask); + maxbucket, highmask, lowmask, false); _hash_dropbuf(rel, bucket_nbuf); hash_destroy(tidhtab); diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c index 6e8c0e68a92..870396841de 100644 --- a/src/backend/access/hash/hashsort.c +++ b/src/backend/access/hash/hashsort.c @@ -61,6 +61,8 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets) { HSpool *hspool = (HSpool *) palloc0(sizeof(HSpool)); + elog(WARNING, "spool init"); + hspool->index = index; /* @@ -146,7 +148,7 @@ _h_indexbuild(HSpool *hspool, Relation heapRel) #endif /* the tuples are sorted by hashkey, so pass 'sorted' as true */ - _hash_doinsert(hspool->index, itup, heapRel, true); + _hash_doinsert(hspool->index, itup, heapRel, true, true); /* allow insertion phase to be interrupted, and track progress */ CHECK_FOR_INTERRUPTS(); diff --git a/src/include/access/hash.h b/src/include/access/hash.h index 073ad29b19b..f4afbc78f38 100644 --- a/src/include/access/hash.h +++ b/src/include/access/hash.h @@ -394,7 +394,7 @@ extern StrategyNumber hashtranslatecmptype(CompareType cmptype, Oid opfamily); /* hashinsert.c */ extern void _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel, - bool sorted); + bool sorted, bool isbuild); extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup, bool appendtup); @@ -402,15 +402,15 @@ extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups, OffsetNumber *itup_offsets, uint16 nitups); /* hashovfl.c */ -extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin); +extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, bool isbuild); extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets, - Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy); + Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy, bool isbuild); extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage); extern void _hash_squeezebucket(Relation rel, Bucket bucket, BlockNumber bucket_blkno, Buffer bucket_buf, - BufferAccessStrategy bstrategy); + BufferAccessStrategy bstrategy, bool isbuild); extern uint32 _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno); /* hashpage.c */ @@ -435,11 +435,11 @@ extern void _hash_relbuf(Relation rel, Buffer buf); extern void _hash_dropbuf(Relation rel, Buffer buf); extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so); extern uint32 _hash_init(Relation rel, double num_tuples, - ForkNumber forkNum); + ForkNumber forkNum, bool skip_wal); extern void _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid, uint16 ffactor, bool initpage); extern void _hash_pageinit(Page page, Size size); -extern void _hash_expandtable(Relation rel, Buffer metabuf); +extern void _hash_expandtable(Relation rel, Buffer metabuf, bool isbuild); extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket, uint32 maxbucket, uint32 highmask, uint32 lowmask); @@ -485,6 +485,6 @@ extern void hashbucketcleanup(Relation rel, Bucket cur_bucket, uint32 maxbucket, uint32 highmask, uint32 lowmask, double *tuples_removed, double *num_index_tuples, bool split_cleanup, - IndexBulkDeleteCallback callback, void *callback_state); + IndexBulkDeleteCallback callback, void *callback_state, bool isbuild); #endif /* HASH_H */ -- 2.43.0 ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Use log_newpage_range in HASH index build @ 2025-12-23 11:53 lakshmi <[email protected]> parent: Kirill Reshke <[email protected]> 1 sibling, 1 reply; 6+ messages in thread From: lakshmi @ 2025-12-23 11:53 UTC (permalink / raw) To: Kirill Reshke <[email protected]>; +Cc: pgsql-hackers Hi Kirill, I tested your patch on the current master and confirmed the WAL reduction during HASH index build. While testing, I noticed a possible small follow-up improvement in HASH overflow handling. Currently, any free overflow page may be reused, which can scatter overflow chains and hurt cache locality. Reusing recently freed overflow pages first could help, without changing WAL behavior or on-disk format. I would like to work on this as a follow-up enhancement and would welcome any suggestions. Best regards, Lakshmi On Tue, Dec 23, 2025 at 2:31 PM Kirill Reshke <[email protected]> wrote: > There exists an optimization to index creation process, when we omit > to write any WAL > for index build. It is currently supported in B Tree, GIN, GiST, spg > indexes. > It works because we do not need to recover anything if index creation > fails, because if was not used by any query. So, the index can be > built on-disk, and then, just before making the index alive, we can > simply log all pages to WAL. > > Hash index currently lacks this optimization. > PFA implementation. > > During my testing, I checked the amount of WAL generated by index > build before and after patch applied. My script was something like: > > select pg_current_wal_insert_lsn(); > > create index on t using hash (i); > > select pg_current_wal_insert_lsn(); > > select pg_lsn_wal_diff(lsn1, lsn2); > > Resulting numbers depend on index size, but I got 2.5-3.5 times less > WAL with this patch and 8 times less WAL with this patch + > wal_compression=on. > Index creation time, however, did not change much... > > About implementation: > These are many types of record that can be generated during index build. > I know for sure these are possible (double-checked using pg_waldump): > > SPLIT_COMPLETE > INSERT > SPLIT_ALLOCATE_PAGE > SPLIT_PAGE > ADD_OVFL_PAGE > SQUEEZE_PAGE > INIT_META_PAGE > INIT_BITMAP_PAGE > > > Looks like SPLIT_COMPLETE and VACUUM_ONE_PAGE are never generated > during index build. I'm not sure about MOVE_PAGE_CONTENTS. > > So, implementation is simply pass isbuild flag everywhere something is > wal-logged. Looks like it is less invasive than alternatives. > > -- > Best regards, > Kirill Reshke > ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Use log_newpage_range in HASH index build @ 2026-01-05 08:50 lakshmi <[email protected]> parent: lakshmi <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: lakshmi @ 2026-01-05 08:50 UTC (permalink / raw) To: Kirill Reshke <[email protected]>; +Cc: pgsql-hackers Hi Kirill, Following up on my earlier note, I implemented the proposed HASH overflow page reuse enhancement.Recently freed overflow pages are recorded in _hash_freeovflpage( ),and _hash_addovflpage( ) now prefers reusing those pages during allocation before falling back to the bitmap scan. The change is backend-local and allocation-only,with no WAL or on-disk format changes.I verified correctness using index build/drop and VACUUM cycles,and confirmed WAL neutrality by comparing WAL generated during HASH index builds with and without this change (no observable difference beyond normal noise). The patch is attached for review.Feedback is welcome. Best regards, Lakshmi On Tue, Dec 23, 2025 at 5:23 PM lakshmi <[email protected]> wrote: > Hi Kirill, > > I tested your patch on the current master and confirmed the WAL reduction > during HASH index build. > > While testing, I noticed a possible small follow-up improvement in HASH > overflow handling. Currently, any free overflow page may be reused, which > can scatter overflow chains and hurt cache locality. Reusing recently freed > overflow pages first could help, without changing WAL behavior or on-disk > format. > > I would like to work on this as a follow-up enhancement and would welcome > any suggestions. > > Best regards, > Lakshmi > > On Tue, Dec 23, 2025 at 2:31 PM Kirill Reshke <[email protected]> > wrote: > >> There exists an optimization to index creation process, when we omit >> to write any WAL >> for index build. It is currently supported in B Tree, GIN, GiST, spg >> indexes. >> It works because we do not need to recover anything if index creation >> fails, because if was not used by any query. So, the index can be >> built on-disk, and then, just before making the index alive, we can >> simply log all pages to WAL. >> >> Hash index currently lacks this optimization. >> PFA implementation. >> >> During my testing, I checked the amount of WAL generated by index >> build before and after patch applied. My script was something like: >> >> select pg_current_wal_insert_lsn(); >> >> create index on t using hash (i); >> >> select pg_current_wal_insert_lsn(); >> >> select pg_lsn_wal_diff(lsn1, lsn2); >> >> Resulting numbers depend on index size, but I got 2.5-3.5 times less >> WAL with this patch and 8 times less WAL with this patch + >> wal_compression=on. >> Index creation time, however, did not change much... >> >> About implementation: >> These are many types of record that can be generated during index build. >> I know for sure these are possible (double-checked using pg_waldump): >> >> SPLIT_COMPLETE >> INSERT >> SPLIT_ALLOCATE_PAGE >> SPLIT_PAGE >> ADD_OVFL_PAGE >> SQUEEZE_PAGE >> INIT_META_PAGE >> INIT_BITMAP_PAGE >> >> >> Looks like SPLIT_COMPLETE and VACUUM_ONE_PAGE are never generated >> during index build. I'm not sure about MOVE_PAGE_CONTENTS. >> >> So, implementation is simply pass isbuild flag everywhere something is >> wal-logged. Looks like it is less invasive than alternatives. >> >> -- >> Best regards, >> Kirill Reshke >> > Attachments: [text/x-patch] 0001-hash-reuse-recently-freed-overflow-pages.patch (3.4K, 3-0001-hash-reuse-recently-freed-overflow-pages.patch) download | inline diff: From 34d4d99baa5377a4dabb8cf9c4bcd0e9bb2c6376 Mon Sep 17 00:00:00 2001 From: Lakshmi <[email protected]> Date: Mon, 5 Jan 2026 10:38:23 +0530 Subject: [PATCH] hash: reuse recently freed overflow pages --- src/backend/access/hash/hashovfl.c | 77 ++++++++++++++++++++++++++++++ 1 file changed, 77 insertions(+) diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index 3277da19840..c66f8adc056 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -22,6 +22,63 @@ #include "access/xloginsert.h" #include "miscadmin.h" #include "utils/rel.h" +/*------------------------------------------------------------------------- + * Backend-local reuse cache for recently freed overflow pages. + * Best-effort hint only: no correctness or WAL dependency. + *------------------------------------------------------------------------- + */ +#define HASH_OVFL_REUSE_CACHE_SIZE 32 + +typedef struct HashOvflReuseCache +{ + BlockNumber blocks[HASH_OVFL_REUSE_CACHE_SIZE]; + int head; + int count; +} HashOvflReuseCache; + +static HashOvflReuseCache ovfl_reuse_cache; + +/* Record a freed overflow page */ +static void +hash_record_freed_ovflpage(BlockNumber blkno) +{ + ovfl_reuse_cache.blocks[ovfl_reuse_cache.head] = blkno; + ovfl_reuse_cache.head = + (ovfl_reuse_cache.head + 1) % HASH_OVFL_REUSE_CACHE_SIZE; + + if (ovfl_reuse_cache.count < HASH_OVFL_REUSE_CACHE_SIZE) + ovfl_reuse_cache.count++; +} + +/* Try to reuse a recently freed overflow page */ +static bool +hash_try_reuse_cached_ovflpage(Relation rel, Buffer metabuf, + BlockNumber *reuse_blkno) +{ + HashMetaPage metap = HashPageGetMeta(BufferGetPage(metabuf)); + int i; + + for (i = 0; i < ovfl_reuse_cache.count; i++) + { + int idx = + (ovfl_reuse_cache.head - 1 - i + HASH_OVFL_REUSE_CACHE_SIZE) + % HASH_OVFL_REUSE_CACHE_SIZE; + + BlockNumber blkno = ovfl_reuse_cache.blocks[idx]; + uint32 bitno = _hash_ovflblkno_to_bitno(metap, blkno); + + /* Bitmap is authoritative */ + if (_hash_isbitset(metap->hashm_mapp, bitno)) + { + _hash_clrbit(metap->hashm_mapp, bitno); + *reuse_blkno = blkno; + return true; + } + } + + return false; +} + static uint32 _hash_firstfreebit(uint32 map); @@ -132,6 +189,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, boo uint32 i, j; bool page_found = false; + BlockNumber reuse_blkno; + + + /* * Write-lock the tail page. Here, we need to maintain locking order such @@ -186,6 +247,16 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin, boo _hash_checkpage(rel, metabuf, LH_META_PAGE); metap = HashPageGetMeta(BufferGetPage(metabuf)); + /* + * NEW: Try reuse cache before scanning bitmap + */ + if (hash_try_reuse_cached_ovflpage(rel, metabuf, &reuse_blkno)) + { + ovflbuf = _hash_getinitbuf(rel, reuse_blkno); + page_found = true; + goto found; + } + /* start search at hashm_firstfree */ orig_firstfree = metap->hashm_firstfree; first_page = orig_firstfree >> BMPG_SHIFT(metap); @@ -633,6 +704,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, CLRBIT(freep, bitmapbit); MarkBufferDirty(mapbuf); + /* + * NEW: Remember this overflow page for reuse + */ + + hash_record_freed_ovflpage(ovflblkno); + /* if this is now the first free page, update hashm_firstfree */ if (ovflbitno < metap->hashm_firstfree) { -- 2.39.5 ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Use log_newpage_range in HASH index build @ 2026-04-26 17:09 Robert Haas <[email protected]> parent: Kirill Reshke <[email protected]> 1 sibling, 0 replies; 6+ messages in thread From: Robert Haas @ 2026-04-26 17:09 UTC (permalink / raw) To: Kirill Reshke <[email protected]>; +Cc: pgsql-hackers On Thu, Oct 23, 2025 at 3:21 PM Kirill Reshke <[email protected]> wrote: > So, implementation is simply pass isbuild flag everywhere something is > wal-logged. Looks like it is less invasive than alternatives. I think that in order to be seriously considered, this patch will need more than zero words of comments and more than a one-line commit message. The XXX comment would need addressing, too. -- Robert Haas EDB: http://www.enterprisedb.com ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Use log_newpage_range in HASH index build @ 2026-04-26 17:12 Robert Haas <[email protected]> parent: lakshmi <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Robert Haas @ 2026-04-26 17:12 UTC (permalink / raw) To: lakshmi <[email protected]>; +Cc: Kirill Reshke <[email protected]>; pgsql-hackers On Mon, Jan 5, 2026 at 3:48 AM lakshmi <[email protected]> wrote: > Following up on my earlier note, I implemented the proposed HASH overflow page reuse enhancement.Recently freed overflow pages are recorded in > _hash_freeovflpage( ),and _hash_addovflpage( ) now prefers reusing those pages during allocation before falling back to the bitmap scan. > > The change is backend-local and allocation-only,with no WAL or on-disk format changes.I verified correctness using index build/drop and VACUUM cycles,and confirmed WAL neutrality by comparing WAL generated during HASH index builds with and without this change (no observable difference beyond normal noise). > > The patch is attached for review.Feedback is welcome. It is probably best to start a separate email thread specifically about this patch, because it's doing something quite different from the original patch. I suggest presenting some specific performance results, as it's not very clear how much benefit this might have. It might be good to test a favorable scenario and also an unfavorable scenario, describing and showing results for each. -- Robert Haas EDB: http://www.enterprisedb.com ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Use log_newpage_range in HASH index build @ 2026-04-27 17:10 lakshmi <[email protected]> parent: Robert Haas <[email protected]> 0 siblings, 0 replies; 6+ messages in thread From: lakshmi @ 2026-04-27 17:10 UTC (permalink / raw) To: Robert Haas <[email protected]>; +Cc: Kirill Reshke <[email protected]>; pgsql-hackers Hi, > It is probably best to start a separate email thread specifically > about this patch, because it's doing something quite different from > the original patch. I suggest presenting some specific performance > results, as it's not very clear how much benefit this might have. It > might be good to test a favorable scenario and also an unfavorable > scenario, describing and showing results for each. > > > Thanks for the suggestion. I’ll start a separate thread for this patch > with performance results. > > Regards, > Lakshmi > ^ permalink raw reply [nested|flat] 6+ messages in thread
end of thread, other threads:[~2026-04-27 17:10 UTC | newest] Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2025-10-23 19:21 Use log_newpage_range in HASH index build Kirill Reshke <[email protected]> 2025-12-23 11:53 ` lakshmi <[email protected]> 2026-01-05 08:50 ` lakshmi <[email protected]> 2026-04-26 17:12 ` Robert Haas <[email protected]> 2026-04-27 17:10 ` lakshmi <[email protected]> 2026-04-26 17:09 ` Robert Haas <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox