public inbox for [email protected]help / color / mirror / Atom feed
Addressing buffer private reference count scalability issue 6+ messages / 2 participants [nested] [flat]
* Addressing buffer private reference count scalability issue @ 2026-03-08 16:09 Alexandre Felipe <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Alexandre Felipe @ 2026-03-08 16:09 UTC (permalink / raw) To: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]> Hi Hackers, This patch addresses a performance issue pointed out by Andres Freund, 1. Benchmark buffer pinning: You know benchmark code, implemented a few functions that can be use in postgres queries, and a python script that runs them and produces CSV files and SVG plots for the current build. 2. Refactoring reference counting: Before starting to change code and potentially breaking things I considered prudent to isolate it to limit the damage. This code was part of a 8k+ LOC file. 3. Using simplehash: Simply replacing the HTAB for a simplehash, and adding a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE. To allow using the InvalidBuffer special value instead of allocating extra space for a validity flag. Here I assume that the buffer buffer sequence is independent enough from the array size, so I use the buffer as the hash key directly, omitting a hash function call. 4. Compact PrivateRefCountEntry: The original implementation used a 4-byte key and 8-byte value. Reference count uses 32 bits, and it is unreasonable to expect one backend to pin the same buffer 1 billion times. The lock mode uses 32 bits but can only assume 4 values. So I packed them in one single uint32, giving 30 bits for count and 2 bits for lock mode. This makes the entries 8-byte long, on 64-bit CPUs this represents more than a 1/3 reduction in memory. This makes the array aligned with the 64-bit words, copying one entry can be completed in one instruction, and every entry will be aligned. 5. REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array lookup, it is worth trying to remove it completely and keep only the hash. For small values we would be trading a few branches for a buffer % SIZE, for the use case of prefetch where pin/unpin in a FIFO fashion, it will save an 8-entry array lookup, and some extra data moves. In addition to the patch I am including - A bash script to apply and benchmark the patches sequentially. You might have to adjust REPO_ROOT, in my case it gets it relative to the script path, that is under $REPO_ROOT/.patches/pins/. - A compare-patches.py script that can be copied to src/test/modules/test_buffer_pin to process the benchmark CSV in figures showing one metric for different patches instead of different metrics for one patch as the benchmark.py produces. - A nicely formatted post about this [2] Regards, Alexandre [1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5... [2] https://afelipe.hashnode.dev/postgres-backend-buffer-pinning-algorithm Attachments: [application/octet-stream] v1-0003-Using-simplehash.patch (12.8K, 3-v1-0003-Using-simplehash.patch) download | inline diff: From 077520420223d3bc14c9f7b073c15021aae20388 Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Fri, 6 Mar 2026 16:55:43 +0000 Subject: [PATCH 3/5] Using simplehash This patch replaces the HTAB implementation with a simplehash as suggested by Andres Freund [1]. The simplehash implements a templated open addressing hash, which have entry size information at compile time. The access strategy of the simplehash is very close to the plain array that was seen to be very efficient compared to the HTAB implementation, with the additional advantage of using the key value to initialize the search closer to where the key actually is, instead of always scanning the entire array. --- src/backend/storage/buffer/buf_refcount.c | 86 +++++++++++------------ src/include/lib/simplehash.h | 59 ++++++++++------ 2 files changed, 81 insertions(+), 64 deletions(-) diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c index 1c0bec29c93..ff37355a61e 100644 --- a/src/backend/storage/buffer/buf_refcount.c +++ b/src/backend/storage/buffer/buf_refcount.c @@ -40,10 +40,10 @@ */ #include "postgres.h" +#include "common/hashfn.h" #include "storage/buf_internals.h" #include "storage/buf_refcount.h" #include "storage/proc.h" -#include "utils/hsearch.h" @@ -55,15 +55,36 @@ typedef struct PrivateRefCountData struct PrivateRefCountEntry { - Buffer buffer; + Buffer buffer; /* hash key - InvalidBuffer means empty */ PrivateRefCountData data; }; +/* + * Define simplehash parameters for the overflow hash table. + * We use buffer == InvalidBuffer as the "empty" marker to avoid needing + * a separate status field. + */ +#define SH_PREFIX refcount +#define SH_ELEMENT_TYPE PrivateRefCountEntry +#define SH_KEY_TYPE Buffer +#define SH_KEY buffer +#define SH_HASH_KEY(tb, key) murmurhash32(key) +#define SH_EQUAL(tb, a, b) ((a) == (b)) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +/* Use buffer field as empty marker - no separate status needed */ +#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer) +#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer) +#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment handles this */ +#include "lib/simplehash.h" + struct PrivateRefCountIterator { int array_index; bool in_hash; - HASH_SEQ_STATUS *hash_status; + refcount_iterator hash_iter; + bool hash_iter_valid; }; /* Private refcount array and keys */ @@ -72,7 +93,7 @@ static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; /* Overflow hash table for when array is full */ -static HTAB *PrivateRefCountHash = NULL; +static refcount_hash *PrivateRefCountHash = NULL; /* Count of entries that have overflowed into the hash table */ static int32 PrivateRefCountOverflowed = 0; @@ -100,26 +121,18 @@ static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buff void InitPrivateRefCount(void) { - HASHCTL hash_ctl; - - /* * An advisory limit on the number of pins each backend should hold, based * on shared_buffers and the maximum number of connections possible. * That's very pessimistic, but outside toy-sized shared_buffers it should * allow plenty of pins. LimitAdditionalPins() and * GetAdditionalPinLimit() can be used to check the remaining balance. - */ - MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); - + */ + MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - hash_ctl.keysize = sizeof(Buffer); - hash_ctl.entrysize = sizeof(PrivateRefCountEntry); - - PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl, - HASH_ELEM | HASH_BLOBS); + PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL); } /* @@ -173,10 +186,9 @@ ReservePrivateRefCountEntry(void) Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); /* enter victim array entry into hashtable */ - hashent = hash_search(PrivateRefCountHash, - &PrivateRefCountArrayKeys[victim_slot], - HASH_ENTER, - &found); + hashent = refcount_insert(PrivateRefCountHash, + PrivateRefCountArrayKeys[victim_slot], + &found); Assert(!found); hashent->data = victim_entry->data; @@ -253,7 +265,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) if (PrivateRefCountOverflowed == 0) return NULL; - res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL); + res = refcount_lookup(PrivateRefCountHash, buffer); if (res == NULL) return NULL; @@ -264,7 +276,6 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) else { /* move buffer from hashtable into the free array slot */ - bool found; PrivateRefCountEntry *free; ReservePrivateRefCountEntry(); @@ -280,8 +291,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) ReservedRefCountSlot = -1; - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); + refcount_delete_item(PrivateRefCountHash, res); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; @@ -384,11 +394,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref) } else { - bool found; - Buffer buffer = ref->buffer; - - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); + refcount_delete_item(PrivateRefCountHash, ref); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; } @@ -456,10 +462,10 @@ CheckPrivateRefCountLeaks(void) /* if necessary search the hash */ if (PrivateRefCountOverflowed) { - HASH_SEQ_STATUS hstat; + refcount_iterator iter; - hash_seq_init(&hstat, PrivateRefCountHash); - while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) + refcount_start_iterate(PrivateRefCountHash, &iter); + while ((res = refcount_iterate(PrivateRefCountHash, &iter)) != NULL) { s = DebugPrintBufferRefcount(res->buffer); elog(WARNING, "buffer refcount leak: %s", s); @@ -482,7 +488,7 @@ InitPrivateRefCountIterator(void) iter->array_index = 0; iter->in_hash = false; - iter->hash_status = NULL; + iter->hash_iter_valid = false; return iter; } @@ -508,21 +514,20 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) iter->in_hash = true; if (PrivateRefCountOverflowed > 0) { - iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS)); - hash_seq_init(iter->hash_status, PrivateRefCountHash); + refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter); + iter->hash_iter_valid = true; } } - if (iter->hash_status != NULL) + if (iter->hash_iter_valid) { PrivateRefCountEntry *res; - res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status); + res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter); if (res != NULL) return res; - pfree(iter->hash_status); - iter->hash_status = NULL; + iter->hash_iter_valid = false; } return NULL; @@ -534,11 +539,6 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) void FreePrivateRefCountIterator(PrivateRefCountIterator *iter) { - if (iter->hash_status != NULL) - { - hash_seq_term(iter->hash_status); - pfree(iter->hash_status); - } pfree(iter); } diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h index 848719232a4..3c03a7e9c9b 100644 --- a/src/include/lib/simplehash.h +++ b/src/include/lib/simplehash.h @@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb); #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey)) #endif +/* + * Macros to check/set entry status. Users can override these to avoid + * needing a separate status field if their key type has an "invalid" value. + */ +#ifndef SH_ENTRY_EMPTY +#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY) +#endif +#ifndef SH_MAKE_EMPTY +#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY) +#endif +#ifndef SH_MAKE_IN_USE +#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE) +#endif + /* * Wrap the following definitions in include guards, to avoid multiple * definition errors if this header is included more than once. The rest of @@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) uint32 hash; uint32 optimal; - if (oldentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(oldentry)) { startelem = i; break; @@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) { SH_ELEMENT_TYPE *oldentry = &olddata[copyelem]; - if (oldentry->status == SH_STATUS_IN_USE) + if (!SH_ENTRY_EMPTY(oldentry)) { uint32 hash; uint32 startelem2; @@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) { newentry = &newdata[curelem]; - if (newentry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(newentry)) { break; } @@ -653,14 +667,14 @@ restart: SH_ELEMENT_TYPE *entry = &data[curelem]; /* any empty bucket can directly be used */ - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) { tb->members++; entry->SH_KEY = key; #ifdef SH_STORE_HASH SH_GET_HASH(tb, entry) = hash; #endif - entry->status = SH_STATUS_IN_USE; + SH_MAKE_IN_USE(entry); *found = false; return entry; } @@ -675,7 +689,7 @@ restart: if (SH_COMPARE_KEYS(tb, hash, key, entry)) { - Assert(entry->status == SH_STATUS_IN_USE); + Assert(!SH_ENTRY_EMPTY(entry)); *found = true; return entry; } @@ -699,7 +713,7 @@ restart: emptyelem = SH_NEXT(tb, emptyelem, startelem); emptyentry = &data[emptyelem]; - if (emptyentry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(emptyentry)) { lastentry = emptyentry; break; @@ -748,7 +762,7 @@ restart: #ifdef SH_STORE_HASH SH_GET_HASH(tb, entry) = hash; #endif - entry->status = SH_STATUS_IN_USE; + SH_MAKE_IN_USE(entry); *found = false; return entry; } @@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash) { SH_ELEMENT_TYPE *entry = &tb->data[curelem]; - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) { return NULL; } - Assert(entry->status == SH_STATUS_IN_USE); + Assert(!SH_ENTRY_EMPTY(entry)); if (SH_COMPARE_KEYS(tb, hash, key, entry)) return entry; @@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) { SH_ELEMENT_TYPE *entry = &tb->data[curelem]; - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) return false; - if (entry->status == SH_STATUS_IN_USE && + if (!SH_ENTRY_EMPTY(entry) && SH_COMPARE_KEYS(tb, hash, key, entry)) { SH_ELEMENT_TYPE *lastentry = entry; @@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) curelem = SH_NEXT(tb, curelem, startelem); curentry = &tb->data[curelem]; - if (curentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(curentry)) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) /* current is at optimal position, done */ if (curoptimal == curelem) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry) curelem = SH_NEXT(tb, curelem, startelem); curentry = &tb->data[curelem]; - if (curentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(curentry)) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry) /* current is at optimal position, done */ if (curoptimal == curelem) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter) { SH_ELEMENT_TYPE *entry = &tb->data[i]; - if (entry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(entry)) { startelem = i; break; @@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter) if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask)) iter->done = true; - if (elem->status == SH_STATUS_IN_USE) + if (!SH_ENTRY_EMPTY(elem)) { return elem; } @@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb) elem = &tb->data[i]; - if (elem->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(elem)) continue; hash = SH_ENTRY_HASH(tb, elem); @@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb) #undef SH_STORE_HASH #undef SH_USE_NONDEFAULT_ALLOCATOR #undef SH_EQUAL +#undef SH_ENTRY_EMPTY +#undef SH_MAKE_EMPTY +#undef SH_MAKE_IN_USE /* undefine locally declared macros */ #undef SH_MAKE_PREFIX -- 2.53.0 [application/octet-stream] v1-0004-Compact-PrivateRefCountEntry.patch (10.8K, 4-v1-0004-Compact-PrivateRefCountEntry.patch) download | inline diff: From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Fri, 6 Mar 2026 17:15:44 +0000 Subject: [PATCH 4/5] Compact PrivateRefCountEntry The previous implementation used an 8-bite (64-bit) entry to store a uint32 count and an uint32 lock mode. That is fine when we store the data separate from the key (buffer). But in the simplehash {key, value} are stored together, so each entry is 12-bytes. This is somewhat awkward as we have to either pad the entry to 16-bytes, or the access will be an alternating aligned/misaligned addreses. However, we are probably not going to need even 16-bits for the count and 2-bits is enough for the lock mode. So we can easily pack these int to a single uint32. Incrementing/decrementing the count continue the same, just using 4 instead of 1, lock mode access will require one or two additional bitwise operations. No bit-shifts are required. --- src/backend/storage/buffer/buf_refcount.c | 167 +++++++++------------- 1 file changed, 70 insertions(+), 97 deletions(-) diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c index ff37355a61e..29dfb720997 100644 --- a/src/backend/storage/buffer/buf_refcount.c +++ b/src/backend/storage/buffer/buf_refcount.c @@ -40,53 +40,54 @@ */ #include "postgres.h" -#include "common/hashfn.h" #include "storage/buf_internals.h" #include "storage/buf_refcount.h" #include "storage/proc.h" -typedef struct PrivateRefCountData +/* + * Implementation details - opaque to callers. + * Packed refcount and lockmode in a single uint32: + * Bits 0-1: lockmode (4 values: UNLOCK, SHARE, SHARE_EXCLUSIVE, EXCLUSIVE) + * Bits 2-31: refcount (30 bits, max ~1 billion) + */ +struct PrivateRefCountEntry { - int32 refcount; - BufferLockMode lockmode; -} PrivateRefCountData; + Buffer buffer; + uint32 data; +}; -struct PrivateRefCountEntry +#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3 +#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */ + +#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK)) +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m)) +#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2)) +#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE) + +struct PrivateRefCountIterator { - Buffer buffer; /* hash key - InvalidBuffer means empty */ - PrivateRefCountData data; + int array_index; + bool in_hash; + void *hash_status; }; -/* - * Define simplehash parameters for the overflow hash table. - * We use buffer == InvalidBuffer as the "empty" marker to avoid needing - * a separate status field. - */ +/* Define simplehash for private refcount overflow hash table */ #define SH_PREFIX refcount #define SH_ELEMENT_TYPE PrivateRefCountEntry #define SH_KEY_TYPE Buffer #define SH_KEY buffer -#define SH_HASH_KEY(tb, key) murmurhash32(key) +#define SH_HASH_KEY(tb, key) ((uint32) (key)) #define SH_EQUAL(tb, a, b) ((a) == (b)) #define SH_SCOPE static inline -#define SH_DEFINE -#define SH_DECLARE -/* Use buffer field as empty marker - no separate status needed */ #define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer) #define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer) -#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment handles this */ +#define SH_MAKE_IN_USE(entry) ((void) 0) +#define SH_DECLARE +#define SH_DEFINE #include "lib/simplehash.h" -struct PrivateRefCountIterator -{ - int array_index; - bool in_hash; - refcount_iterator hash_iter; - bool hash_iter_valid; -}; - /* Private refcount array and keys */ #define REFCOUNT_ARRAY_ENTRIES 8 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; @@ -113,7 +114,7 @@ static uint32 MaxProportionalPins = 0; /* Forward declarations */ static void ReservePrivateRefCountEntry(void); static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); -static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move); +static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer); /* * Initialize private refcount tracking for this backend. @@ -132,7 +133,7 @@ InitPrivateRefCount(void) memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL); + PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL); } /* @@ -158,6 +159,11 @@ ReservePrivateRefCountEntry(void) if (PrivateRefCountArrayKeys[i] == InvalidBuffer) { ReservedRefCountSlot = i; + + /* + * We could return immediately, but iterating till the end of + * the array allows compiler-autovectorization. + */ } } @@ -195,10 +201,7 @@ ReservePrivateRefCountEntry(void) /* clear the now free array slot */ PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; victim_entry->buffer = InvalidBuffer; - - memset(&victim_entry->data, 0, sizeof(victim_entry->data)); - victim_entry->data.refcount = 0; - victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK; + victim_entry->data = 0; PrivateRefCountOverflowed++; } @@ -221,8 +224,7 @@ NewPrivateRefCountEntry(Buffer buffer) /* and fill it */ PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; res->buffer = buffer; - res->data.refcount = 0; - res->data.lockmode = BUFFER_LOCK_UNLOCK; + res->data = 0; /* update cache for the next lookup */ PrivateRefCountEntryLast = ReservedRefCountSlot; @@ -236,7 +238,7 @@ NewPrivateRefCountEntry(Buffer buffer) * Slow-path for GetSharedBufferEntry(). */ static pg_noinline PrivateRefCountEntry * -GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) +GetPrivateRefCountEntrySlow(Buffer buffer) { PrivateRefCountEntry *res; int match = -1; @@ -266,41 +268,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) return NULL; res = refcount_lookup(PrivateRefCountHash, buffer); - - if (res == NULL) - return NULL; - else if (!do_move) - { - return res; - } - else - { - /* move buffer from hashtable into the free array slot */ - PrivateRefCountEntry *free; - - ReservePrivateRefCountEntry(); - - Assert(ReservedRefCountSlot != -1); - free = &PrivateRefCountArray[ReservedRefCountSlot]; - Assert(free->buffer == InvalidBuffer); - - free->buffer = buffer; - free->data = res->data; - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - PrivateRefCountEntryLast = ReservedRefCountSlot; - - ReservedRefCountSlot = -1; - - refcount_delete_item(PrivateRefCountHash, res); - Assert(PrivateRefCountOverflowed > 0); - PrivateRefCountOverflowed--; - - return free; - } + return res; } /* - * Return the PrivateRefCountEntry for the passed buffer. + * Return the PrivateRefCount entry for the passed buffer. * Returns NULL if the buffer is not currently pinned. */ PrivateRefCountEntry * @@ -316,7 +288,7 @@ GetSharedBufferEntry(Buffer buffer) return &PrivateRefCountArray[PrivateRefCountEntryLast]; } - return GetPrivateRefCountEntrySlow(buffer, false); + return GetPrivateRefCountEntrySlow(buffer); } /* @@ -332,25 +304,20 @@ SharedBufferRef(Buffer buffer) Assert(BufferIsValid(buffer)); Assert(!BufferIsLocal(buffer)); - /* Check cache first, then slow path */ - if (likely(PrivateRefCountEntryLast != -1) && - likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) - { - ref = &PrivateRefCountArray[PrivateRefCountEntryLast]; - } - else - { - ref = GetPrivateRefCountEntrySlow(buffer, true); - } + ref = GetSharedBufferEntry(buffer); if (ref == NULL) { /* New pin - create entry */ ReservePrivateRefCountEntry(); ref = NewPrivateRefCountEntry(buffer); + ref->data = ONE_PRIVATE_REFERENCE; + } + else + { + /* Already pinned - increment */ + ref->data += ONE_PRIVATE_REFERENCE; } - - ref->data.refcount++; return ref; } @@ -363,8 +330,8 @@ void SharedBufferRefExisting(PrivateRefCountEntry *ref) { Assert(ref != NULL); - Assert(ref->data.refcount > 0); - ref->data.refcount++; + Assert(!PrivateRefCountIsZero(ref->data)); + ref->data += ONE_PRIVATE_REFERENCE; } /* @@ -376,14 +343,14 @@ bool SharedBufferUnref(PrivateRefCountEntry *ref) { Assert(ref != NULL); - Assert(ref->data.refcount > 0); + Assert(!PrivateRefCountIsZero(ref->data)); - ref->data.refcount--; + ref->data -= ONE_PRIVATE_REFERENCE; - if (ref->data.refcount == 0) + if (PrivateRefCountIsZero(ref->data)) { /* No more references - clean up the entry */ - Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK); + Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK); if (ref >= &PrivateRefCountArray[0] && ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) @@ -394,7 +361,8 @@ SharedBufferUnref(PrivateRefCountEntry *ref) } else { - refcount_delete_item(PrivateRefCountHash, ref); + /* could make slightly more efficient by using the pointer */ + refcount_delete(PrivateRefCountHash, ref->buffer); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; } @@ -406,24 +374,24 @@ SharedBufferUnref(PrivateRefCountEntry *ref) } /* - * Accessors for refcount entry fields. + * Accessors for refcount entry fields - opaque to callers. */ int32 SharedBufferRefCount(PrivateRefCountEntry *ref) { - return ref->data.refcount; + return PrivateRefCountGetRefCount(ref->data); } BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref) { - return ref->data.lockmode; + return PrivateRefCountGetLockMode(ref->data); } void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode) { - ref->data.lockmode = mode; + PrivateRefCountSetLockMode(ref->data, mode); } Buffer @@ -488,7 +456,7 @@ InitPrivateRefCountIterator(void) iter->array_index = 0; iter->in_hash = false; - iter->hash_iter_valid = false; + iter->hash_status = NULL; return iter; } @@ -514,20 +482,23 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) iter->in_hash = true; if (PrivateRefCountOverflowed > 0) { - refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter); - iter->hash_iter_valid = true; + refcount_iterator *hiter = palloc(sizeof(refcount_iterator)); + + refcount_start_iterate(PrivateRefCountHash, hiter); + iter->hash_status = hiter; } } - if (iter->hash_iter_valid) + if (iter->hash_status != NULL) { PrivateRefCountEntry *res; - res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter); + res = refcount_iterate(PrivateRefCountHash, (refcount_iterator *) iter->hash_status); if (res != NULL) return res; - iter->hash_iter_valid = false; + pfree(iter->hash_status); + iter->hash_status = NULL; } return NULL; @@ -539,6 +510,8 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) void FreePrivateRefCountIterator(PrivateRefCountIterator *iter) { + if (iter->hash_status != NULL) + pfree(iter->hash_status); pfree(iter); } -- 2.53.0 [application/octet-stream] v1-0002-Refactoring-reference-counting.patch (50.6K, 5-v1-0002-Refactoring-reference-counting.patch) download | inline diff: From c8b90725fd033465c68688f4663892ce1196a48e Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Fri, 6 Mar 2026 16:31:00 +0000 Subject: [PATCH 2/5] Refactoring reference counting This patch refactors the reference counting mechanism moving the implementation details away from bufmgr. Unfortunately, this comes with additional calls overhead, but I think that the ease of maintenance will pay off. And with the next optimisations, we will end up better than before. --- src/backend/storage/buffer/Makefile | 1 + src/backend/storage/buffer/buf_refcount.c | 602 ++++++++++++++++++++ src/backend/storage/buffer/bufmgr.c | 661 +++------------------- src/include/storage/buf_refcount.h | 58 ++ 4 files changed, 727 insertions(+), 595 deletions(-) create mode 100644 src/backend/storage/buffer/buf_refcount.c create mode 100644 src/include/storage/buf_refcount.h diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile index fd7c40dcb08..c81271aabf6 100644 --- a/src/backend/storage/buffer/Makefile +++ b/src/backend/storage/buffer/Makefile @@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ buf_init.o \ + buf_refcount.o \ buf_table.o \ bufmgr.o \ freelist.o \ diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c new file mode 100644 index 00000000000..1c0bec29c93 --- /dev/null +++ b/src/backend/storage/buffer/buf_refcount.c @@ -0,0 +1,602 @@ +/*------------------------------------------------------------------------- + * + * buf_refcount.c + * Backend-private buffer refcount tracking + * + * Each buffer has a private refcount that keeps track of the number of + * times the buffer is pinned in the current process. This is so that the + * shared refcount needs to be modified only once if a buffer is pinned more + * than once by an individual backend. This mechanism is also used to track + * whether this backend has a buffer locked, and, if so, in what mode. + * + * To avoid - as we used to - requiring an array with NBuffers entries to keep + * track of local buffers, we use a small sequentially searched array + * (PrivateRefCountArrayKeys, with the corresponding data stored in + * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to + * keep track of backend local pins. + * + * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all + * refcounts are kept track of in the array; after that, new array entries + * displace old ones into the hash table. That way a frequently used entry + * can't get "stuck" in the hashtable while infrequent ones clog the array. + * + * This was initially designed trying to optimize for the case where the + * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES. + * However this might not be the case with the introduction of prefetching. + * + * To enter a buffer into the refcount tracking mechanism first reserve a free + * entry using ReservePrivateRefCountEntry() and then later, if necessary, + * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing + * memory allocations in NewPrivateRefCountEntry() which can be important + * because in some scenarios it's called with a spinlock held... + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/storage/buffer/buf_refcount.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "storage/buf_internals.h" +#include "storage/buf_refcount.h" +#include "storage/proc.h" +#include "utils/hsearch.h" + + + +typedef struct PrivateRefCountData +{ + int32 refcount; + BufferLockMode lockmode; +} PrivateRefCountData; + +struct PrivateRefCountEntry +{ + Buffer buffer; + PrivateRefCountData data; +}; + +struct PrivateRefCountIterator +{ + int array_index; + bool in_hash; + HASH_SEQ_STATUS *hash_status; +}; + +/* Private refcount array and keys */ +#define REFCOUNT_ARRAY_ENTRIES 8 +static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; +static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; + +/* Overflow hash table for when array is full */ +static HTAB *PrivateRefCountHash = NULL; + +/* Count of entries that have overflowed into the hash table */ +static int32 PrivateRefCountOverflowed = 0; + +/* Clock hand for selecting victim when array is full */ +static uint32 PrivateRefCountClock = 0; + +/* Reserved slot index, or -1 if none reserved */ +static int ReservedRefCountSlot = -1; + +/* Cache for last accessed entry */ +static int PrivateRefCountEntryLast = -1; + +/* Advisory limit on the number of pins each backend should hold */ +static uint32 MaxProportionalPins = 0; + +/* Forward declarations */ +static void ReservePrivateRefCountEntry(void); +static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); +static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move); + +/* + * Initialize private refcount tracking for this backend. + */ +void +InitPrivateRefCount(void) +{ + HASHCTL hash_ctl; + + + /* + * An advisory limit on the number of pins each backend should hold, based + * on shared_buffers and the maximum number of connections possible. + * That's very pessimistic, but outside toy-sized shared_buffers it should + * allow plenty of pins. LimitAdditionalPins() and + * GetAdditionalPinLimit() can be used to check the remaining balance. + */ + MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); + + memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); + memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); + + hash_ctl.keysize = sizeof(Buffer); + hash_ctl.entrysize = sizeof(PrivateRefCountEntry); + + PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl, + HASH_ELEM | HASH_BLOBS); +} + +/* + * Ensure that the PrivateRefCountArray has sufficient space to store one more + * entry. + */ +static void +ReservePrivateRefCountEntry(void) +{ + /* Already reserved (or freed), nothing to do */ + if (ReservedRefCountSlot != -1) + return; + + /* + * First search for a free entry the array, that'll be sufficient in the + * majority of cases. + */ + { + int i; + + for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + { + if (PrivateRefCountArrayKeys[i] == InvalidBuffer) + { + ReservedRefCountSlot = i; + } + } + + if (ReservedRefCountSlot != -1) + return; + } + + /* + * No luck. All array entries are full. Move one array entry into the hash + * table. + */ + { + int victim_slot; + PrivateRefCountEntry *victim_entry; + PrivateRefCountEntry *hashent; + bool found; + + /* select victim slot */ + victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES; + victim_entry = &PrivateRefCountArray[victim_slot]; + ReservedRefCountSlot = victim_slot; + + /* Better be used, otherwise we shouldn't get here. */ + Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer); + Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer); + Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); + + /* enter victim array entry into hashtable */ + hashent = hash_search(PrivateRefCountHash, + &PrivateRefCountArrayKeys[victim_slot], + HASH_ENTER, + &found); + Assert(!found); + hashent->data = victim_entry->data; + + /* clear the now free array slot */ + PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; + victim_entry->buffer = InvalidBuffer; + + memset(&victim_entry->data, 0, sizeof(victim_entry->data)); + victim_entry->data.refcount = 0; + victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK; + + PrivateRefCountOverflowed++; + } +} + +/* + * Create a new refcount entry for the given buffer. + */ +static PrivateRefCountEntry * +NewPrivateRefCountEntry(Buffer buffer) +{ + PrivateRefCountEntry *res; + + /* only allowed to be called when a reservation has been made */ + Assert(ReservedRefCountSlot != -1); + + /* use up the reserved entry */ + res = &PrivateRefCountArray[ReservedRefCountSlot]; + + /* and fill it */ + PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; + res->buffer = buffer; + res->data.refcount = 0; + res->data.lockmode = BUFFER_LOCK_UNLOCK; + + /* update cache for the next lookup */ + PrivateRefCountEntryLast = ReservedRefCountSlot; + + ReservedRefCountSlot = -1; + + return res; +} + +/* + * Slow-path for GetSharedBufferEntry(). + */ +static pg_noinline PrivateRefCountEntry * +GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) +{ + PrivateRefCountEntry *res; + int match = -1; + int i; + + /* + * First search for references in the array. + */ + for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + { + if (PrivateRefCountArrayKeys[i] == buffer) + { + match = i; + } + } + + if (likely(match != -1)) + { + PrivateRefCountEntryLast = match; + return &PrivateRefCountArray[match]; + } + + /* + * Only look up the buffer in the hashtable if we've previously overflowed. + */ + if (PrivateRefCountOverflowed == 0) + return NULL; + + res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL); + + if (res == NULL) + return NULL; + else if (!do_move) + { + return res; + } + else + { + /* move buffer from hashtable into the free array slot */ + bool found; + PrivateRefCountEntry *free; + + ReservePrivateRefCountEntry(); + + Assert(ReservedRefCountSlot != -1); + free = &PrivateRefCountArray[ReservedRefCountSlot]; + Assert(free->buffer == InvalidBuffer); + + free->buffer = buffer; + free->data = res->data; + PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; + PrivateRefCountEntryLast = ReservedRefCountSlot; + + ReservedRefCountSlot = -1; + + hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); + Assert(found); + Assert(PrivateRefCountOverflowed > 0); + PrivateRefCountOverflowed--; + + return free; + } +} + +/* + * Return the PrivateRefCountEntry for the passed buffer. + * Returns NULL if the buffer is not currently pinned. + */ +PrivateRefCountEntry * +GetSharedBufferEntry(Buffer buffer) +{ + Assert(BufferIsValid(buffer)); + Assert(!BufferIsLocal(buffer)); + + /* Fast path: check one-entry cache */ + if (likely(PrivateRefCountEntryLast != -1) && + likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) + { + return &PrivateRefCountArray[PrivateRefCountEntryLast]; + } + + return GetPrivateRefCountEntrySlow(buffer, false); +} + +/* + * Increment the private refcount for a shared buffer. + * Creates a new entry if one doesn't exist. + * Returns the entry pointer. + */ +PrivateRefCountEntry * +SharedBufferRef(Buffer buffer) +{ + PrivateRefCountEntry *ref; + + Assert(BufferIsValid(buffer)); + Assert(!BufferIsLocal(buffer)); + + /* Check cache first, then slow path */ + if (likely(PrivateRefCountEntryLast != -1) && + likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) + { + ref = &PrivateRefCountArray[PrivateRefCountEntryLast]; + } + else + { + ref = GetPrivateRefCountEntrySlow(buffer, true); + } + + if (ref == NULL) + { + /* New pin - create entry */ + ReservePrivateRefCountEntry(); + ref = NewPrivateRefCountEntry(buffer); + } + + ref->data.refcount++; + + return ref; +} + +/* + * Increment the private refcount for an existing entry. + * Use when you already have the entry from a previous lookup. + */ +void +SharedBufferRefExisting(PrivateRefCountEntry *ref) +{ + Assert(ref != NULL); + Assert(ref->data.refcount > 0); + ref->data.refcount++; +} + +/* + * Decrement the private refcount for a buffer. + * If the refcount reaches zero, removes the entry and returns true. + * Returns false if the buffer still has references. + */ +bool +SharedBufferUnref(PrivateRefCountEntry *ref) +{ + Assert(ref != NULL); + Assert(ref->data.refcount > 0); + + ref->data.refcount--; + + if (ref->data.refcount == 0) + { + /* No more references - clean up the entry */ + Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK); + + if (ref >= &PrivateRefCountArray[0] && + ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) + { + ref->buffer = InvalidBuffer; + PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer; + ReservedRefCountSlot = ref - PrivateRefCountArray; + } + else + { + bool found; + Buffer buffer = ref->buffer; + + hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); + Assert(found); + Assert(PrivateRefCountOverflowed > 0); + PrivateRefCountOverflowed--; + } + + return true; + } + + return false; +} + +/* + * Accessors for refcount entry fields. + */ +int32 +SharedBufferRefCount(PrivateRefCountEntry *ref) +{ + return ref->data.refcount; +} + +BufferLockMode +SharedBufferGetLockMode(PrivateRefCountEntry *ref) +{ + return ref->data.lockmode; +} + +void +SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode) +{ + ref->data.lockmode = mode; +} + +Buffer +SharedBufferGetBuffer(PrivateRefCountEntry *ref) +{ + return ref->buffer; +} + +/* + * Check for buffer refcount leaks. + */ +void +CheckPrivateRefCountLeaks(void) +{ +#ifdef USE_ASSERT_CHECKING + int RefCountErrors = 0; + PrivateRefCountEntry *res; + int i; + char *s; + + /* check the array */ + for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + { + if (PrivateRefCountArrayKeys[i] != InvalidBuffer) + { + res = &PrivateRefCountArray[i]; + + s = DebugPrintBufferRefcount(res->buffer); + elog(WARNING, "buffer refcount leak: %s", s); + pfree(s); + + RefCountErrors++; + } + } + + /* if necessary search the hash */ + if (PrivateRefCountOverflowed) + { + HASH_SEQ_STATUS hstat; + + hash_seq_init(&hstat, PrivateRefCountHash); + while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) + { + s = DebugPrintBufferRefcount(res->buffer); + elog(WARNING, "buffer refcount leak: %s", s); + pfree(s); + RefCountErrors++; + } + } + + Assert(RefCountErrors == 0); +#endif +} + +/* + * Initialize an iterator for walking all private refcount entries. + */ +PrivateRefCountIterator * +InitPrivateRefCountIterator(void) +{ + PrivateRefCountIterator *iter = palloc(sizeof(PrivateRefCountIterator)); + + iter->array_index = 0; + iter->in_hash = false; + iter->hash_status = NULL; + return iter; +} + +/* + * Get the next private refcount entry. + * Returns NULL when iteration is complete. + */ +PrivateRefCountEntry * +GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) +{ + /* First iterate through the array */ + while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES) + { + int idx = iter->array_index++; + + if (PrivateRefCountArrayKeys[idx] != InvalidBuffer) + return &PrivateRefCountArray[idx]; + } + + /* Then iterate through the hash if there are overflowed entries */ + if (!iter->in_hash) + { + iter->in_hash = true; + if (PrivateRefCountOverflowed > 0) + { + iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS)); + hash_seq_init(iter->hash_status, PrivateRefCountHash); + } + } + + if (iter->hash_status != NULL) + { + PrivateRefCountEntry *res; + + res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status); + if (res != NULL) + return res; + + pfree(iter->hash_status); + iter->hash_status = NULL; + } + + return NULL; +} + +/* + * Free an iterator from InitPrivateRefCountIterator. + */ +void +FreePrivateRefCountIterator(PrivateRefCountIterator *iter) +{ + if (iter->hash_status != NULL) + { + hash_seq_term(iter->hash_status); + pfree(iter->hash_status); + } + pfree(iter); +} + + +/* + * Return the maximum number of buffers that a backend should try to pin once, + * to avoid exceeding its fair share. This is the highest value that + * GetAdditionalPinLimit() could ever return. Note that it may be zero on a + * system with a very small buffer pool relative to max_connections. + */ + uint32 + GetPinLimit(void) + { + return MaxProportionalPins; + } + + /* + * Return the maximum number of additional buffers that this backend should + * pin if it wants to stay under the per-backend limit, considering the number + * of buffers it has already pinned. Unlike LimitAdditionalPins(), the limit + * return by this function can be zero. + */ + uint32 + GetAdditionalPinLimit(void) + { + uint32 estimated_pins_held; + + /* + * We get the number of "overflowed" pins for free, but don't know the + * number of pins in PrivateRefCountArray. The cost of calculating that + * exactly doesn't seem worth it, so just assume the max. + */ + estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES; + + /* Is this backend already holding more than its fair share? */ + if (estimated_pins_held > MaxProportionalPins) + return 0; + + return MaxProportionalPins - estimated_pins_held; + } + + /* + * Limit the number of pins a batch operation may additionally acquire, to + * avoid running out of pinnable buffers. + * + * One additional pin is always allowed, on the assumption that the operation + * requires at least one to make progress. + */ + void + LimitAdditionalPins(uint32 *additional_pins) + { + uint32 limit; + + if (*additional_pins <= 1) + return; + + limit = GetAdditionalPinLimit(); + limit = Max(limit, 1); + if (limit < *additional_pins) + *additional_pins = limit; + } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 5f3d083e938..aa99e97e286 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -54,6 +54,7 @@ #include "storage/aio.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" +#include "storage/buf_refcount.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/lmgr.h" @@ -93,43 +94,6 @@ */ #define BUF_DROP_FULL_SCAN_THRESHOLD (uint64) (NBuffers / 32) -/* - * This is separated out from PrivateRefCountEntry to allow for copying all - * the data members via struct assignment. - */ -typedef struct PrivateRefCountData -{ - /* - * How many times has the buffer been pinned by this backend. - */ - int32 refcount; - - /* - * Is the buffer locked by this backend? BUFFER_LOCK_UNLOCK indicates that - * the buffer is not locked. - */ - BufferLockMode lockmode; -} PrivateRefCountData; - -typedef struct PrivateRefCountEntry -{ - /* - * Note that this needs to be same as the entry's corresponding - * PrivateRefCountArrayKeys[i], if the entry is stored in the array. We - * store it in both places as this is used for the hashtable key and - * because it is more convenient (passing around a PrivateRefCountEntry - * suffices to identify the buffer) and faster (checking the keys array is - * faster when checking many entries, checking the entry is faster if just - * checking a single entry). - */ - Buffer buffer; - - PrivateRefCountData data; -} PrivateRefCountEntry; - -/* 64 bytes, about the size of a cache line on common systems */ -#define REFCOUNT_ARRAY_ENTRIES 8 - /* * Status of buffers to checkpoint for a particular tablespace, used * internally in BufferSync. @@ -213,55 +177,6 @@ int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER; /* local state for LockBufferForCleanup */ static BufferDesc *PinCountWaitBuf = NULL; -/* - * Backend-Private refcount management: - * - * Each buffer also has a private refcount that keeps track of the number of - * times the buffer is pinned in the current process. This is so that the - * shared refcount needs to be modified only once if a buffer is pinned more - * than once by an individual backend. It's also used to check that no - * buffers are still pinned at the end of transactions and when exiting. We - * also use this mechanism to track whether this backend has a buffer locked, - * and, if so, in what mode. - * - * - * To avoid - as we used to - requiring an array with NBuffers entries to keep - * track of local buffers, we use a small sequentially searched array - * (PrivateRefCountArrayKeys, with the corresponding data stored in - * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to - * keep track of backend local pins. - * - * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all - * refcounts are kept track of in the array; after that, new array entries - * displace old ones into the hash table. That way a frequently used entry - * can't get "stuck" in the hashtable while infrequent ones clog the array. - * - * Note that in most scenarios the number of pinned buffers will not exceed - * REFCOUNT_ARRAY_ENTRIES. - * - * - * To enter a buffer into the refcount tracking mechanism first reserve a free - * entry using ReservePrivateRefCountEntry() and then later, if necessary, - * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing - * memory allocations in NewPrivateRefCountEntry() which can be important - * because in some scenarios it's called with a spinlock held... - */ -static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; -static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; -static HTAB *PrivateRefCountHash = NULL; -static int32 PrivateRefCountOverflowed = 0; -static uint32 PrivateRefCountClock = 0; -static int ReservedRefCountSlot = -1; -static int PrivateRefCountEntryLast = -1; - -static uint32 MaxProportionalPins; - -static void ReservePrivateRefCountEntry(void); -static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); -static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move); -static inline int32 GetPrivateRefCount(Buffer buffer); -static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref); - /* ResourceOwner callbacks to hold in-progress I/Os and buffer pins */ static void ResOwnerReleaseBufferIO(Datum res); static char *ResOwnerPrintBufferIO(Datum res); @@ -286,301 +201,6 @@ const ResourceOwnerDesc buffer_resowner_desc = .DebugPrint = ResOwnerPrintBuffer }; -/* - * Ensure that the PrivateRefCountArray has sufficient space to store one more - * entry. This has to be called before using NewPrivateRefCountEntry() to fill - * a new entry - but it's perfectly fine to not use a reserved entry. - */ -static void -ReservePrivateRefCountEntry(void) -{ - /* Already reserved (or freed), nothing to do */ - if (ReservedRefCountSlot != -1) - return; - - /* - * First search for a free entry the array, that'll be sufficient in the - * majority of cases. - */ - { - int i; - - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] == InvalidBuffer) - { - ReservedRefCountSlot = i; - - /* - * We could return immediately, but iterating till the end of - * the array allows compiler-autovectorization. - */ - } - } - - if (ReservedRefCountSlot != -1) - return; - } - - /* - * No luck. All array entries are full. Move one array entry into the hash - * table. - */ - { - /* - * Move entry from the current clock position in the array into the - * hashtable. Use that slot. - */ - int victim_slot; - PrivateRefCountEntry *victim_entry; - PrivateRefCountEntry *hashent; - bool found; - - /* select victim slot */ - victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES; - victim_entry = &PrivateRefCountArray[victim_slot]; - ReservedRefCountSlot = victim_slot; - - /* Better be used, otherwise we shouldn't get here. */ - Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer); - Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer); - Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); - - /* enter victim array entry into hashtable */ - hashent = hash_search(PrivateRefCountHash, - &PrivateRefCountArrayKeys[victim_slot], - HASH_ENTER, - &found); - Assert(!found); - /* move data from the entry in the array to the hash entry */ - hashent->data = victim_entry->data; - - /* clear the now free array slot */ - PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; - victim_entry->buffer = InvalidBuffer; - - /* clear the whole data member, just for future proofing */ - memset(&victim_entry->data, 0, sizeof(victim_entry->data)); - victim_entry->data.refcount = 0; - victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK; - - PrivateRefCountOverflowed++; - } -} - -/* - * Fill a previously reserved refcount entry. - */ -static PrivateRefCountEntry * -NewPrivateRefCountEntry(Buffer buffer) -{ - PrivateRefCountEntry *res; - - /* only allowed to be called when a reservation has been made */ - Assert(ReservedRefCountSlot != -1); - - /* use up the reserved entry */ - res = &PrivateRefCountArray[ReservedRefCountSlot]; - - /* and fill it */ - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - res->buffer = buffer; - res->data.refcount = 0; - res->data.lockmode = BUFFER_LOCK_UNLOCK; - - /* update cache for the next lookup */ - PrivateRefCountEntryLast = ReservedRefCountSlot; - - ReservedRefCountSlot = -1; - - return res; -} - -/* - * Slow-path for GetPrivateRefCountEntry(). This is big enough to not be worth - * inlining. This particularly seems to be true if the compiler is capable of - * auto-vectorizing the code, as that imposes additional stack-alignment - * requirements etc. - */ -static pg_noinline PrivateRefCountEntry * -GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) -{ - PrivateRefCountEntry *res; - int match = -1; - int i; - - /* - * First search for references in the array, that'll be sufficient in the - * majority of cases. - */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] == buffer) - { - match = i; - /* see ReservePrivateRefCountEntry() for why we don't return */ - } - } - - if (likely(match != -1)) - { - /* update cache for the next lookup */ - PrivateRefCountEntryLast = match; - - return &PrivateRefCountArray[match]; - } - - /* - * By here we know that the buffer, if already pinned, isn't residing in - * the array. - * - * Only look up the buffer in the hashtable if we've previously overflowed - * into it. - */ - if (PrivateRefCountOverflowed == 0) - return NULL; - - res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL); - - if (res == NULL) - return NULL; - else if (!do_move) - { - /* caller doesn't want us to move the hash entry into the array */ - return res; - } - else - { - /* move buffer from hashtable into the free array slot */ - bool found; - PrivateRefCountEntry *free; - - /* Ensure there's a free array slot */ - ReservePrivateRefCountEntry(); - - /* Use up the reserved slot */ - Assert(ReservedRefCountSlot != -1); - free = &PrivateRefCountArray[ReservedRefCountSlot]; - Assert(PrivateRefCountArrayKeys[ReservedRefCountSlot] == free->buffer); - Assert(free->buffer == InvalidBuffer); - - /* and fill it */ - free->buffer = buffer; - free->data = res->data; - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - /* update cache for the next lookup */ - PrivateRefCountEntryLast = match; - - ReservedRefCountSlot = -1; - - - /* delete from hashtable */ - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); - Assert(PrivateRefCountOverflowed > 0); - PrivateRefCountOverflowed--; - - return free; - } -} - -/* - * Return the PrivateRefCount entry for the passed buffer. - * - * Returns NULL if a buffer doesn't have a refcount entry. Otherwise, if - * do_move is true, and the entry resides in the hashtable the entry is - * optimized for frequent access by moving it to the array. - */ -static inline PrivateRefCountEntry * -GetPrivateRefCountEntry(Buffer buffer, bool do_move) -{ - Assert(BufferIsValid(buffer)); - Assert(!BufferIsLocal(buffer)); - - /* - * It's very common to look up the same buffer repeatedly. To make that - * fast, we have a one-entry cache. - * - * In contrast to the loop in GetPrivateRefCountEntrySlow(), here it - * faster to check PrivateRefCountArray[].buffer, as in the case of a hit - * fewer addresses are computed and fewer cachelines are accessed. Whereas - * in GetPrivateRefCountEntrySlow()'s case, checking - * PrivateRefCountArrayKeys saves a lot of memory accesses. - */ - if (likely(PrivateRefCountEntryLast != -1) && - likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) - { - return &PrivateRefCountArray[PrivateRefCountEntryLast]; - } - - /* - * The code for the cached lookup is small enough to be worth inlining - * into the caller. In the miss case however, that empirically doesn't - * seem worth it. - */ - return GetPrivateRefCountEntrySlow(buffer, do_move); -} - -/* - * Returns how many times the passed buffer is pinned by this backend. - * - * Only works for shared memory buffers! - */ -static inline int32 -GetPrivateRefCount(Buffer buffer) -{ - PrivateRefCountEntry *ref; - - Assert(BufferIsValid(buffer)); - Assert(!BufferIsLocal(buffer)); - - /* - * Not moving the entry - that's ok for the current users, but we might - * want to change this one day. - */ - ref = GetPrivateRefCountEntry(buffer, false); - - if (ref == NULL) - return 0; - return ref->data.refcount; -} - -/* - * Release resources used to track the reference count of a buffer which we no - * longer have pinned and don't want to pin again immediately. - */ -static void -ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref) -{ - Assert(ref->data.refcount == 0); - Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK); - - if (ref >= &PrivateRefCountArray[0] && - ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) - { - ref->buffer = InvalidBuffer; - PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer; - - - /* - * Mark the just used entry as reserved - in many scenarios that - * allows us to avoid ever having to search the array/hash for free - * entries. - */ - ReservedRefCountSlot = ref - PrivateRefCountArray; - } - else - { - bool found; - Buffer buffer = ref->buffer; - - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); - Assert(PrivateRefCountOverflowed > 0); - PrivateRefCountOverflowed--; - } -} - /* * BufferIsPinned * True iff the buffer is pinned (also checks for valid buffer number). @@ -596,7 +216,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref) BufferIsLocal(bufnum) ? \ (LocalRefCount[-(bufnum) - 1] > 0) \ : \ - (GetPrivateRefCount(bufnum) > 0) \ + (GetSharedBufferEntry(bufnum) != NULL) \ ) @@ -653,7 +273,6 @@ static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator, RelFileLocator dstlocator, ForkNumber forkNum, bool permanent); static void AtProcExit_Buffers(int code, Datum arg); -static void CheckForBufferLeaks(void); #ifdef USE_ASSERT_CHECKING static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode); #endif @@ -812,7 +431,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN Assert(BufferIsValid(recent_buffer)); ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); InitBufferTag(&tag, &rlocator, forkNum, blockNum); if (BufferIsLocal(recent_buffer)) @@ -2115,7 +1733,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, /* Make sure we will have room to remember the buffer pin */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); /* create a tag so we can lookup the buffer */ InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum); @@ -2327,7 +1944,7 @@ retry: UnlockBufHdr(buf); LWLockRelease(oldPartitionLock); /* safety check: should definitely not be our *own* pin */ - if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0) + if (GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) != NULL) elog(ERROR, "buffer is pinned in InvalidateBuffer"); WaitIO(buf); goto retry; @@ -2380,7 +1997,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr) LWLock *partition_lock; BufferTag tag; - Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1); + Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)) != NULL); /* have buffer pinned, so it's safe to read tag without lock */ tag = buf_hdr->tag; @@ -2461,7 +2078,6 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context) * Ensure, before we pin a victim buffer, that there's a free refcount * entry and resource owner slot for the pin. */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); /* we return here if a prospective victim buffer gets used concurrently */ @@ -2595,64 +2211,6 @@ again: return buf; } -/* - * Return the maximum number of buffers that a backend should try to pin once, - * to avoid exceeding its fair share. This is the highest value that - * GetAdditionalPinLimit() could ever return. Note that it may be zero on a - * system with a very small buffer pool relative to max_connections. - */ -uint32 -GetPinLimit(void) -{ - return MaxProportionalPins; -} - -/* - * Return the maximum number of additional buffers that this backend should - * pin if it wants to stay under the per-backend limit, considering the number - * of buffers it has already pinned. Unlike LimitAdditionalPins(), the limit - * return by this function can be zero. - */ -uint32 -GetAdditionalPinLimit(void) -{ - uint32 estimated_pins_held; - - /* - * We get the number of "overflowed" pins for free, but don't know the - * number of pins in PrivateRefCountArray. The cost of calculating that - * exactly doesn't seem worth it, so just assume the max. - */ - estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES; - - /* Is this backend already holding more than its fair share? */ - if (estimated_pins_held > MaxProportionalPins) - return 0; - - return MaxProportionalPins - estimated_pins_held; -} - -/* - * Limit the number of pins a batch operation may additionally acquire, to - * avoid running out of pinnable buffers. - * - * One additional pin is always allowed, on the assumption that the operation - * requires at least one to make progress. - */ -void -LimitAdditionalPins(uint32 *additional_pins) -{ - uint32 limit; - - if (*additional_pins <= 1) - return; - - limit = GetAdditionalPinLimit(); - limit = Max(limit, 1); - if (limit < *additional_pins) - *additional_pins = limit; -} - /* * Logic shared between ExtendBufferedRelBy(), ExtendBufferedRelTo(). Just to * avoid duplicating the tracing and relpersistence related logic. @@ -2816,7 +2374,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, /* in case we need to pin an existing buffer below */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); InitBufferTag(&tag, &BMR_GET_SMGR(bmr)->smgr_rlocator.locator, fork, first_block + i); @@ -3188,9 +2745,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, PrivateRefCountEntry *ref; Assert(!BufferIsLocal(b)); - Assert(ReservedRefCountSlot != -1); - ref = GetPrivateRefCountEntry(b, true); + ref = GetSharedBufferEntry(b); if (ref == NULL) { @@ -3260,8 +2816,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, */ result = (pg_atomic_read_u64(&buf->state) & BM_VALID) != 0; - Assert(ref->data.refcount > 0); - ref->data.refcount++; + SharedBufferRefExisting(ref); ResourceOwnerRememberBuffer(CurrentResourceOwner, b); } @@ -3299,7 +2854,7 @@ PinBuffer_Locked(BufferDesc *buf) * As explained, We don't expect any preexisting pins. That allows us to * manipulate the PrivateRefCount after releasing the spinlock */ - Assert(GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf), false) == NULL); + Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) == NULL); /* * Since we hold the buffer spinlock, we can update the buffer state and @@ -3376,11 +2931,10 @@ UnpinBufferNoOwner(BufferDesc *buf) Assert(!BufferIsLocal(b)); /* not moving as we're likely deleting it soon anyway */ - ref = GetPrivateRefCountEntry(b, false); + ref = GetSharedBufferEntry(b); Assert(ref != NULL); - Assert(ref->data.refcount > 0); - ref->data.refcount--; - if (ref->data.refcount == 0) + + if (SharedBufferUnref(ref)) { uint64 old_buf_state; @@ -3405,8 +2959,6 @@ UnpinBufferNoOwner(BufferDesc *buf) /* Support LockBufferForCleanup() */ if (old_buf_state & BM_PIN_COUNT_WAITER) WakePinCountWaiter(buf); - - ForgetPrivateRefCountEntry(ref); } } @@ -3417,10 +2969,7 @@ UnpinBufferNoOwner(BufferDesc *buf) inline void TrackNewBufferPin(Buffer buf) { - PrivateRefCountEntry *ref; - - ref = NewPrivateRefCountEntry(buf); - ref->data.refcount++; + SharedBufferRef(buf); ResourceOwnerRememberBuffer(CurrentResourceOwner, buf); @@ -4040,7 +3589,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) BufferTag tag; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); /* @@ -4104,11 +3652,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) void AtEOXact_Buffers(bool isCommit) { - CheckForBufferLeaks(); + CheckPrivateRefCountLeaks(); AtEOXact_LocalBuffers(isCommit); - - Assert(PrivateRefCountOverflowed == 0); } /* @@ -4121,25 +3667,8 @@ AtEOXact_Buffers(bool isCommit) void InitBufferManagerAccess(void) { - HASHCTL hash_ctl; - - /* - * An advisory limit on the number of pins each backend should hold, based - * on shared_buffers and the maximum number of connections possible. - * That's very pessimistic, but outside toy-sized shared_buffers it should - * allow plenty of pins. LimitAdditionalPins() and - * GetAdditionalPinLimit() can be used to check the remaining balance. - */ - MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); - - memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); - memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - - hash_ctl.keysize = sizeof(Buffer); - hash_ctl.entrysize = sizeof(PrivateRefCountEntry); - PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl, - HASH_ELEM | HASH_BLOBS); + InitPrivateRefCount(); /* * AtProcExit_Buffers needs LWLock access, and thereby has to be called at @@ -4158,62 +3687,12 @@ AtProcExit_Buffers(int code, Datum arg) { UnlockBuffers(); - CheckForBufferLeaks(); + CheckPrivateRefCountLeaks(); /* localbuf.c needs a chance too */ AtProcExit_LocalBuffers(); } -/* - * CheckForBufferLeaks - ensure this backend holds no buffer pins - * - * As of PostgreSQL 8.0, buffer pins should get released by the - * ResourceOwner mechanism. This routine is just a debugging - * cross-check that no pins remain. - */ -static void -CheckForBufferLeaks(void) -{ -#ifdef USE_ASSERT_CHECKING - int RefCountErrors = 0; - PrivateRefCountEntry *res; - int i; - char *s; - - /* check the array */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] != InvalidBuffer) - { - res = &PrivateRefCountArray[i]; - - s = DebugPrintBufferRefcount(res->buffer); - elog(WARNING, "buffer refcount leak: %s", s); - pfree(s); - - RefCountErrors++; - } - } - - /* if necessary search the hash */ - if (PrivateRefCountOverflowed) - { - HASH_SEQ_STATUS hstat; - - hash_seq_init(&hstat, PrivateRefCountHash); - while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) - { - s = DebugPrintBufferRefcount(res->buffer); - elog(WARNING, "buffer refcount leak: %s", s); - pfree(s); - RefCountErrors++; - } - } - - Assert(RefCountErrors == 0); -#endif -} - #ifdef USE_ASSERT_CHECKING /* * Check for exclusive-locked catalog buffers. This is the core of @@ -4235,33 +3714,20 @@ CheckForBufferLeaks(void) void AssertBufferLocksPermitCatalogRead(void) { + PrivateRefCountIterator *iter; PrivateRefCountEntry *res; - /* check the array */ - for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + iter = InitPrivateRefCountIterator(); + while ((res = GetNextPrivateRefCountEntry(iter)) != NULL) { - if (PrivateRefCountArrayKeys[i] != InvalidBuffer) - { - res = &PrivateRefCountArray[i]; - - if (res->buffer == InvalidBuffer) - continue; - - AssertNotCatalogBufferLock(res->buffer, res->data.lockmode); - } - } + Buffer buf = SharedBufferGetBuffer(res); - /* if necessary search the hash */ - if (PrivateRefCountOverflowed) - { - HASH_SEQ_STATUS hstat; + if (buf == InvalidBuffer) + continue; - hash_seq_init(&hstat, PrivateRefCountHash); - while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) - { - AssertNotCatalogBufferLock(res->buffer, res->data.lockmode); - } + AssertNotCatalogBufferLock(buf, SharedBufferGetLockMode(res)); } + FreePrivateRefCountIterator(iter); } static void @@ -4315,8 +3781,10 @@ DebugPrintBufferRefcount(Buffer buffer) } else { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + buf = GetBufferDescriptor(buffer - 1); - loccount = GetPrivateRefCount(buffer); + loccount = ref ? SharedBufferRefCount(ref) : 0; backend = INVALID_PROC_NUMBER; } @@ -5102,7 +4570,6 @@ FlushRelationBuffers(Relation rel) error_context_stack = &errcallback; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); /* @@ -5138,7 +4605,6 @@ FlushRelationBuffers(Relation rel) continue; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); buf_state = LockBufHdr(bufHdr); @@ -5233,7 +4699,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels) continue; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); buf_state = LockBufHdr(bufHdr); @@ -5459,7 +4924,6 @@ FlushDatabaseBuffers(Oid dbid) continue; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); buf_state = LockBufHdr(bufHdr); @@ -5534,17 +4998,18 @@ UnlockReleaseBuffer(Buffer buffer) void IncrBufferRefCount(Buffer buffer) { - Assert(BufferIsPinned(buffer)); ResourceOwnerEnlarge(CurrentResourceOwner); if (BufferIsLocal(buffer)) + { + Assert(LocalRefCount[-buffer - 1] > 0); LocalRefCount[-buffer - 1]++; + } else { - PrivateRefCountEntry *ref; + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); - ref = GetPrivateRefCountEntry(buffer, true); Assert(ref != NULL); - ref->data.refcount++; + SharedBufferRefExisting(ref); } ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer); } @@ -5580,7 +5045,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) bufHdr = GetBufferDescriptor(buffer - 1); - Assert(GetPrivateRefCount(buffer) > 0); + Assert(GetSharedBufferEntry(buffer) != NULL); /* here, either share or exclusive lock is OK */ Assert(BufferIsLockedByMe(buffer)); @@ -5763,12 +5228,12 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) * Get reference to the refcount entry before we hold the lock, it seems * better to do before holding the lock. */ - entry = GetPrivateRefCountEntry(buffer, true); + entry = GetSharedBufferEntry(buffer); /* * We better not already hold a lock on the buffer. */ - Assert(entry->data.lockmode == BUFFER_LOCK_UNLOCK); + Assert(SharedBufferGetLockMode(entry) == BUFFER_LOCK_UNLOCK); /* * Lock out cancel/die interrupts until we exit the code section protected @@ -5857,7 +5322,7 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) } /* Remember that we now hold this lock */ - entry->data.lockmode = mode; + SharedBufferSetLockMode(entry, mode); /* * Fix the process wait semaphore's count for any absorbed wakeups. @@ -5908,7 +5373,7 @@ BufferLockUnlock(Buffer buffer, BufferDesc *buf_hdr) static bool BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) { - PrivateRefCountEntry *entry = GetPrivateRefCountEntry(buffer, true); + PrivateRefCountEntry *entry = GetSharedBufferEntry(buffer); bool mustwait; /* @@ -5916,7 +5381,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) * already has locked, return false, independent of the existing and * desired lock level. */ - if (entry->data.lockmode != BUFFER_LOCK_UNLOCK) + if (SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK) return false; /* @@ -5936,7 +5401,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) } else { - entry->data.lockmode = mode; + SharedBufferSetLockMode(entry, mode); } return !mustwait; @@ -6146,11 +5611,11 @@ BufferLockDisownInternal(Buffer buffer, BufferDesc *buf_hdr) BufferLockMode mode; PrivateRefCountEntry *ref; - ref = GetPrivateRefCountEntry(buffer, false); + ref = GetSharedBufferEntry(buffer); if (ref == NULL) elog(ERROR, "lock %d is not held", buffer); - mode = ref->data.lockmode; - ref->data.lockmode = BUFFER_LOCK_UNLOCK; + mode = SharedBufferGetLockMode(ref); + SharedBufferSetLockMode(ref, BUFFER_LOCK_UNLOCK); return mode; } @@ -6384,12 +5849,12 @@ static bool BufferLockHeldByMeInMode(BufferDesc *buf_hdr, BufferLockMode mode) { PrivateRefCountEntry *entry = - GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false); + GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)); if (!entry) return false; else - return entry->data.lockmode == mode; + return SharedBufferGetLockMode(entry) == mode; } /* @@ -6402,12 +5867,12 @@ static bool BufferLockHeldByMe(BufferDesc *buf_hdr) { PrivateRefCountEntry *entry = - GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false); + GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)); if (!entry) return false; else - return entry->data.lockmode != BUFFER_LOCK_UNLOCK; + return SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK; } /* @@ -6503,9 +5968,13 @@ CheckBufferIsPinnedOnce(Buffer buffer) } else { - if (GetPrivateRefCount(buffer) != 1) - elog(ERROR, "incorrect local pin count: %d", - GetPrivateRefCount(buffer)); + { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + int32 refcount = ref ? SharedBufferRefCount(ref) : 0; + + if (refcount != 1) + elog(ERROR, "incorrect local pin count: %d", refcount); + } } } @@ -6686,7 +6155,7 @@ HoldingBufferPinThatDelaysRecovery(void) if (bufid < 0) return false; - if (GetPrivateRefCount(bufid + 1) > 0) + if (GetSharedBufferEntry(bufid + 1) != NULL) return true; return false; @@ -6721,8 +6190,12 @@ ConditionalLockBufferForCleanup(Buffer buffer) } /* There should be exactly one local pin */ - refcount = GetPrivateRefCount(buffer); - Assert(refcount); + { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + + refcount = ref ? SharedBufferRefCount(ref) : 0; + Assert(refcount); + } if (refcount != 1) return false; @@ -6776,8 +6249,12 @@ IsBufferCleanupOK(Buffer buffer) } /* There should be exactly one local pin */ - if (GetPrivateRefCount(buffer) != 1) - return false; + { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + + if (!ref || SharedBufferRefCount(ref) != 1) + return false; + } bufHdr = GetBufferDescriptor(buffer - 1); @@ -7447,7 +6924,7 @@ ResOwnerReleaseBuffer(Datum res) { PrivateRefCountEntry *ref; - ref = GetPrivateRefCountEntry(buffer, false); + ref = GetSharedBufferEntry(buffer); /* not having a private refcount would imply resowner corruption */ Assert(ref != NULL); @@ -7456,7 +6933,7 @@ ResOwnerReleaseBuffer(Datum res) * If the buffer was locked at the time of the resowner release, * release the lock now. This should only happen after errors. */ - if (ref->data.lockmode != BUFFER_LOCK_UNLOCK) + if (SharedBufferGetLockMode(ref) != BUFFER_LOCK_UNLOCK) { BufferDesc *buf = GetBufferDescriptor(buffer - 1); @@ -7549,7 +7026,6 @@ EvictUnpinnedBuffer(Buffer buf, bool *buffer_flushed) /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); desc = GetBufferDescriptor(buf - 1); LockBufHdr(desc); @@ -7590,7 +7066,6 @@ EvictAllUnpinnedBuffers(int32 *buffers_evicted, int32 *buffers_flushed, continue; ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); LockBufHdr(desc); @@ -7644,7 +7119,6 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted, /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); buf_state = LockBufHdr(desc); @@ -7736,7 +7210,6 @@ MarkDirtyUnpinnedBuffer(Buffer buf, bool *buffer_already_dirty) /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); desc = GetBufferDescriptor(buf - 1); LockBufHdr(desc); @@ -7789,7 +7262,6 @@ MarkDirtyRelUnpinnedBuffers(Relation rel, /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); buf_state = LockBufHdr(desc); @@ -7841,7 +7313,6 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied, continue; ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); LockBufHdr(desc); diff --git a/src/include/storage/buf_refcount.h b/src/include/storage/buf_refcount.h new file mode 100644 index 00000000000..842760ad2ee --- /dev/null +++ b/src/include/storage/buf_refcount.h @@ -0,0 +1,58 @@ +/*------------------------------------------------------------------------- + * + * buf_refcount.h + * Backend-private buffer refcount tracking + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/buf_refcount.h + * + *------------------------------------------------------------------------- + */ +#ifndef BUF_REFCOUNT_H +#define BUF_REFCOUNT_H + +#include "storage/buf.h" +#include "storage/bufmgr.h" + +/* Opaque handle to a private refcount entry */ +typedef struct PrivateRefCountEntry PrivateRefCountEntry; + +/* Initialization */ +extern void InitPrivateRefCount(void); + +/* Pure lookup */ +extern PrivateRefCountEntry *GetSharedBufferEntry(Buffer buffer); + +/* Reference counting - complex operations */ +extern PrivateRefCountEntry *SharedBufferRef(Buffer buffer); +extern void SharedBufferRefExisting(PrivateRefCountEntry *ref); +extern bool SharedBufferUnref(PrivateRefCountEntry *ref); + +/* Accessors */ +extern int32 SharedBufferRefCount(PrivateRefCountEntry *ref); +extern BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref); +extern void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode); +extern Buffer SharedBufferGetBuffer(PrivateRefCountEntry *ref); + +/* Pin limiting */ +extern uint32 GetPinLimit(void); +extern uint32 GetAdditionalPinLimit(void); +extern void LimitAdditionalPins(uint32 *additional_pins); + +/* Leak checking */ +extern void CheckPrivateRefCountLeaks(void); + +/* + * Iterator for walking all private refcount entries. + * Used by assertion checking code in bufmgr.c. + */ +typedef struct PrivateRefCountIterator PrivateRefCountIterator; + +extern PrivateRefCountIterator *InitPrivateRefCountIterator(void); +extern PrivateRefCountEntry *GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter); +extern void FreePrivateRefCountIterator(PrivateRefCountIterator *iter); + + +#endif /* BUF_REFCOUNT_H */ -- 2.53.0 [application/octet-stream] v1-0001-Benchmark-buffer-pinning.patch (26.6K, 6-v1-0001-Benchmark-buffer-pinning.patch) download | inline diff: From 272fa376cf33c5e3d8fefc3b39678d01925cba1f Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Wed, 4 Mar 2026 15:13:53 +0000 Subject: [PATCH] Benchmark buffer pinning Introduces a benchmark facility that will be used to establish a baseline and evaluate each of the subsequent patches. It includes a test module, and a convenience python script to run and plot the results. --- .gitignore | 3 + src/test/modules/test_buffer_pin/Makefile | 18 + src/test/modules/test_buffer_pin/benchmark.py | 190 ++++++++ .../modules/test_buffer_pin/requirements.txt | 9 + .../test_buffer_pin/test_buffer_pin--1.0.sql | 88 ++++ .../modules/test_buffer_pin/test_buffer_pin.c | 421 ++++++++++++++++++ .../test_buffer_pin/test_buffer_pin.control | 4 + 7 files changed, 733 insertions(+) create mode 100644 src/test/modules/test_buffer_pin/Makefile create mode 100755 src/test/modules/test_buffer_pin/benchmark.py create mode 100644 src/test/modules/test_buffer_pin/requirements.txt create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.c create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.control diff --git a/.gitignore b/.gitignore index 4e911395fe3..bc7a0314380 100644 --- a/.gitignore +++ b/.gitignore @@ -43,3 +43,6 @@ lib*.pc /Release/ /tmp_install/ /portlock/ + +# hidden files +.* \ No newline at end of file diff --git a/src/test/modules/test_buffer_pin/Makefile b/src/test/modules/test_buffer_pin/Makefile new file mode 100644 index 00000000000..2707f84c5b0 --- /dev/null +++ b/src/test/modules/test_buffer_pin/Makefile @@ -0,0 +1,18 @@ +MODULE_big = test_buffer_pin +OBJS = test_buffer_pin.o + +EXTENSION = test_buffer_pin +DATA = test_buffer_pin--1.0.sql + +PGFILEDESC = "test_buffer_pin - buffer pinning benchmarks" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/test_buffer_pin +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/src/test/modules/test_buffer_pin/benchmark.py b/src/test/modules/test_buffer_pin/benchmark.py new file mode 100755 index 00000000000..0cf795a50c1 --- /dev/null +++ b/src/test/modules/test_buffer_pin/benchmark.py @@ -0,0 +1,190 @@ +#!/usr/bin/env python3 +""" +Buffer pinning benchmark - runs SQL benchmarks and saves results. + +Usage: + python3 benchmark.py [options] + +Examples: + python3 benchmark.py --port 5434 --name pg18-baseline --title "PostgreSQL 18 Baseline" + python3 benchmark.py --port 5434 --name patch-0005 --title "With Patch 0005" +""" + +import argparse +import numpy as np +import pandas as pd +from sqlalchemy import create_engine, text +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from pathlib import Path + + +def geomspace_int(start, stop, num): + """Generate geometrically spaced integers, pushing duplicates forward.""" + raw = np.geomspace(start, stop, num) + result = [] + for v in raw: + candidate = max(int(round(v)), result[-1] + 1 if result else 1) + result.append(candidate) + return result + + +def ensure_test_table(conn, table_name, min_blocks): + """Create a table with enough blocks for the benchmark.""" + rows_needed = min_blocks * 226 + 10000 + print(f"Creating table {table_name} with ~{min_blocks} blocks...") + conn.execute(text(f"DROP TABLE IF EXISTS {table_name}")) + conn.execute(text(f"CREATE UNLOGGED TABLE {table_name} AS SELECT generate_series(1, {rows_needed}) AS id")) + result = conn.execute(text(f"SELECT pg_relation_size('{table_name}') / 8192 as blocks")).fetchone() + assert result[0] >= min_blocks, f"Table {table_name} has {result[0]} blocks, expected at least {min_blocks}" + +def run_benchmark(conn, func_template, buffer_counts): + """Run a benchmark function across all buffer counts and patterns in one query.""" + array_str = str(buffer_counts) + sql = text(f""" + SELECT + CASE WHEN r.random THEN 'random' ELSE 'sequential' END as pattern, + b.num_buffers, + percentile_cont(0.5) WITHIN GROUP (ORDER BY bench.per_op_ns) as median_ns + FROM + unnest(ARRAY[false, true]) AS r(random), + unnest(ARRAY[{array_str}]) AS b(num_buffers), + LATERAL {func_template} AS bench + WHERE bench.iteration > 0 + GROUP BY r.random, b.num_buffers + ORDER BY r.random, b.num_buffers + """) + return [dict(row._mapping) for row in conn.execute(sql)] + + +def save_results(results_dir, name, results_dict, title): + """Save benchmark results to CSV and generate SVG plot.""" + results_dir = Path(results_dir) + results_dir.mkdir(parents=True, exist_ok=True) + + all_data = [] + for bench_name, data in results_dict.items(): + df = pd.DataFrame(data) + df['benchmark'] = bench_name + all_data.append(df) + + combined_df = pd.concat(all_data, ignore_index=True) + + csv_path = results_dir / f"{name}.csv" + combined_df.to_csv(csv_path, index=False) + print(f"Saved CSV: {csv_path}") + return combined_df + + +def plot_results(results_dir, name, results_dict, title): + """Generate SVG plot from benchmark results.""" + results_dir = Path(results_dir) + + fig, ax = plt.subplots(figsize=(10, 6)) + + benchmarks = [ + ('read', 'blue', 'o', 'ReadBuffer/ReleaseBuffer'), + ('pinning', 'green', 's', 'IncrBufferRefCount/ReleaseBuffer'), + ('locking', 'purple', 'd', 'LockBuffer/UnlockBuffer'), + ('resowner', 'red', '^', 'ResourceOwner Remember/Forget'), + ] + + for bench_name, color, marker, label in benchmarks: + if bench_name not in results_dict: + continue + df = pd.DataFrame(results_dict[bench_name]) + + seq_data = df[df['pattern'] == 'sequential'] + ax.plot(seq_data['num_buffers'], seq_data['median_ns'], + color=color, linestyle='-', marker=marker, + linewidth=2, markersize=6, label=f'{label} (seq)') + + rand_data = df[df['pattern'] == 'random'] + ax.plot(rand_data['num_buffers'], rand_data['median_ns'], + color=color, linestyle='--', marker=marker, + linewidth=2, markersize=6, label=f'{label} (rand)') + + ax.set_xlabel('Number of Buffers (sliding window)') + ax.set_ylabel('Time per operation (ns)') + ax.set_title(title) + ax.set_ylim(0, None) + ax.grid(True, alpha=0.3) + ax.set_xscale('log', base=2) + ax.legend(fontsize=8, loc='upper left') + + plt.tight_layout() + svg_path = results_dir / f"{name}.svg" + plt.savefig(svg_path, format='svg') + plt.close() + print(f"Saved SVG: {svg_path}") + + +def main(): + parser = argparse.ArgumentParser(description='Buffer pinning benchmark') + parser.add_argument('--iterations', '-n', type=int, default=10000, + help='Number of operations per sample (default: 10000)') + parser.add_argument('--samples', type=int, default=50, + help='Number of samples per data point (default: 50)') + parser.add_argument('--port', type=int, default=5432, + help='PostgreSQL port (default: 5432)') + parser.add_argument('--points', type=int, default=50, + help='Number of data points (default: 200)') + parser.add_argument('--max-dist', type=int, default=500, + help='Maximum buffer count to test (default: 512)') + parser.add_argument('--name', default='benchmark', + help='Name for output files (default: benchmark)') + parser.add_argument('--title', default='PostgreSQL Buffer Pinning Performance', + help='Title for the plot') + parser.add_argument('--results-dir', default=None, + help='Directory for results (default: ./results)') + parser.add_argument('--table', default='bench_large', + help='Table name to use for benchmarks (default: bench_large)') + args = parser.parse_args() + + if args.results_dir is None: + args.results_dir = Path(__file__).parent / 'results' + + engine = create_engine(f'postgresql://localhost:{args.port}/postgres') + buffer_counts = geomspace_int(1, args.max_dist, args.points) + + min_blocks_needed = args.max_dist + 100 + + with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn: + conn.execute(text("CREATE EXTENSION IF NOT EXISTS test_buffer_pin")) + ensure_test_table(conn, args.table, min_blocks_needed) + + + n, s = args.iterations, args.samples + print(f"\nRunning benchmarks: {n} iterations, {s} samples, max {args.max_dist} buffers") + + results = {} + + print(" Running: bench_pinning...") + results['pinning'] = run_benchmark(conn, + f"bench_pinning('{args.table}', b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + print(" Running: bench_locking...") + results['locking'] = run_benchmark(conn, + f"bench_locking('{args.table}', b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + print(" Running: bench_prefetch_pipeline...") + results['read'] = run_benchmark(conn, + f"bench_prefetch_pipeline('{args.table}', b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + print(" Running: bench_resowner...") + results['resowner'] = run_benchmark(conn, + f"bench_resowner(b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + save_results(args.results_dir, args.name, results, args.title) + plot_results(args.results_dir, args.name, results, args.title) + + print(f"\nDone! Results saved to {args.results_dir}/{args.name}.*") + + +if __name__ == '__main__': + main() diff --git a/src/test/modules/test_buffer_pin/requirements.txt b/src/test/modules/test_buffer_pin/requirements.txt new file mode 100644 index 00000000000..e73ab569275 --- /dev/null +++ b/src/test/modules/test_buffer_pin/requirements.txt @@ -0,0 +1,9 @@ +# Requirements for benchmark.py +# pip install -r requirements.txt +# not enforcing versions, as it might simply +# work with your installed versions +numpy # tested with 1.24+ +pandas # tested with 2.0+ +sqlalchemy # tested with 2.0+ +matplotlib # tested with 3.7+ +psycopg2-binary # tested with 2.9+ diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql new file mode 100644 index 00000000000..2fb5577fac2 --- /dev/null +++ b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql @@ -0,0 +1,88 @@ +/* test_buffer_pin--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION test_buffer_pin" to load this file. \quit + +CREATE FUNCTION bench_prefetch_pipeline( + relname text, + prefetch_dist int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_prefetch_pipeline' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_prefetch_pipeline IS +'Benchmark buffer pinning with sliding window prefetch simulation. +Arguments: + relname - name of the relation to use + prefetch_dist - number of buffers to keep pinned (sliding window size) + num_ops - number of pin/unpin operations to perform + iterations - number of times to repeat the benchmark + random_access - if true, access blocks randomly; if false, sequentially +'; + +CREATE FUNCTION bench_pinning( + relname text, + num_buffers int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_pinning' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_pinning IS +'Benchmark pure pin/unpin operations without buffer lookup. +Uses IncrBufferRefCount/ReleaseBuffer on pre-pinned buffers to isolate +refcount tracking overhead from buffer table lookups and I/O. +Arguments: + relname - name of the relation to use for pinning buffers + num_buffers - number of buffers to keep as base pins + num_ops - number of pin/unpin pairs to perform + iterations - number of times to repeat the benchmark + random_access - if true, access buffers randomly; if false, sequentially'; + +CREATE FUNCTION bench_locking( + relname text, + num_buffers int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_locking' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_locking IS +'Benchmark buffer lock/unlock operations. +Pre-pins all blocks then times LockBuffer/UnlockBuffer cycles over +different buffers to measure locking overhead separately from pinning. +Arguments: + relname - name of the relation to use + num_buffers - sliding window size for locks + num_ops - number of lock/unlock pairs to perform + iterations - number of times to repeat the benchmark + random_access - if true, access buffers randomly; if false, sequentially'; + +CREATE FUNCTION bench_resowner( + num_buffers int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_resowner' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_resowner IS +'Benchmark ResourceOwner remember/forget operations only. +Uses fake resource values - no actual resources are tracked. +Arguments: + num_buffers - number of fake resources to track + num_ops - number of remember/forget pairs to perform + iterations - number of times to repeat the benchmark + random_access - if true, access resources randomly; if false, sequentially'; diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.c b/src/test/modules/test_buffer_pin/test_buffer_pin.c new file mode 100644 index 00000000000..d5e59974dcb --- /dev/null +++ b/src/test/modules/test_buffer_pin/test_buffer_pin.c @@ -0,0 +1,421 @@ +/* + * test_buffer_pin.c - Buffer pinning benchmark + */ + +#include "postgres.h" + +#include "access/heapam.h" +#include "catalog/namespace.h" +#include "common/pg_prng.h" +#include "fmgr.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "portability/instr_time.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "utils/builtins.h" +#include "utils/rel.h" +#include "utils/resowner.h" +#include "utils/varlena.h" + +PG_MODULE_MAGIC; + +PG_FUNCTION_INFO_V1(bench_prefetch_pipeline); +PG_FUNCTION_INFO_V1(bench_pinning); +PG_FUNCTION_INFO_V1(bench_locking); +PG_FUNCTION_INFO_V1(bench_resowner); + +/* Custom ResourceOwnerDesc for benchmark - does nothing on release */ +static void bench_release_resource(Datum res) { /* no-op */ } + +static const ResourceOwnerDesc bench_resowner_desc = { + .name = "BenchmarkResource", + .release_phase = RESOURCE_RELEASE_AFTER_LOCKS, + .release_priority = RELEASE_PRIO_FIRST, + .ReleaseResource = bench_release_resource, + .DebugPrint = NULL, +}; + +/* + * Generate an access sequence for benchmarking. + * If random_access is true, uses Fisher-Yates shuffle. + * Otherwise, generates sequential pattern modulo num_items. + */ +static void +generate_access_sequence(int *sequence, int num_operations, int num_items, bool random_access) +{ + for (int i = 0; i < num_operations; i++) + sequence[i] = i % num_items; + + if (random_access) + { + for (int i = num_operations - 1; i > 0; i--) + { + int j = pg_prng_uint64_range(&pg_global_prng_state, 0, i); + int tmp = sequence[i]; + sequence[i] = sequence[j]; + sequence[j] = tmp; + } + } +} + + +/* + * bench_pinning - benchmark pure pin/unpin operations + * + * Warms up the cache by reading all blocks in the relation. + * Precompute a block sequence in which the buffers will be read. + * Then scan times repeatedly a scan following the block sequence. + * with a fixed distance of `num_buffers` (plus ramp up and ramp down) +*/ +Datum +bench_prefetch_pipeline(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + text *relname_text = PG_GETARG_TEXT_PP(0); + int num_buffers = PG_GETARG_INT32(1); + int num_operations = PG_GETARG_INT32(2); + int iterations = PG_GETARG_INT32(3); + bool random_access = PG_GETARG_BOOL(4); + + Oid relid; + Relation rel; + BlockNumber nblocks; + Buffer *pipeline; + int *block_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false); + rel = relation_open(relid, AccessShareLock); + nblocks = RelationGetNumberOfBlocks(rel); + + if (nblocks == 0) + ereport(ERROR, (errmsg("relation has no blocks"))); + + pipeline = palloc0(num_buffers * sizeof(Buffer)); + block_sequence = palloc(num_operations * sizeof(int)); + + /* Warm up */ + for (int i = 0; i < num_operations && i < (int) nblocks; i++) + { + Buffer buf = ReadBuffer(rel, i); + ReleaseBuffer(buf); + } + generate_access_sequence(block_sequence, num_operations, nblocks, random_access); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]); + if (i < num_operations) + pipeline[i % num_buffers] = ReadBuffer(rel, block_sequence[i]); + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + pfree(pipeline); + pfree(block_sequence); + relation_close(rel, AccessShareLock); + + return (Datum) 0; +} + +/* + * bench_pinning - benchmark pure pin/unpin operations + * + * Pre-reads buffers into cache, then times IncrBufferRefCount/ReleaseBuffer + * cycles in a pipelined fashion. This is intended to separate the + * pin count dependent part of the ReadBuffer/ReleaseBuffer operations + */ +Datum +bench_pinning(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + text *relname_text = PG_GETARG_TEXT_PP(0); + int num_buffers = PG_GETARG_INT32(1); + int num_operations = PG_GETARG_INT32(2); + int iterations = PG_GETARG_INT32(3); + bool random_access = PG_GETARG_BOOL(4); + + Oid relid; + Relation rel; + BlockNumber nblocks; + Buffer *base_buffers; + Buffer *pipeline; + int *access_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false); + rel = relation_open(relid, AccessShareLock); + nblocks = RelationGetNumberOfBlocks(rel); + + if ((BlockNumber) num_buffers > nblocks) + ereport(ERROR, (errmsg("not enough blocks in relation"))); + + base_buffers = palloc(num_buffers * sizeof(Buffer)); + pipeline = palloc(num_buffers * sizeof(Buffer)); + access_sequence = palloc(num_operations * sizeof(int)); + + generate_access_sequence(access_sequence, num_operations, num_buffers, random_access); + + /* Pin the buffers as base pins (keeps them in cache) */ + for (int i = 0; i < num_buffers; i++) + base_buffers[i] = ReadBuffer(rel, i); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]); + if (i < num_operations) + { + Buffer buf = base_buffers[access_sequence[i]]; + IncrBufferRefCount(buf); + pipeline[i % num_buffers] = buf; + } + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* Release base pins */ + for (int i = 0; i < num_buffers; i++) + ReleaseBuffer(base_buffers[i]); + + pfree(access_sequence); + pfree(pipeline); + pfree(base_buffers); + relation_close(rel, AccessShareLock); + + return (Datum) 0; +} + +/* + * bench_locking - benchmark buffer lock/unlock operations + * + * Pre-pins all blocks to keep them in cache, then times LockBuffer/UnlockBuffer + * cycles in a pipelined fashion over different buffers. This measures locking + * overhead separately from pinning, while accessing many different buffers. + */ +Datum +bench_locking(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + text *relname_text = PG_GETARG_TEXT_PP(0); + int num_buffers = PG_GETARG_INT32(1); + int num_operations = PG_GETARG_INT32(2); + int iterations = PG_GETARG_INT32(3); + bool random_access = PG_GETARG_BOOL(4); + + Oid relid; + Relation rel; + BlockNumber nblocks; + Buffer *base_buffers; + Buffer *pipeline; + int *block_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false); + rel = relation_open(relid, AccessShareLock); + nblocks = RelationGetNumberOfBlocks(rel); + + if (nblocks == 0) + ereport(ERROR, (errmsg("relation has no blocks"))); + + base_buffers = palloc(nblocks * sizeof(Buffer)); + pipeline = palloc(num_buffers * sizeof(Buffer)); + block_sequence = palloc(num_operations * sizeof(int)); + + /* Generate access pattern over all blocks */ + generate_access_sequence(block_sequence, num_operations, nblocks, random_access); + + /* Pin ALL blocks as base pins to keep them in cache */ + for (BlockNumber i = 0; i < nblocks; i++) + base_buffers[i] = ReadBuffer(rel, i); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + LockBuffer(pipeline[(i - num_buffers) % num_buffers], BUFFER_LOCK_UNLOCK); + if (i < num_operations) + { + Buffer buf = base_buffers[block_sequence[i]]; + + LockBuffer(buf, BUFFER_LOCK_SHARE); + pipeline[i % num_buffers] = buf; + } + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* Release all base pins */ + for (BlockNumber i = 0; i < nblocks; i++) + ReleaseBuffer(base_buffers[i]); + + pfree(block_sequence); + pfree(pipeline); + pfree(base_buffers); + relation_close(rel, AccessShareLock); + + return (Datum) 0; +} + +/* + * bench_resowner - benchmark ResourceOwner remember/forget operations + * + * Uses fake resource values to test pure ResourceOwner tracking overhead + * without any actual resource operations. Uses same pipelined structure. + */ +Datum +bench_resowner(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + int num_buffers = PG_GETARG_INT32(0); + int num_operations = PG_GETARG_INT32(1); + int iterations = PG_GETARG_INT32(2); + bool random_access = PG_GETARG_BOOL(3); + + Datum *resources; + Datum *pipeline; + int *access_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + resources = palloc(num_buffers * sizeof(Datum)); + pipeline = palloc(num_buffers * sizeof(Datum)); + access_sequence = palloc(num_operations * sizeof(int)); + + /* Use fake resource values (pointers to array elements for uniqueness) */ + for (int i = 0; i < num_buffers; i++) + resources[i] = PointerGetDatum(&resources[i]); + + generate_access_sequence(access_sequence, num_operations, num_buffers, random_access); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + ResourceOwnerForget(CurrentResourceOwner, + pipeline[(i - num_buffers) % num_buffers], + &bench_resowner_desc); + if (i < num_operations) + { + Datum res = resources[access_sequence[i]]; + ResourceOwnerEnlarge(CurrentResourceOwner); + ResourceOwnerRemember(CurrentResourceOwner, res, &bench_resowner_desc); + pipeline[i % num_buffers] = res; + } + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + pfree(access_sequence); + pfree(pipeline); + pfree(resources); + + return (Datum) 0; +} diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.control b/src/test/modules/test_buffer_pin/test_buffer_pin.control new file mode 100644 index 00000000000..f0659417127 --- /dev/null +++ b/src/test/modules/test_buffer_pin/test_buffer_pin.control @@ -0,0 +1,4 @@ +comment = 'test_buffer_pin - buffer pinning benchmarks' +default_version = '1.0' +module_pathname = '$libdir/test_buffer_pin' +relocatable = true -- 2.53.0 [application/octet-stream] v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch (6.5K, 7-v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch) download | inline diff: From 85abc829474a4b0f40461af169dd0803e2e22c88 Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Fri, 6 Mar 2026 17:46:38 +0000 Subject: [PATCH 5/5] REFCOUNT_ARRAY_ENTRIES=0 The simple hash performance is is fairly close to the direct array. For few pins we are trading one small for loop for an index calculation (buffer % N), this could be a (buffer & (N-1)) if we restrict the simple array to use powers of 2 sizes. For more than REFCOUNT_ARRAY_ENTRIES on distinct buffers pinned/unpinned in a FIFO fashion, this is a strict improvement as every pin requires a hash table operation. --- src/backend/storage/buffer/buf_refcount.c | 60 +- src/include/storage/buf_refcount.h | 3 + 2 files changed diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c index 29dfb720997..90cb42edbb5 100644 --- a/src/backend/storage/buffer/buf_refcount.c +++ b/src/backend/storage/buffer/buf_refcount.c @@ -88,17 +88,18 @@ struct PrivateRefCountIterator #define SH_DEFINE #include "lib/simplehash.h" -/* Private refcount array and keys */ -#define REFCOUNT_ARRAY_ENTRIES 8 + +/* + * Private refcount array and keys + * If set to 0, all the code handling the transfers between the array + * and the hash table is disabled at compilation time. +*/ +#define REFCOUNT_ARRAY_ENTRIES 0 + +#if REFCOUNT_ARRAY_ENTRIES > 0 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; -/* Overflow hash table for when array is full */ -static refcount_hash *PrivateRefCountHash = NULL; - -/* Count of entries that have overflowed into the hash table */ -static int32 PrivateRefCountOverflowed = 0; - /* Clock hand for selecting victim when array is full */ static uint32 PrivateRefCountClock = 0; @@ -107,14 +108,23 @@ static int ReservedRefCountSlot = -1; /* Cache for last accessed entry */ static int PrivateRefCountEntryLast = -1; +#endif /* Advisory limit on the number of pins each backend should hold */ static uint32 MaxProportionalPins = 0; +/* Hash table (overflow when array used, primary when hash-only) */ +static refcount_hash *PrivateRefCountHash = NULL; + +/* Count of entries in the hash table */ +static int32 PrivateRefCountOverflowed = 0; + +#if REFCOUNT_ARRAY_ENTRIES > 0 /* Forward declarations */ static void ReservePrivateRefCountEntry(void); static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer); +#endif /* * Initialize private refcount tracking for this backend. @@ -130,12 +140,15 @@ InitPrivateRefCount(void) * GetAdditionalPinLimit() can be used to check the remaining balance. */ MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); +#if REFCOUNT_ARRAY_ENTRIES > 0 memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); +#endif PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL); } +#if REFCOUNT_ARRAY_ENTRIES > 0 /* * Ensure that the PrivateRefCountArray has sufficient space to store one more * entry. @@ -233,7 +246,9 @@ NewPrivateRefCountEntry(Buffer buffer) return res; } +#endif /* REFCOUNT_ARRAY_ENTRIES > 0 */ +#if REFCOUNT_ARRAY_ENTRIES > 0 /* * Slow-path for GetSharedBufferEntry(). */ @@ -270,6 +285,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer) res = refcount_lookup(PrivateRefCountHash, buffer); return res; } +#endif /* REFCOUNT_ARRAY_ENTRIES > 0 */ /* * Return the PrivateRefCount entry for the passed buffer. @@ -281,6 +297,7 @@ GetSharedBufferEntry(Buffer buffer) Assert(BufferIsValid(buffer)); Assert(!BufferIsLocal(buffer)); +#if REFCOUNT_ARRAY_ENTRIES > 0 /* Fast path: check one-entry cache */ if (likely(PrivateRefCountEntryLast != -1) && likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) @@ -289,6 +306,10 @@ GetSharedBufferEntry(Buffer buffer) } return GetPrivateRefCountEntrySlow(buffer); +#else + /* Hash-only mode: direct lookup */ + return refcount_lookup(PrivateRefCountHash, buffer); +#endif } /* @@ -308,10 +329,20 @@ SharedBufferRef(Buffer buffer) if (ref == NULL) { - /* New pin - create entry */ +#if REFCOUNT_ARRAY_ENTRIES > 0 + /* New pin - create entry in array */ ReservePrivateRefCountEntry(); ref = NewPrivateRefCountEntry(buffer); ref->data = ONE_PRIVATE_REFERENCE; +#else + /* Hash-only mode: insert directly */ + bool found; + + ref = refcount_insert(PrivateRefCountHash, buffer, &found); + Assert(!found); + ref->data = ONE_PRIVATE_REFERENCE; + PrivateRefCountOverflowed++; +#endif } else { @@ -352,6 +383,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref) /* No more references - clean up the entry */ Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK); +#if REFCOUNT_ARRAY_ENTRIES > 0 if (ref >= &PrivateRefCountArray[0] && ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) { @@ -360,6 +392,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref) ReservedRefCountSlot = ref - PrivateRefCountArray; } else +#endif { /* could make slightly more efficient by using the pointer */ refcount_delete(PrivateRefCountHash, ref->buffer); @@ -409,11 +442,11 @@ CheckPrivateRefCountLeaks(void) #ifdef USE_ASSERT_CHECKING int RefCountErrors = 0; PrivateRefCountEntry *res; - int i; char *s; +#if REFCOUNT_ARRAY_ENTRIES > 0 /* check the array */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) { if (PrivateRefCountArrayKeys[i] != InvalidBuffer) { @@ -426,8 +459,9 @@ CheckPrivateRefCountLeaks(void) RefCountErrors++; } } +#endif - /* if necessary search the hash */ + /* search the hash */ if (PrivateRefCountOverflowed) { refcount_iterator iter; @@ -467,6 +501,7 @@ InitPrivateRefCountIterator(void) PrivateRefCountEntry * GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) { +#if REFCOUNT_ARRAY_ENTRIES > 0 /* First iterate through the array */ while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES) { @@ -475,8 +510,9 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) if (PrivateRefCountArrayKeys[idx] != InvalidBuffer) return &PrivateRefCountArray[idx]; } +#endif - /* Then iterate through the hash if there are overflowed entries */ + /* Then iterate through the hash if there are entries */ if (!iter->in_hash) { iter->in_hash = true; -- 2.53.0 [text/x-sh] run-all.sh (2.7K, 8-run-all.sh) download | inline: #!/bin/bash set -e SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd) PATCH_DIR="$SCRIPT_DIR/" REPO_ROOT=$(cd "$SCRIPT_DIR/../.." && pwd) BUILDS_DIR="$REPO_ROOT/.builds" DBDATA_DIR="$REPO_ROOT/.dbdata" RESULTS_DIR="$REPO_ROOT/src/test/modules/test_buffer_pin/results" BENCHMARK_DIR="$REPO_ROOT/src/test/modules/test_buffer_pin" mkdir -p $BENCHMARK_DIR/results/ PORT=5434 cd "$REPO_ROOT" #AM_OPTS="--whitespace=nowarn --3way" AM_OPTS="" function clean_repo() { git reset --hard # git clean -fd } function build_and_test() { local NAME="$1" local TITLE="$2" local PREFIX="$BUILDS_DIR/$NAME" echo "" echo "============================================================" echo "Building: $NAME" echo "============================================================" # Configure and build make distclean >/dev/null 2>&1 || true ./configure --prefix="$PREFIX" --without-icu --without-readline --without-zlib >/dev/null make -j8 -s make install -s # Build and install extension cd "$BENCHMARK_DIR" make clean -s 2>/dev/null || true make PG_CONFIG="$PREFIX/bin/pg_config" USE_PGXS=1 install -s cd "$REPO_ROOT" echo "" echo "============================================================" echo "Testing: $NAME" echo "============================================================" # Stop any running server "$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" stop 2>/dev/null || true sleep 1 # Initialize fresh database rm -rf "$DBDATA_DIR" "$PREFIX/bin/initdb" -D "$DBDATA_DIR" >/dev/null # Start server "$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" -l "$REPO_ROOT/logfile_$NAME" -o "-p $PORT" start sleep 2 # Run benchmark mkdir -p "$RESULTS_DIR" cd "$BENCHMARK_DIR" python3 benchmark.py --port $PORT --name "$NAME" --title "$TITLE" \ --max-dist 10000 --points 200 --samples 50 --iterations 20000 cd "$REPO_ROOT" # Stop server "$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" stop echo "Results saved to: $RESULTS_DIR/$NAME.{svg,csv,txt}" } # # Master with patches applied incrementally git checkout -B pins origin/master # clean_repo # Then apply each numbered patch and benchmark after each for patch in "$PATCH_DIR"/*.patch; do [ -f "$patch" ] || continue # Extract patch number (e.g., 0001 -> 1) patchnum=$(basename "$patch" | cut -c1-4 | sed 's/^0*//') patchname=$(basename "$patch" .patch) git am $AM_OPTS "$patch" build_and_test "patch-$patchnum" "Patch $patchnum: $patchname" done echo "" echo "============================================================" echo "All benchmarks complete!" echo "Results in: $RESULTS_DIR/" echo "============================================================" [text/x-python-script] compare-patches.py (3.4K, 9-compare-patches.py) download | inline: #!/usr/bin/env python3 """Generate comparison charts for step*.csv benchmark results.""" import pandas as pd import matplotlib.pyplot as plt import glob import os RESULTS_DIR = os.path.dirname(os.path.abspath(__file__)) + '/results' # Find all step*.csv files csv_files = sorted(glob.glob(f'{RESULTS_DIR}/*.csv')) # Filter to main steps only main_steps = [f'patch-{i}' for i in [1, 2, 3, 4, 5]] csv_files = [f for f in csv_files if any(s in f for s in main_steps)] print(f"Found {len(csv_files)} step CSV files:") for f in csv_files: print(f" - {os.path.basename(f)}") # Load all data as pivoted tables all_data = [] for csv_file in csv_files: df = pd.read_csv(csv_file) step_name = os.path.basename(csv_file).replace('.csv', '') pivot = df.pivot_table(index=['pattern', 'num_buffers'], columns='benchmark', values='median_ns').reset_index() pivot['step'] = step_name # Compute derived metrics if 'read' in pivot.columns and 'resowner' in pivot.columns: pivot['read-resowner'] = pivot['read'] - pivot['resowner'] all_data.append(pivot) data = pd.concat(all_data, ignore_index=True) # Define nice labels step_labels = { 'v4-pg18-baseline': 'PostgreSQL 18 Baseline', 'patch-1': 'Patch 1: Pg19 Baseline', 'patch-2': 'Patch 2: Refactoring', 'patch-3': 'Patch 3: Simple hash', 'patch-4': 'Patch 4: Comopact entry', 'patch-5': 'Patch 5: No array', } # Color scheme colors = {k: c for k, c in zip( step_labels.keys(), ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00'] )} def plot_benchmark(benchmark_name, title, output_name): """Plot comparison for a specific benchmark.""" if benchmark_name not in data.columns: print(f"Skipping {benchmark_name}: column not found") return fig, axes = plt.subplots(1, 2, figsize=(10, 5)) fig.suptitle(title, fontsize=14, fontweight='bold') for idx, pattern in enumerate(['sequential', 'random']): ax = axes[idx] ax.set_title(f'{pattern.capitalize()} Access Pattern') ax.set_xlabel('Number of Buffers') ax.set_ylabel('Time (ns)') ax.set_xscale('log') ax.grid(True, alpha=0.3) for step in main_steps: subset = data[(data['pattern'] == pattern) & (data['step'] == step)] if len(subset) > 0 and benchmark_name in subset.columns: ax.plot(subset['num_buffers'], subset[benchmark_name], label=step_labels.get(step, step), color=colors.get(step, 'gray'), linewidth=2, marker='o', markersize=3) ax.legend(loc='upper left', fontsize=9) plt.tight_layout() output_path = f'{RESULTS_DIR}/{output_name}.svg' plt.savefig(output_path, format='svg', bbox_inches='tight') plt.savefig(output_path.replace('.svg', '.png'), format='png', dpi=150, bbox_inches='tight') print(f"Saved: {output_path}") plt.close() # Generate three comparison figures plot_benchmark('read-resowner', 'Read/Release excluding Resowner', 'compare-read-resowner') plot_benchmark('read', 'Read/Release Performance Comparison', 'compare-read') plot_benchmark('resowner', 'ResourceOwner Performance Comparison', 'compare-resowner') plot_benchmark('pinning', 'Pin/Unpin Performance Comparison', 'compare-pinning') plot_benchmark('locking', 'Lock/Unlock Performance Comparison', 'compare-locking') print("\nDone! Generated comparison charts.") ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Addressing buffer private reference count scalability issue @ 2026-03-08 17:09 Andres Freund <[email protected]> parent: Alexandre Felipe <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Andres Freund @ 2026-03-08 17:09 UTC (permalink / raw) To: Alexandre Felipe <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]> Hi, On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote: > 2. > > Refactoring reference counting: Before starting to change code and > potentially breaking things I considered prudent to isolate it to limit the > damage. This code was part of a 8k+ LOC file. Not allowing, at least the fast paths, to be inlined will make the most common cases measurably slower, I've tried that. > 3. > > Using simplehash: Simply replacing the HTAB for a simplehash, and adding > a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE. Yea, we'll need to that. Peter Geoghegan has a patch for it as well. > Here I assume that the buffer buffer sequence is independent enough from the > array size, so I use the buffer as the hash key directly, omitting a hash > function call. I doubt that that's good enough. The hash table code relies on the bits being well mixed, but you won't get that with buffer ids. > 4. > > Compact PrivateRefCountEntry: The original implementation used a 4-byte > key and 8-byte value. Reference count uses 32 bits, and it is unreasonable > to expect one backend to pin the same buffer 1 billion times. The lock mode > uses 32 bits but can only assume 4 values. So I packed them in one single > uint32, giving 30 bits for count and 2 bits for lock mode. This makes the > entries 8-byte long, on 64-bit CPUs this represents more than a 1/3 > reduction in memory. This makes the array aligned with the 64-bit words, > copying one entry can be completed in one instruction, and every entry will > be aligned. > 5. I'm rather sceptical that the overhead of needing to shift and mask is worth it. I suspect we'll also want to add more state for each entry (e.g. I think it may be worth getting rid of the io-in-progress resowner). > REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array > lookup, it is worth trying to remove it completely and keep only the hash. > For small values we would be trading a few branches for a buffer % SIZE, > for the use case of prefetch where pin/unpin in a FIFO fashion, it will > save an 8-entry array lookup, and some extra data moves. I doubt that that's ok, in the vast majority of access we will have 0-2 buffers pinned. And even when we have pinned more buffers, it's *exceedingly* common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin), adding a few cycles to each of those repeated accesses will quickly show up. > From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001 > From: Alexandre Felipe <[email protected]> > Date: Fri, 6 Mar 2026 17:15:44 +0000 > Subject: [PATCH 4/5] Compact PrivateRefCountEntry > > The previous implementation used an 8-bite (64-bit) entry to store > a uint32 count and an uint32 lock mode. That is fine when we store > the data separate from the key (buffer). But in the simplehash > {key, value} are stored together, so each entry is 12-bytes. > This is somewhat awkward as we have to either pad the entry to 16-bytes, > or the access will be an alternating aligned/misaligned addreses. > > However, we are probably not going to need even 16-bits for the count > and 2-bits is enough for the lock mode. So we can easily pack these > int to a single uint32. I wouldn't want to rely on a 16bit pin counter anyway. > Incrementing/decrementing the count continue the same, just using > 4 instead of 1, lock mode access will require one or two additional > bitwise operations. > > No bit-shifts are required. I don't know how that last sentence can be true, given that: > -struct PrivateRefCountEntry > +#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3 > +#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */ > + > +#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK)) > +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m)) > +#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2)) > +#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE) Involves shifts at least in PrivateRefCountGetRefCount().. Greetings, Andres Freund ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Addressing buffer private reference count scalability issue @ 2026-03-11 18:03 Alexandre Felipe <[email protected]> parent: Andres Freund <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Alexandre Felipe @ 2026-03-11 18:03 UTC (permalink / raw) To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]> Hi Hackers, Please find attached version 2 of this patch and the results of the ReadBuffer/ReleaseBuffer benchmark For those who prefer plain text here goes an excerpt baseline patch num_buffers pattern 1 random 194.76 183.50 sequential 183.75 168.02 2 random 204.74 187.02 sequential 195.27 169.58 3 random 222.62 192.57 sequential 212.81 169.55 7 random 233.96 219.64 sequential 221.94 171.94 11 random 365.50 247.96 sequential 318.24 183.96 19 random 390.58 267.96 sequential 369.51 195.35 100 random 424.03 286.33 sequential 420.74 280.23 988 random 445.75 301.37 sequential 453.61 313.30 10000 random 587.13 418.91 sequential 497.92 394.10 On Sun, Mar 8, 2026 at 5:09 PM Andres Freund <[email protected]> wrote: > Hi, > > On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote: > > 2. > > > > Refactoring reference counting: Before starting to change code and > > potentially breaking things I considered prudent to isolate it to > limit the > > damage. This code was part of a 8k+ LOC file. > > Not allowing, at least the fast paths, to be inlined will make the most > common > cases measurably slower, I've tried that. > I made it inline in this time. I placed in a separate file, I am including the .h at the top and the .c > > Here I assume that the buffer buffer sequence is independent enough from > the > > array size, so I use the buffer as the hash key directly, omitting a hash > > function call. > > I doubt that that's good enough. The hash table code relies on the bits > being > well mixed, but you won't get that with buffer ids. > Introduced murmurhash32, it is noticeably worse for sequential access (by buffer index), but has some mixing. > I'm rather sceptical that the overhead of needing to shift and mask is > worth > it. I suspect we'll also want to add more state for each entry (e.g. I > think > it may be worth getting rid of the io-in-progress resowner). > io-in-progress resowner name suggests that it runs only for reads... so maybe it could be taken out of the way of buffer hits? > > REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array > > lookup, it is worth trying to remove it completely and keep only the > hash. > > For small values we would be trading a few branches for a buffer % > SIZE, > > for the use case of prefetch where pin/unpin in a FIFO fashion, it > will > > save an 8-entry array lookup, and some extra data moves. > > I doubt that that's ok, in the vast majority of access we will have 0-2 > buffers pinned. And even when we have pinned more buffers, it's > *exceedingly* > common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin), > adding a few cycles to each of those repeated accesses will quickly show > up. > I brought back the array, but I eliminated the linear search. 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache. 2a. the dynamic array case REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE will grow the array when it reaches a certain level of occupation. I have set the default occupation level to 86% so that, if enabled, for a random input it will grow when we have about 2*size pins in total. If we find a sequential pattern then it will grow without growing the hash table. For the array lookup I don't use a hash, so for small number of pins it will be very fast. 2b. the static case REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE will use a static array, just as we had before and will not perform the linear search. It still has to read the size and do mask input. I tested the 4 variations and the winner is with the static array without the cache for the last entry. I increased the array size from 8 to 32, since you suggested before that that this could help. At that point it would have the tradeoff of a longer linear search, so it may help even more now. > Incrementing/decrementing the count continue the same, just using > > 4 instead of 1, lock mode access will require one or two additional > > bitwise operations. > > > > No bit-shifts are required. > > I don't know how that last sentence can be true, given that: > > > -struct PrivateRefCountEntry > > +#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3 > > +#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 > */ > > + > > +#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & > PRIVATEREFCOUNT_LOCKMODE_MASK)) > > +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & > ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m)) > > +#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2)) > > +#define PrivateRefCountIsZero(d) ((d) < > ONE_PRIVATE_REFERENCE) > > Involves shifts at least in PrivateRefCountGetRefCount().. > Yes, there was some unintended code there. I am adding SharedBufferHasSingleRef to replace the (...) == 1 checks. The actual count is used only for debugging, so the shift there shouldn't be a problem. Regards, Alexandre Attachments: [image/png] compare-read.png (156.2K, 3-compare-read.png) download | view image [application/x-patch] v2-0004-Compact-PrivateRefCountEntry.patch (12.8K, 4-v2-0004-Compact-PrivateRefCountEntry.patch) download | inline diff: From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Wed, 11 Mar 2026 12:08:03 +0000 Subject: [PATCH 4/5] Compact PrivateRefCountEntry The previous implementation used an 8-bytes (64-bit) entry to store a uint32 count and an uint32 lock mode. That is fine when we store the data separate from the key (buffer). But in the simplehash {key, value} are stored together, so each entry is 12-bytes. This is somewhat awkward as we have to either pad the entry to 16-bytes, or the access will be an alternating aligned/misaligned addreses. Lock can assume only 4 values, and 2^30 is a decent limit for the number of pins on a single buffer. So this change is packing the {count[31:2], lock[1:0]} into a single uint32. Incrementing/decrementing the count continue the same, just using 4 instead of 1, lock mode access will require one or two additional bitwise operations. The exact count requires one shift, and is used only for debugging. A special function is provided to check whether count == 1. No bit-shifts are required. --- src/backend/storage/buffer/bufmgr_refcnt.c | 186 +++++++++------------ src/backend/storage/buffer/bufmgr_refcnt.h | 2 +- 2 files changed, 76 insertions(+), 112 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c index dd95d953f6..bf84e9f62e 100644 --- a/src/backend/storage/buffer/bufmgr_refcnt.c +++ b/src/backend/storage/buffer/bufmgr_refcnt.c @@ -47,47 +47,45 @@ #include "bufmgr_refcnt.h" #include "storage/proc.h" -/* Structure definitions - internal to this file */ -typedef struct PrivateRefCountData +/* + * Implementation details - opaque to callers. + * Packed refcount and lockmode in a single uint32: + * Bits 0-1: lockmode (4 values: UNLOCK, SHARE, SHARE_EXCLUSIVE, EXCLUSIVE) + * Bits 2-31: refcount (30 bits, max ~1 billion) + */ +struct PrivateRefCountEntry { - int32 refcount; - BufferLockMode lockmode; -} PrivateRefCountData; + Buffer buffer; + uint32 data; +}; -struct PrivateRefCountEntry +#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3 +#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */ + +#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE) + +struct PrivateRefCountIterator { - Buffer buffer; /* hash key - InvalidBuffer means empty */ - PrivateRefCountData data; + int array_index; + bool in_hash; + void *hash_status; }; -/* - * Define simplehash parameters for the overflow hash table. - * We use buffer == InvalidBuffer as the "empty" marker to avoid needing - * a separate status field. - */ +/* Define simplehash for private refcount overflow hash table */ #define SH_PREFIX refcount #define SH_ELEMENT_TYPE PrivateRefCountEntry #define SH_KEY_TYPE Buffer #define SH_KEY buffer -#define SH_HASH_KEY(tb, key) murmurhash32(key) +#define SH_HASH_KEY(tb, key) ((uint32) murmurhash32(key)) #define SH_EQUAL(tb, a, b) ((a) == (b)) #define SH_SCOPE static inline -#define SH_DEFINE -#define SH_DECLARE -/* Use buffer field as empty marker - no separate status needed */ #define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer) #define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer) -#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment handles this */ +#define SH_MAKE_IN_USE(entry) ((void) 0) +#define SH_DECLARE +#define SH_DEFINE #include "lib/simplehash.h" -struct PrivateRefCountIterator -{ - int array_index; - bool in_hash; - refcount_iterator hash_iter; - bool hash_iter_valid; -}; - /* Private refcount array and keys */ #define REFCOUNT_ARRAY_ENTRIES 8 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; @@ -112,9 +110,9 @@ static int PrivateRefCountEntryLast = -1; static uint32 MaxProportionalPins = 0; /* Forward declarations */ -static void ReservePrivateRefCountEntry(void); +static void ReservePrivateRefCountEntry(Buffer buffer); static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); -static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move); +static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer); /* * Initialize private refcount tracking for this backend. @@ -133,7 +131,7 @@ InitPrivateRefCount(void) memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL); + PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL); } /* @@ -141,15 +139,15 @@ InitPrivateRefCount(void) * entry. */ static void -ReservePrivateRefCountEntry(void) +ReservePrivateRefCountEntry(Buffer buffer) { /* Already reserved (or freed), nothing to do */ if (ReservedRefCountSlot != -1) return; /* - * First search for a free entry the array, that'll be sufficient in the - * majority of cases. + * First search for a free entry in the array, that'll be sufficient in + * the majority of cases. */ { int i; @@ -159,11 +157,9 @@ ReservePrivateRefCountEntry(void) if (PrivateRefCountArrayKeys[i] == InvalidBuffer) { ReservedRefCountSlot = i; + return; } } - - if (ReservedRefCountSlot != -1) - return; } /* @@ -196,10 +192,7 @@ ReservePrivateRefCountEntry(void) /* clear the now free array slot */ PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; victim_entry->buffer = InvalidBuffer; - - memset(&victim_entry->data, 0, sizeof(victim_entry->data)); - victim_entry->data.refcount = 0; - victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK; + victim_entry->data = 0; PrivateRefCountOverflowed++; } @@ -222,8 +215,7 @@ NewPrivateRefCountEntry(Buffer buffer) /* and fill it */ PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; res->buffer = buffer; - res->data.refcount = 0; - res->data.lockmode = BUFFER_LOCK_UNLOCK; + res->data = 0; /* update cache for the next lookup */ PrivateRefCountEntryLast = ReservedRefCountSlot; @@ -237,10 +229,9 @@ NewPrivateRefCountEntry(Buffer buffer) * Slow-path for GetSharedBufferEntry(). */ static pg_noinline PrivateRefCountEntry * -GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) +GetPrivateRefCountEntrySlow(Buffer buffer) { PrivateRefCountEntry *res; - int match = -1; int i; /* @@ -250,16 +241,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) { if (PrivateRefCountArrayKeys[i] == buffer) { - match = i; + PrivateRefCountEntryLast = i; + return &PrivateRefCountArray[i]; } } - if (likely(match != -1)) - { - PrivateRefCountEntryLast = match; - return &PrivateRefCountArray[match]; - } - /* * Only look up the buffer in the hashtable if we've previously overflowed. */ @@ -267,41 +253,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) return NULL; res = refcount_lookup(PrivateRefCountHash, buffer); - - if (res == NULL) - return NULL; - else if (!do_move) - { - return res; - } - else - { - /* move buffer from hashtable into the free array slot */ - PrivateRefCountEntry *free; - - ReservePrivateRefCountEntry(); - - Assert(ReservedRefCountSlot != -1); - free = &PrivateRefCountArray[ReservedRefCountSlot]; - Assert(free->buffer == InvalidBuffer); - - free->buffer = buffer; - free->data = res->data; - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - PrivateRefCountEntryLast = ReservedRefCountSlot; - - ReservedRefCountSlot = -1; - - refcount_delete_item(PrivateRefCountHash, res); - Assert(PrivateRefCountOverflowed > 0); - PrivateRefCountOverflowed--; - - return free; - } + return res; } /* - * Return the PrivateRefCountEntry for the passed buffer. + * Return the PrivateRefCount entry for the passed buffer. * Returns NULL if the buffer is not currently pinned. */ static PrivateRefCountEntry * @@ -317,7 +273,7 @@ GetSharedBufferEntry(Buffer buffer) return &PrivateRefCountArray[PrivateRefCountEntryLast]; } - return GetPrivateRefCountEntrySlow(buffer, false); + return GetPrivateRefCountEntrySlow(buffer); } /* @@ -333,9 +289,9 @@ SharedBufferCreateRef(Buffer buffer) Assert(BufferIsValid(buffer)); Assert(!BufferIsLocal(buffer)); - ReservePrivateRefCountEntry(); + ReservePrivateRefCountEntry(buffer); ref = NewPrivateRefCountEntry(buffer); - ref->data.refcount++; + ref->data = ONE_PRIVATE_REFERENCE; return ref; } @@ -348,8 +304,8 @@ static void SharedBufferRefExisting(PrivateRefCountEntry *ref) { Assert(ref != NULL); - Assert(ref->data.refcount > 0); - ref->data.refcount++; + Assert(!PrivateRefCountIsZero(ref->data)); + ref->data += ONE_PRIVATE_REFERENCE; } /* @@ -361,14 +317,14 @@ static bool SharedBufferUnref(PrivateRefCountEntry *ref) { Assert(ref != NULL); - Assert(ref->data.refcount > 0); + Assert(!PrivateRefCountIsZero(ref->data)); - ref->data.refcount--; + ref->data -= ONE_PRIVATE_REFERENCE; - if (ref->data.refcount == 0) + if (PrivateRefCountIsZero(ref->data)) { /* No more references - clean up the entry */ - Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK); + Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK); if (ref >= &PrivateRefCountArray[0] && ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) @@ -379,7 +335,8 @@ SharedBufferUnref(PrivateRefCountEntry *ref) } else { - refcount_delete_item(PrivateRefCountHash, ref); + /* could make slightly more efficient by using the pointer */ + refcount_delete(PrivateRefCountHash, ref->buffer); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; } @@ -390,25 +347,16 @@ SharedBufferUnref(PrivateRefCountEntry *ref) return false; } -/* - * Accessors for refcount entry fields. - */ -static int32 -SharedBufferRefCount(PrivateRefCountEntry *ref) -{ - return ref->data.refcount; -} - static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref) { - return ref->data.lockmode; + return (BufferLockMode)(ref->data & PRIVATEREFCOUNT_LOCKMODE_MASK); } static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode) { - ref->data.lockmode = mode; + ref->data = (ref->data & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | mode; } /* @@ -417,7 +365,18 @@ SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode) static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref) { - return ref != NULL && ref->data.refcount == 1; + return ref != NULL && + ref->data >= ONE_PRIVATE_REFERENCE && + ref->data < (2 * ONE_PRIVATE_REFERENCE); +} + +/* + * Get the refcount for a buffer entry (for debug output only). + */ +static int32 +SharedBufferRefCount(PrivateRefCountEntry *ref) +{ + return (int32)(ref->data >> 2); } /* @@ -488,7 +447,7 @@ AssertBufferLocksPermitCatalogRead(void) if (buf == InvalidBuffer) continue; - AssertNotCatalogBufferLock(buf, res->data.lockmode); + AssertNotCatalogBufferLock(buf, SharedBufferGetLockMode(res)); } FreePrivateRefCountIterator(iter); } @@ -504,7 +463,7 @@ InitPrivateRefCountIterator(void) iter->array_index = 0; iter->in_hash = false; - iter->hash_iter_valid = false; + iter->hash_status = NULL; return iter; } @@ -530,20 +489,23 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) iter->in_hash = true; if (PrivateRefCountOverflowed > 0) { - refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter); - iter->hash_iter_valid = true; + refcount_iterator *hiter = palloc(sizeof(refcount_iterator)); + + refcount_start_iterate(PrivateRefCountHash, hiter); + iter->hash_status = hiter; } } - if (iter->hash_iter_valid) + if (iter->hash_status != NULL) { PrivateRefCountEntry *res; - res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter); + res = refcount_iterate(PrivateRefCountHash, (refcount_iterator *) iter->hash_status); if (res != NULL) return res; - iter->hash_iter_valid = false; + pfree(iter->hash_status); + iter->hash_status = NULL; } return NULL; @@ -555,6 +517,8 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) void FreePrivateRefCountIterator(PrivateRefCountIterator *iter) { + if (iter->hash_status != NULL) + pfree(iter->hash_status); pfree(iter); } diff --git a/src/backend/storage/buffer/bufmgr_refcnt.h b/src/backend/storage/buffer/bufmgr_refcnt.h index f2a6f45d84..8d31f6ee6f 100644 --- a/src/backend/storage/buffer/bufmgr_refcnt.h +++ b/src/backend/storage/buffer/bufmgr_refcnt.h @@ -39,8 +39,8 @@ static void SharedBufferRefExisting(PrivateRefCountEntry *ref); static bool SharedBufferUnref(PrivateRefCountEntry *ref); static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref); static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode); -static int32 SharedBufferRefCount(PrivateRefCountEntry *ref); static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref); +static int32 SharedBufferRefCount(PrivateRefCountEntry *ref); /* Pin limiting */ extern uint32 GetPinLimit(void); -- 2.34.1 [application/x-patch] v2-0002-Refactoring-reference-counting.patch (54.0K, 5-v2-0002-Refactoring-reference-counting.patch) download | inline diff: From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Wed, 11 Mar 2026 12:05:50 +0000 Subject: [PATCH 2/5] Refactoring reference counting This moves the private reference count logic out of backend/storage/buffer/buffmgr.c that is still 8000+ lines long. Preparing for the changes to come. --- .gitignore | 2 + src/backend/storage/buffer/Makefile | 3 + src/backend/storage/buffer/bufmgr.c | 704 +++------------------ src/backend/storage/buffer/bufmgr_refcnt.c | 615 ++++++++++++++++++ src/backend/storage/buffer/bufmgr_refcnt.h | 59 ++ 5 files changed, 751 insertions(+), 632 deletions(-) create mode 100644 src/backend/storage/buffer/bufmgr_refcnt.c create mode 100644 src/backend/storage/buffer/bufmgr_refcnt.h diff --git a/.gitignore b/.gitignore index 4e911395fe..fddb7f861d 100644 --- a/.gitignore +++ b/.gitignore @@ -43,3 +43,5 @@ lib*.pc /Release/ /tmp_install/ /portlock/ + +.* \ No newline at end of file diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile index fd7c40dcb0..6f45674cae 100644 --- a/src/backend/storage/buffer/Makefile +++ b/src/backend/storage/buffer/Makefile @@ -20,3 +20,6 @@ OBJS = \ localbuf.o include $(top_srcdir)/src/backend/common.mk + +# bufmgr.c includes bufmgr_refcnt.c for inlining +bufmgr.o: bufmgr_refcnt.c bufmgr_refcnt.h \ No newline at end of file diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0546ee0193..82af282f70 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -54,6 +54,7 @@ #include "storage/aio.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" +#include "bufmgr_refcnt.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/lmgr.h" @@ -70,7 +71,6 @@ #include "utils/timestamp.h" #include "utils/wait_event.h" - /* Note: these two macros only work on shared buffers, not local ones! */ #define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) #define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr))) @@ -93,43 +93,6 @@ */ #define BUF_DROP_FULL_SCAN_THRESHOLD (uint64) (NBuffers / 32) -/* - * This is separated out from PrivateRefCountEntry to allow for copying all - * the data members via struct assignment. - */ -typedef struct PrivateRefCountData -{ - /* - * How many times has the buffer been pinned by this backend. - */ - int32 refcount; - - /* - * Is the buffer locked by this backend? BUFFER_LOCK_UNLOCK indicates that - * the buffer is not locked. - */ - BufferLockMode lockmode; -} PrivateRefCountData; - -typedef struct PrivateRefCountEntry -{ - /* - * Note that this needs to be same as the entry's corresponding - * PrivateRefCountArrayKeys[i], if the entry is stored in the array. We - * store it in both places as this is used for the hashtable key and - * because it is more convenient (passing around a PrivateRefCountEntry - * suffices to identify the buffer) and faster (checking the keys array is - * faster when checking many entries, checking the entry is faster if just - * checking a single entry). - */ - Buffer buffer; - - PrivateRefCountData data; -} PrivateRefCountEntry; - -/* 64 bytes, about the size of a cache line on common systems */ -#define REFCOUNT_ARRAY_ENTRIES 8 - /* * Status of buffers to checkpoint for a particular tablespace, used * internally in BufferSync. @@ -213,55 +176,6 @@ int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER; /* local state for LockBufferForCleanup */ static BufferDesc *PinCountWaitBuf = NULL; -/* - * Backend-Private refcount management: - * - * Each buffer also has a private refcount that keeps track of the number of - * times the buffer is pinned in the current process. This is so that the - * shared refcount needs to be modified only once if a buffer is pinned more - * than once by an individual backend. It's also used to check that no - * buffers are still pinned at the end of transactions and when exiting. We - * also use this mechanism to track whether this backend has a buffer locked, - * and, if so, in what mode. - * - * - * To avoid - as we used to - requiring an array with NBuffers entries to keep - * track of local buffers, we use a small sequentially searched array - * (PrivateRefCountArrayKeys, with the corresponding data stored in - * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to - * keep track of backend local pins. - * - * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all - * refcounts are kept track of in the array; after that, new array entries - * displace old ones into the hash table. That way a frequently used entry - * can't get "stuck" in the hashtable while infrequent ones clog the array. - * - * Note that in most scenarios the number of pinned buffers will not exceed - * REFCOUNT_ARRAY_ENTRIES. - * - * - * To enter a buffer into the refcount tracking mechanism first reserve a free - * entry using ReservePrivateRefCountEntry() and then later, if necessary, - * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing - * memory allocations in NewPrivateRefCountEntry() which can be important - * because in some scenarios it's called with a spinlock held... - */ -static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; -static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; -static HTAB *PrivateRefCountHash = NULL; -static int32 PrivateRefCountOverflowed = 0; -static uint32 PrivateRefCountClock = 0; -static int ReservedRefCountSlot = -1; -static int PrivateRefCountEntryLast = -1; - -static uint32 MaxProportionalPins; - -static void ReservePrivateRefCountEntry(void); -static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); -static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move); -static inline int32 GetPrivateRefCount(Buffer buffer); -static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref); - /* ResourceOwner callbacks to hold in-progress I/Os and buffer pins */ static void ResOwnerReleaseBufferIO(Datum res); static char *ResOwnerPrintBufferIO(Datum res); @@ -286,301 +200,6 @@ const ResourceOwnerDesc buffer_resowner_desc = .DebugPrint = ResOwnerPrintBuffer }; -/* - * Ensure that the PrivateRefCountArray has sufficient space to store one more - * entry. This has to be called before using NewPrivateRefCountEntry() to fill - * a new entry - but it's perfectly fine to not use a reserved entry. - */ -static void -ReservePrivateRefCountEntry(void) -{ - /* Already reserved (or freed), nothing to do */ - if (ReservedRefCountSlot != -1) - return; - - /* - * First search for a free entry the array, that'll be sufficient in the - * majority of cases. - */ - { - int i; - - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] == InvalidBuffer) - { - ReservedRefCountSlot = i; - - /* - * We could return immediately, but iterating till the end of - * the array allows compiler-autovectorization. - */ - } - } - - if (ReservedRefCountSlot != -1) - return; - } - - /* - * No luck. All array entries are full. Move one array entry into the hash - * table. - */ - { - /* - * Move entry from the current clock position in the array into the - * hashtable. Use that slot. - */ - int victim_slot; - PrivateRefCountEntry *victim_entry; - PrivateRefCountEntry *hashent; - bool found; - - /* select victim slot */ - victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES; - victim_entry = &PrivateRefCountArray[victim_slot]; - ReservedRefCountSlot = victim_slot; - - /* Better be used, otherwise we shouldn't get here. */ - Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer); - Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer); - Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); - - /* enter victim array entry into hashtable */ - hashent = hash_search(PrivateRefCountHash, - &PrivateRefCountArrayKeys[victim_slot], - HASH_ENTER, - &found); - Assert(!found); - /* move data from the entry in the array to the hash entry */ - hashent->data = victim_entry->data; - - /* clear the now free array slot */ - PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; - victim_entry->buffer = InvalidBuffer; - - /* clear the whole data member, just for future proofing */ - memset(&victim_entry->data, 0, sizeof(victim_entry->data)); - victim_entry->data.refcount = 0; - victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK; - - PrivateRefCountOverflowed++; - } -} - -/* - * Fill a previously reserved refcount entry. - */ -static PrivateRefCountEntry * -NewPrivateRefCountEntry(Buffer buffer) -{ - PrivateRefCountEntry *res; - - /* only allowed to be called when a reservation has been made */ - Assert(ReservedRefCountSlot != -1); - - /* use up the reserved entry */ - res = &PrivateRefCountArray[ReservedRefCountSlot]; - - /* and fill it */ - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - res->buffer = buffer; - res->data.refcount = 0; - res->data.lockmode = BUFFER_LOCK_UNLOCK; - - /* update cache for the next lookup */ - PrivateRefCountEntryLast = ReservedRefCountSlot; - - ReservedRefCountSlot = -1; - - return res; -} - -/* - * Slow-path for GetPrivateRefCountEntry(). This is big enough to not be worth - * inlining. This particularly seems to be true if the compiler is capable of - * auto-vectorizing the code, as that imposes additional stack-alignment - * requirements etc. - */ -static pg_noinline PrivateRefCountEntry * -GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) -{ - PrivateRefCountEntry *res; - int match = -1; - int i; - - /* - * First search for references in the array, that'll be sufficient in the - * majority of cases. - */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] == buffer) - { - match = i; - /* see ReservePrivateRefCountEntry() for why we don't return */ - } - } - - if (likely(match != -1)) - { - /* update cache for the next lookup */ - PrivateRefCountEntryLast = match; - - return &PrivateRefCountArray[match]; - } - - /* - * By here we know that the buffer, if already pinned, isn't residing in - * the array. - * - * Only look up the buffer in the hashtable if we've previously overflowed - * into it. - */ - if (PrivateRefCountOverflowed == 0) - return NULL; - - res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL); - - if (res == NULL) - return NULL; - else if (!do_move) - { - /* caller doesn't want us to move the hash entry into the array */ - return res; - } - else - { - /* move buffer from hashtable into the free array slot */ - bool found; - PrivateRefCountEntry *free; - - /* Ensure there's a free array slot */ - ReservePrivateRefCountEntry(); - - /* Use up the reserved slot */ - Assert(ReservedRefCountSlot != -1); - free = &PrivateRefCountArray[ReservedRefCountSlot]; - Assert(PrivateRefCountArrayKeys[ReservedRefCountSlot] == free->buffer); - Assert(free->buffer == InvalidBuffer); - - /* and fill it */ - free->buffer = buffer; - free->data = res->data; - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - /* update cache for the next lookup */ - PrivateRefCountEntryLast = match; - - ReservedRefCountSlot = -1; - - - /* delete from hashtable */ - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); - Assert(PrivateRefCountOverflowed > 0); - PrivateRefCountOverflowed--; - - return free; - } -} - -/* - * Return the PrivateRefCount entry for the passed buffer. - * - * Returns NULL if a buffer doesn't have a refcount entry. Otherwise, if - * do_move is true, and the entry resides in the hashtable the entry is - * optimized for frequent access by moving it to the array. - */ -static inline PrivateRefCountEntry * -GetPrivateRefCountEntry(Buffer buffer, bool do_move) -{ - Assert(BufferIsValid(buffer)); - Assert(!BufferIsLocal(buffer)); - - /* - * It's very common to look up the same buffer repeatedly. To make that - * fast, we have a one-entry cache. - * - * In contrast to the loop in GetPrivateRefCountEntrySlow(), here it - * faster to check PrivateRefCountArray[].buffer, as in the case of a hit - * fewer addresses are computed and fewer cachelines are accessed. Whereas - * in GetPrivateRefCountEntrySlow()'s case, checking - * PrivateRefCountArrayKeys saves a lot of memory accesses. - */ - if (likely(PrivateRefCountEntryLast != -1) && - likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) - { - return &PrivateRefCountArray[PrivateRefCountEntryLast]; - } - - /* - * The code for the cached lookup is small enough to be worth inlining - * into the caller. In the miss case however, that empirically doesn't - * seem worth it. - */ - return GetPrivateRefCountEntrySlow(buffer, do_move); -} - -/* - * Returns how many times the passed buffer is pinned by this backend. - * - * Only works for shared memory buffers! - */ -static inline int32 -GetPrivateRefCount(Buffer buffer) -{ - PrivateRefCountEntry *ref; - - Assert(BufferIsValid(buffer)); - Assert(!BufferIsLocal(buffer)); - - /* - * Not moving the entry - that's ok for the current users, but we might - * want to change this one day. - */ - ref = GetPrivateRefCountEntry(buffer, false); - - if (ref == NULL) - return 0; - return ref->data.refcount; -} - -/* - * Release resources used to track the reference count of a buffer which we no - * longer have pinned and don't want to pin again immediately. - */ -static void -ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref) -{ - Assert(ref->data.refcount == 0); - Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK); - - if (ref >= &PrivateRefCountArray[0] && - ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) - { - ref->buffer = InvalidBuffer; - PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer; - - - /* - * Mark the just used entry as reserved - in many scenarios that - * allows us to avoid ever having to search the array/hash for free - * entries. - */ - ReservedRefCountSlot = ref - PrivateRefCountArray; - } - else - { - bool found; - Buffer buffer = ref->buffer; - - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); - Assert(PrivateRefCountOverflowed > 0); - PrivateRefCountOverflowed--; - } -} - /* * BufferIsPinned * True iff the buffer is pinned (also checks for valid buffer number). @@ -596,7 +215,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref) BufferIsLocal(bufnum) ? \ (LocalRefCount[-(bufnum) - 1] > 0) \ : \ - (GetPrivateRefCount(bufnum) > 0) \ + (GetSharedBufferEntry(bufnum) != NULL) \ ) @@ -653,7 +272,6 @@ static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator, RelFileLocator dstlocator, ForkNumber forkNum, bool permanent); static void AtProcExit_Buffers(int code, Datum arg); -static void CheckForBufferLeaks(void); #ifdef USE_ASSERT_CHECKING static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode); #endif @@ -812,7 +430,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN Assert(BufferIsValid(recent_buffer)); ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); InitBufferTag(&tag, &rlocator, forkNum, blockNum); if (BufferIsLocal(recent_buffer)) @@ -2115,7 +1732,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, /* Make sure we will have room to remember the buffer pin */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); /* create a tag so we can lookup the buffer */ InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum); @@ -2327,7 +1943,7 @@ retry: UnlockBufHdr(buf); LWLockRelease(oldPartitionLock); /* safety check: should definitely not be our *own* pin */ - if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0) + if (GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) != NULL) elog(ERROR, "buffer is pinned in InvalidateBuffer"); WaitIO(buf); goto retry; @@ -2380,7 +1996,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr) LWLock *partition_lock; BufferTag tag; - Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1); + Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)) != NULL); /* have buffer pinned, so it's safe to read tag without lock */ tag = buf_hdr->tag; @@ -2461,7 +2077,6 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context) * Ensure, before we pin a victim buffer, that there's a free refcount * entry and resource owner slot for the pin. */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); /* we return here if a prospective victim buffer gets used concurrently */ @@ -2591,64 +2206,6 @@ again: return buf; } -/* - * Return the maximum number of buffers that a backend should try to pin once, - * to avoid exceeding its fair share. This is the highest value that - * GetAdditionalPinLimit() could ever return. Note that it may be zero on a - * system with a very small buffer pool relative to max_connections. - */ -uint32 -GetPinLimit(void) -{ - return MaxProportionalPins; -} - -/* - * Return the maximum number of additional buffers that this backend should - * pin if it wants to stay under the per-backend limit, considering the number - * of buffers it has already pinned. Unlike LimitAdditionalPins(), the limit - * return by this function can be zero. - */ -uint32 -GetAdditionalPinLimit(void) -{ - uint32 estimated_pins_held; - - /* - * We get the number of "overflowed" pins for free, but don't know the - * number of pins in PrivateRefCountArray. The cost of calculating that - * exactly doesn't seem worth it, so just assume the max. - */ - estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES; - - /* Is this backend already holding more than its fair share? */ - if (estimated_pins_held > MaxProportionalPins) - return 0; - - return MaxProportionalPins - estimated_pins_held; -} - -/* - * Limit the number of pins a batch operation may additionally acquire, to - * avoid running out of pinnable buffers. - * - * One additional pin is always allowed, on the assumption that the operation - * requires at least one to make progress. - */ -void -LimitAdditionalPins(uint32 *additional_pins) -{ - uint32 limit; - - if (*additional_pins <= 1) - return; - - limit = GetAdditionalPinLimit(); - limit = Max(limit, 1); - if (limit < *additional_pins) - *additional_pins = limit; -} - /* * Logic shared between ExtendBufferedRelBy(), ExtendBufferedRelTo(). Just to * avoid duplicating the tracing and relpersistence related logic. @@ -2812,7 +2369,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, /* in case we need to pin an existing buffer below */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); InitBufferTag(&tag, &BMR_GET_SMGR(bmr)->smgr_rlocator.locator, fork, first_block + i); @@ -3185,9 +2741,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, PrivateRefCountEntry *ref; Assert(!BufferIsLocal(b)); - Assert(ReservedRefCountSlot != -1); - ref = GetPrivateRefCountEntry(b, true); + ref = GetSharedBufferEntry(b); if (ref == NULL) { @@ -3257,8 +2812,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, */ result = (pg_atomic_read_u64(&buf->state) & BM_VALID) != 0; - Assert(ref->data.refcount > 0); - ref->data.refcount++; + SharedBufferRefExisting(ref); ResourceOwnerRememberBuffer(CurrentResourceOwner, b); } @@ -3296,7 +2850,7 @@ PinBuffer_Locked(BufferDesc *buf) * As explained, We don't expect any preexisting pins. That allows us to * manipulate the PrivateRefCount after releasing the spinlock */ - Assert(GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf), false) == NULL); + Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) == NULL); /* * Since we hold the buffer spinlock, we can update the buffer state and @@ -3373,11 +2927,10 @@ UnpinBufferNoOwner(BufferDesc *buf) Assert(!BufferIsLocal(b)); /* not moving as we're likely deleting it soon anyway */ - ref = GetPrivateRefCountEntry(b, false); + ref = GetSharedBufferEntry(b); Assert(ref != NULL); - Assert(ref->data.refcount > 0); - ref->data.refcount--; - if (ref->data.refcount == 0) + + if (SharedBufferUnref(ref)) { uint64 old_buf_state; @@ -3402,8 +2955,6 @@ UnpinBufferNoOwner(BufferDesc *buf) /* Support LockBufferForCleanup() */ if (old_buf_state & BM_PIN_COUNT_WAITER) WakePinCountWaiter(buf); - - ForgetPrivateRefCountEntry(ref); } } @@ -3414,10 +2965,7 @@ UnpinBufferNoOwner(BufferDesc *buf) inline void TrackNewBufferPin(Buffer buf) { - PrivateRefCountEntry *ref; - - ref = NewPrivateRefCountEntry(buf); - ref->data.refcount++; + SharedBufferCreateRef(buf); ResourceOwnerRememberBuffer(CurrentResourceOwner, buf); @@ -4037,7 +3585,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) BufferTag tag; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); /* @@ -4101,11 +3648,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) void AtEOXact_Buffers(bool isCommit) { - CheckForBufferLeaks(); + CheckPrivateRefCountLeaks(); AtEOXact_LocalBuffers(isCommit); - - Assert(PrivateRefCountOverflowed == 0); } /* @@ -4118,25 +3663,8 @@ AtEOXact_Buffers(bool isCommit) void InitBufferManagerAccess(void) { - HASHCTL hash_ctl; - - /* - * An advisory limit on the number of pins each backend should hold, based - * on shared_buffers and the maximum number of connections possible. - * That's very pessimistic, but outside toy-sized shared_buffers it should - * allow plenty of pins. LimitAdditionalPins() and - * GetAdditionalPinLimit() can be used to check the remaining balance. - */ - MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); - - memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); - memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - - hash_ctl.keysize = sizeof(Buffer); - hash_ctl.entrysize = sizeof(PrivateRefCountEntry); - PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl, - HASH_ELEM | HASH_BLOBS); + InitPrivateRefCount(); /* * AtProcExit_Buffers needs LWLock access, and thereby has to be called at @@ -4155,112 +3683,13 @@ AtProcExit_Buffers(int code, Datum arg) { UnlockBuffers(); - CheckForBufferLeaks(); + CheckPrivateRefCountLeaks(); /* localbuf.c needs a chance too */ AtProcExit_LocalBuffers(); } -/* - * CheckForBufferLeaks - ensure this backend holds no buffer pins - * - * As of PostgreSQL 8.0, buffer pins should get released by the - * ResourceOwner mechanism. This routine is just a debugging - * cross-check that no pins remain. - */ -static void -CheckForBufferLeaks(void) -{ #ifdef USE_ASSERT_CHECKING - int RefCountErrors = 0; - PrivateRefCountEntry *res; - int i; - char *s; - - /* check the array */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] != InvalidBuffer) - { - res = &PrivateRefCountArray[i]; - - s = DebugPrintBufferRefcount(res->buffer); - elog(WARNING, "buffer refcount leak: %s", s); - pfree(s); - - RefCountErrors++; - } - } - - /* if necessary search the hash */ - if (PrivateRefCountOverflowed) - { - HASH_SEQ_STATUS hstat; - - hash_seq_init(&hstat, PrivateRefCountHash); - while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) - { - s = DebugPrintBufferRefcount(res->buffer); - elog(WARNING, "buffer refcount leak: %s", s); - pfree(s); - RefCountErrors++; - } - } - - Assert(RefCountErrors == 0); -#endif -} - -#ifdef USE_ASSERT_CHECKING -/* - * Check for exclusive-locked catalog buffers. This is the core of - * AssertCouldGetRelation(). - * - * A backend would self-deadlock on the content lock if the catalog scan read - * the exclusive-locked buffer. The main threat is exclusive-locked buffers - * of catalogs used in relcache, because a catcache search on any catalog may - * build that catalog's relcache entry. We don't have an inventory of - * catalogs relcache uses, so just check buffers of most catalogs. - * - * It's better to minimize waits while holding an exclusive buffer lock, so it - * would be nice to broaden this check not to be catalog-specific. However, - * bttextcmp() accesses pg_collation, and non-core opclasses might similarly - * read tables. That is deadlock-free as long as there's no loop in the - * dependency graph: modifying table A may cause an opclass to read table B, - * but it must not cause a read of table A. - */ -void -AssertBufferLocksPermitCatalogRead(void) -{ - PrivateRefCountEntry *res; - - /* check the array */ - for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] != InvalidBuffer) - { - res = &PrivateRefCountArray[i]; - - if (res->buffer == InvalidBuffer) - continue; - - AssertNotCatalogBufferLock(res->buffer, res->data.lockmode); - } - } - - /* if necessary search the hash */ - if (PrivateRefCountOverflowed) - { - HASH_SEQ_STATUS hstat; - - hash_seq_init(&hstat, PrivateRefCountHash); - while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) - { - AssertNotCatalogBufferLock(res->buffer, res->data.lockmode); - } - } -} - static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode) { @@ -4312,8 +3741,10 @@ DebugPrintBufferRefcount(Buffer buffer) } else { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + buf = GetBufferDescriptor(buffer - 1); - loccount = GetPrivateRefCount(buffer); + loccount = ref ? SharedBufferRefCount(ref) : 0; backend = INVALID_PROC_NUMBER; } @@ -5100,7 +4531,6 @@ FlushRelationBuffers(Relation rel) error_context_stack = &errcallback; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); /* @@ -5136,7 +4566,6 @@ FlushRelationBuffers(Relation rel) continue; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); buf_state = LockBufHdr(bufHdr); @@ -5231,7 +4660,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels) continue; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); buf_state = LockBufHdr(bufHdr); @@ -5457,7 +4885,6 @@ FlushDatabaseBuffers(Oid dbid) continue; /* Make sure we can handle the pin */ - ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); buf_state = LockBufHdr(bufHdr); @@ -5532,17 +4959,18 @@ UnlockReleaseBuffer(Buffer buffer) void IncrBufferRefCount(Buffer buffer) { - Assert(BufferIsPinned(buffer)); ResourceOwnerEnlarge(CurrentResourceOwner); if (BufferIsLocal(buffer)) + { + Assert(LocalRefCount[-buffer - 1] > 0); LocalRefCount[-buffer - 1]++; + } else { - PrivateRefCountEntry *ref; + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); - ref = GetPrivateRefCountEntry(buffer, true); Assert(ref != NULL); - ref->data.refcount++; + SharedBufferRefExisting(ref); } ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer); } @@ -5561,11 +4989,9 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate, { Page page = BufferGetPage(buffer); - Assert(GetPrivateRefCount(buffer) > 0); - + Assert(GetSharedBufferEntry(buffer) != NULL); /* here, either share-exclusive or exclusive lock is OK */ - Assert(BufferLockHeldByMeInMode(bufHdr, BUFFER_LOCK_EXCLUSIVE) || - BufferLockHeldByMeInMode(bufHdr, BUFFER_LOCK_SHARE_EXCLUSIVE)); + Assert(BufferIsLockedByMe(buffer)); /* * This routine might get called many times on the same page, if we are @@ -5777,12 +5203,12 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) * Get reference to the refcount entry before we hold the lock, it seems * better to do before holding the lock. */ - entry = GetPrivateRefCountEntry(buffer, true); + entry = GetSharedBufferEntry(buffer); /* * We better not already hold a lock on the buffer. */ - Assert(entry->data.lockmode == BUFFER_LOCK_UNLOCK); + Assert(SharedBufferGetLockMode(entry) == BUFFER_LOCK_UNLOCK); /* * Lock out cancel/die interrupts until we exit the code section protected @@ -5871,7 +5297,7 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) } /* Remember that we now hold this lock */ - entry->data.lockmode = mode; + SharedBufferSetLockMode(entry, mode); /* * Fix the process wait semaphore's count for any absorbed wakeups. @@ -5922,7 +5348,7 @@ BufferLockUnlock(Buffer buffer, BufferDesc *buf_hdr) static bool BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) { - PrivateRefCountEntry *entry = GetPrivateRefCountEntry(buffer, true); + PrivateRefCountEntry *entry = GetSharedBufferEntry(buffer); bool mustwait; /* @@ -5930,7 +5356,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) * already has locked, return false, independent of the existing and * desired lock level. */ - if (entry->data.lockmode != BUFFER_LOCK_UNLOCK) + if (SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK) return false; /* @@ -5950,7 +5376,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) } else { - entry->data.lockmode = mode; + SharedBufferSetLockMode(entry, mode); } return !mustwait; @@ -6160,11 +5586,11 @@ BufferLockDisownInternal(Buffer buffer, BufferDesc *buf_hdr) BufferLockMode mode; PrivateRefCountEntry *ref; - ref = GetPrivateRefCountEntry(buffer, false); + ref = GetSharedBufferEntry(buffer); if (ref == NULL) elog(ERROR, "lock %d is not held", buffer); - mode = ref->data.lockmode; - ref->data.lockmode = BUFFER_LOCK_UNLOCK; + mode = SharedBufferGetLockMode(ref); + SharedBufferSetLockMode(ref, BUFFER_LOCK_UNLOCK); return mode; } @@ -6398,12 +5824,12 @@ static bool BufferLockHeldByMeInMode(BufferDesc *buf_hdr, BufferLockMode mode) { PrivateRefCountEntry *entry = - GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false); + GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)); if (!entry) return false; else - return entry->data.lockmode == mode; + return SharedBufferGetLockMode(entry) == mode; } /* @@ -6416,12 +5842,12 @@ static bool BufferLockHeldByMe(BufferDesc *buf_hdr) { PrivateRefCountEntry *entry = - GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false); + GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)); if (!entry) return false; else - return entry->data.lockmode != BUFFER_LOCK_UNLOCK; + return SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK; } /* @@ -6517,9 +5943,13 @@ CheckBufferIsPinnedOnce(Buffer buffer) } else { - if (GetPrivateRefCount(buffer) != 1) - elog(ERROR, "incorrect local pin count: %d", - GetPrivateRefCount(buffer)); + { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + + if (!ref || !SharedBufferHasSingleRef(ref)) + elog(ERROR, "incorrect pin count: %d", + ref ? SharedBufferRefCount(ref) : 0); + } } } @@ -6700,7 +6130,7 @@ HoldingBufferPinThatDelaysRecovery(void) if (bufid < 0) return false; - if (GetPrivateRefCount(bufid + 1) > 0) + if (GetSharedBufferEntry(bufid + 1) != NULL) return true; return false; @@ -6735,10 +6165,13 @@ ConditionalLockBufferForCleanup(Buffer buffer) } /* There should be exactly one local pin */ - refcount = GetPrivateRefCount(buffer); - Assert(refcount); - if (refcount != 1) - return false; + { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + + Assert(ref != NULL); + if (!SharedBufferHasSingleRef(ref)) + return false; + } /* Try to acquire lock */ if (!ConditionalLockBuffer(buffer)) @@ -6790,8 +6223,12 @@ IsBufferCleanupOK(Buffer buffer) } /* There should be exactly one local pin */ - if (GetPrivateRefCount(buffer) != 1) - return false; + { + PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer); + + if (!SharedBufferHasSingleRef(ref)) + return false; + } bufHdr = GetBufferDescriptor(buffer - 1); @@ -6827,12 +6264,12 @@ SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta PrivateRefCountEntry *ref; BufferLockMode mode; - ref = GetPrivateRefCountEntry(buffer, true); + ref = GetSharedBufferEntry(buffer); if (ref == NULL) elog(ERROR, "buffer is not pinned"); - mode = ref->data.lockmode; + mode = SharedBufferGetLockMode(ref); if (mode == BUFFER_LOCK_UNLOCK) elog(ERROR, "buffer is not locked"); @@ -6874,7 +6311,7 @@ SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta if (likely(pg_atomic_compare_exchange_u64(&buf_hdr->state, &old_state, desired_state))) { - ref->data.lockmode = BUFFER_LOCK_SHARE_EXCLUSIVE; + SharedBufferSetLockMode(ref, BUFFER_LOCK_SHARE_EXCLUSIVE); *lockstate = desired_state; return true; @@ -7647,7 +7084,7 @@ ResOwnerReleaseBuffer(Datum res) { PrivateRefCountEntry *ref; - ref = GetPrivateRefCountEntry(buffer, false); + ref = GetSharedBufferEntry(buffer); /* not having a private refcount would imply resowner corruption */ Assert(ref != NULL); @@ -7656,7 +7093,7 @@ ResOwnerReleaseBuffer(Datum res) * If the buffer was locked at the time of the resowner release, * release the lock now. This should only happen after errors. */ - if (ref->data.lockmode != BUFFER_LOCK_UNLOCK) + if (SharedBufferGetLockMode(ref) != BUFFER_LOCK_UNLOCK) { BufferDesc *buf = GetBufferDescriptor(buffer - 1); @@ -7749,7 +7186,6 @@ EvictUnpinnedBuffer(Buffer buf, bool *buffer_flushed) /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); desc = GetBufferDescriptor(buf - 1); LockBufHdr(desc); @@ -7790,7 +7226,6 @@ EvictAllUnpinnedBuffers(int32 *buffers_evicted, int32 *buffers_flushed, continue; ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); LockBufHdr(desc); @@ -7844,7 +7279,6 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted, /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); buf_state = LockBufHdr(desc); @@ -7936,7 +7370,6 @@ MarkDirtyUnpinnedBuffer(Buffer buf, bool *buffer_already_dirty) /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); desc = GetBufferDescriptor(buf - 1); LockBufHdr(desc); @@ -7989,7 +7422,6 @@ MarkDirtyRelUnpinnedBuffers(Relation rel, /* Make sure we can pin the buffer. */ ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); buf_state = LockBufHdr(desc); @@ -8041,7 +7473,6 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied, continue; ResourceOwnerEnlarge(CurrentResourceOwner); - ReservePrivateRefCountEntry(); LockBufHdr(desc); @@ -8740,3 +8171,12 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = { .complete_local = local_buffer_readv_complete, .report = buffer_readv_report, }; + + +/* + * Intentionally included at the bottom so that the compiler can inline + * functions but developers are forced to use accessors. + * bufmgr.c should not be concerned with bufmgr_refcnt.c implementation + * details. + */ + #include "bufmgr_refcnt.c" diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c new file mode 100644 index 0000000000..fef8f98037 --- /dev/null +++ b/src/backend/storage/buffer/bufmgr_refcnt.c @@ -0,0 +1,615 @@ +/*------------------------------------------------------------------------- + * + * bufmgr_refcnt.c + * Backend-private buffer refcount tracking + * + * This file is included at the end of bufmgr.c to allow the compiler + * to inline functions. Do not compile this file separately. + * + * Each buffer has a private refcount that keeps track of the number of + * times the buffer is pinned in the current process. This is so that the + * shared refcount needs to be modified only once if a buffer is pinned more + * than once by an individual backend. This mechanism is also used to track + * whether this backend has a buffer locked, and, if so, in what mode. + * + * To avoid - as we used to - requiring an array with NBuffers entries to keep + * track of local buffers, we use a small sequentially searched array + * (PrivateRefCountArrayKeys, with the corresponding data stored in + * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to + * keep track of backend local pins. + * + * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all + * refcounts are kept track of in the array; after that, new array entries + * displace old ones into the hash table. That way a frequently used entry + * can't get "stuck" in the hashtable while infrequent ones clog the array. + * + * This was initially designed trying to optimize for the case where the + * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES. + * However this might not be the case with the introduction of prefetching. + * + * To enter a buffer into the refcount tracking mechanism first reserve a free + * entry using ReservePrivateRefCountEntry() and then later, if necessary, + * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing + * memory allocations in NewPrivateRefCountEntry() which can be important + * because in some scenarios it's called with a spinlock held... + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * src/backend/storage/buffer/bufmgr_refcnt.c + * + *------------------------------------------------------------------------- + */ + +#include "utils/hsearch.h" + +/* Structure definitions - internal to this file */ +typedef struct PrivateRefCountData +{ + int32 refcount; + BufferLockMode lockmode; +} PrivateRefCountData; + +struct PrivateRefCountEntry +{ + Buffer buffer; + PrivateRefCountData data; +}; + +struct PrivateRefCountIterator +{ + int array_index; + bool in_hash; + HASH_SEQ_STATUS *hash_status; +}; + +/* Private refcount array and keys */ +#define REFCOUNT_ARRAY_ENTRIES 8 +static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; +static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; + +/* Overflow hash table for when array is full */ +static HTAB *PrivateRefCountHash = NULL; + +/* Count of entries that have overflowed into the hash table */ +static int32 PrivateRefCountOverflowed = 0; + +/* Clock hand for selecting victim when array is full */ +static uint32 PrivateRefCountClock = 0; + +/* Reserved slot index, or -1 if none reserved */ +static int ReservedRefCountSlot = -1; + +/* Cache for last accessed entry */ +static int PrivateRefCountEntryLast = -1; + +/* Advisory limit on the number of pins each backend should hold */ +static uint32 MaxProportionalPins = 0; + +/* Forward declarations */ +static void ReservePrivateRefCountEntry(void); +static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); +static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move); + +/* + * Initialize private refcount tracking for this backend. + */ +void +InitPrivateRefCount(void) +{ + HASHCTL hash_ctl; + + + /* + * An advisory limit on the number of pins each backend should hold, based + * on shared_buffers and the maximum number of connections possible. + * That's very pessimistic, but outside toy-sized shared_buffers it should + * allow plenty of pins. LimitAdditionalPins() and + * GetAdditionalPinLimit() can be used to check the remaining balance. + */ + MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); + + memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); + memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); + + hash_ctl.keysize = sizeof(Buffer); + hash_ctl.entrysize = sizeof(PrivateRefCountEntry); + + PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl, + HASH_ELEM | HASH_BLOBS); +} + +/* + * Ensure that the PrivateRefCountArray has sufficient space to store one more + * entry. + */ +static void +ReservePrivateRefCountEntry(void) +{ + /* Already reserved (or freed), nothing to do */ + if (ReservedRefCountSlot != -1) + return; + + /* + * First search for a free entry the array, that'll be sufficient in the + * majority of cases. + */ + { + int i; + + for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + { + if (PrivateRefCountArrayKeys[i] == InvalidBuffer) + { + ReservedRefCountSlot = i; + } + } + + if (ReservedRefCountSlot != -1) + return; + } + + /* + * No luck. All array entries are full. Move one array entry into the hash + * table. + */ + { + int victim_slot; + PrivateRefCountEntry *victim_entry; + PrivateRefCountEntry *hashent; + bool found; + + /* select victim slot */ + victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES; + victim_entry = &PrivateRefCountArray[victim_slot]; + ReservedRefCountSlot = victim_slot; + + /* Better be used, otherwise we shouldn't get here. */ + Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer); + Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer); + Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); + + /* enter victim array entry into hashtable */ + hashent = hash_search(PrivateRefCountHash, + &PrivateRefCountArrayKeys[victim_slot], + HASH_ENTER, + &found); + Assert(!found); + hashent->data = victim_entry->data; + + /* clear the now free array slot */ + PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; + victim_entry->buffer = InvalidBuffer; + + memset(&victim_entry->data, 0, sizeof(victim_entry->data)); + victim_entry->data.refcount = 0; + victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK; + + PrivateRefCountOverflowed++; + } +} + +/* + * Create a new refcount entry for the given buffer. + */ +static PrivateRefCountEntry * +NewPrivateRefCountEntry(Buffer buffer) +{ + PrivateRefCountEntry *res; + + /* only allowed to be called when a reservation has been made */ + Assert(ReservedRefCountSlot != -1); + + /* use up the reserved entry */ + res = &PrivateRefCountArray[ReservedRefCountSlot]; + + /* and fill it */ + PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; + res->buffer = buffer; + res->data.refcount = 0; + res->data.lockmode = BUFFER_LOCK_UNLOCK; + + /* update cache for the next lookup */ + PrivateRefCountEntryLast = ReservedRefCountSlot; + + ReservedRefCountSlot = -1; + + return res; +} + +/* + * Slow-path for GetSharedBufferEntry(). + */ +static pg_noinline PrivateRefCountEntry * +GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) +{ + PrivateRefCountEntry *res; + int match = -1; + int i; + + /* + * First search for references in the array. + */ + for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + { + if (PrivateRefCountArrayKeys[i] == buffer) + { + match = i; + } + } + + if (likely(match != -1)) + { + PrivateRefCountEntryLast = match; + return &PrivateRefCountArray[match]; + } + + /* + * Only look up the buffer in the hashtable if we've previously overflowed. + */ + if (PrivateRefCountOverflowed == 0) + return NULL; + + res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL); + + if (res == NULL) + return NULL; + else if (!do_move) + { + return res; + } + else + { + /* move buffer from hashtable into the free array slot */ + bool found; + PrivateRefCountEntry *free; + + ReservePrivateRefCountEntry(); + + Assert(ReservedRefCountSlot != -1); + free = &PrivateRefCountArray[ReservedRefCountSlot]; + Assert(free->buffer == InvalidBuffer); + + free->buffer = buffer; + free->data = res->data; + PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; + PrivateRefCountEntryLast = ReservedRefCountSlot; + + ReservedRefCountSlot = -1; + + hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); + Assert(found); + Assert(PrivateRefCountOverflowed > 0); + PrivateRefCountOverflowed--; + + return free; + } +} + +/* + * Return the PrivateRefCountEntry for the passed buffer. + * Returns NULL if the buffer is not currently pinned. + */ +static PrivateRefCountEntry * +GetSharedBufferEntry(Buffer buffer) +{ + Assert(BufferIsValid(buffer)); + Assert(!BufferIsLocal(buffer)); + + /* Fast path: check one-entry cache */ + if (likely(PrivateRefCountEntryLast != -1) && + likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) + { + return &PrivateRefCountArray[PrivateRefCountEntryLast]; + } + + return GetPrivateRefCountEntrySlow(buffer, false); +} + +/* + * Create a new refcount entry for a buffer that is known to not be pinned. + * This is a fast path that skips the cache/hash lookup. + * Returns the new entry pointer with refcount already incremented. + */ +static PrivateRefCountEntry * +SharedBufferCreateRef(Buffer buffer) +{ + PrivateRefCountEntry *ref; + + Assert(BufferIsValid(buffer)); + Assert(!BufferIsLocal(buffer)); + + ReservePrivateRefCountEntry(); + ref = NewPrivateRefCountEntry(buffer); + ref->data.refcount++; + + return ref; +} + +/* + * Increment the private refcount for an existing entry. + * Use when you already have the entry from a previous lookup. + */ +static void +SharedBufferRefExisting(PrivateRefCountEntry *ref) +{ + Assert(ref != NULL); + Assert(ref->data.refcount > 0); + ref->data.refcount++; +} + +/* + * Decrement the private refcount for a buffer. + * If the refcount reaches zero, removes the entry and returns true. + * Returns false if the buffer still has references. + */ +static bool +SharedBufferUnref(PrivateRefCountEntry *ref) +{ + Assert(ref != NULL); + Assert(ref->data.refcount > 0); + + ref->data.refcount--; + + if (ref->data.refcount == 0) + { + /* No more references - clean up the entry */ + Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK); + + if (ref >= &PrivateRefCountArray[0] && + ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) + { + ref->buffer = InvalidBuffer; + PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer; + ReservedRefCountSlot = ref - PrivateRefCountArray; + } + else + { + bool found; + Buffer buffer = ref->buffer; + + hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); + Assert(found); + Assert(PrivateRefCountOverflowed > 0); + PrivateRefCountOverflowed--; + } + + return true; + } + + return false; +} + +/* + * Accessors for refcount entry fields. + */ +static int32 +SharedBufferRefCount(PrivateRefCountEntry *ref) +{ + return ref->data.refcount; +} + +static BufferLockMode +SharedBufferGetLockMode(PrivateRefCountEntry *ref) +{ + return ref->data.lockmode; +} + +static void +SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode) +{ + ref->data.lockmode = mode; +} + +/* + * Check if the entry has exactly one reference. + */ +static bool +SharedBufferHasSingleRef(PrivateRefCountEntry *ref) +{ + return ref != NULL && ref->data.refcount == 1; +} + +/* + * Check for buffer refcount leaks. + */ +void +CheckPrivateRefCountLeaks(void) +{ +#ifdef USE_ASSERT_CHECKING + int RefCountErrors = 0; + PrivateRefCountEntry *res; + int i; + char *s; + + /* check the array */ + for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + { + if (PrivateRefCountArrayKeys[i] != InvalidBuffer) + { + res = &PrivateRefCountArray[i]; + + s = DebugPrintBufferRefcount(res->buffer); + elog(WARNING, "buffer refcount leak: %s", s); + pfree(s); + + RefCountErrors++; + } + } + + /* if necessary search the hash */ + if (PrivateRefCountOverflowed) + { + HASH_SEQ_STATUS hstat; + + hash_seq_init(&hstat, PrivateRefCountHash); + while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) + { + s = DebugPrintBufferRefcount(res->buffer); + elog(WARNING, "buffer refcount leak: %s", s); + pfree(s); + RefCountErrors++; + } + } + + Assert(RefCountErrors == 0); +#endif +} + +#ifdef USE_ASSERT_CHECKING +/* Forward declaration - defined in bufmgr.c */ +static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode); + +/* + * Check for exclusive-locked catalog buffers. This is the core of + * AssertCouldGetRelation(). + */ +void +AssertBufferLocksPermitCatalogRead(void) +{ + PrivateRefCountIterator *iter; + PrivateRefCountEntry *res; + + iter = InitPrivateRefCountIterator(); + while ((res = GetNextPrivateRefCountEntry(iter)) != NULL) + { + Buffer buf = res->buffer; + + if (buf == InvalidBuffer) + continue; + + AssertNotCatalogBufferLock(buf, res->data.lockmode); + } + FreePrivateRefCountIterator(iter); +} +#endif + +/* + * Initialize an iterator for walking all private refcount entries. + */ +PrivateRefCountIterator * +InitPrivateRefCountIterator(void) +{ + PrivateRefCountIterator *iter = palloc(sizeof(PrivateRefCountIterator)); + + iter->array_index = 0; + iter->in_hash = false; + iter->hash_status = NULL; + return iter; +} + +/* + * Get the next private refcount entry. + * Returns NULL when iteration is complete. + */ +PrivateRefCountEntry * +GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) +{ + /* First iterate through the array */ + while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES) + { + int idx = iter->array_index++; + + if (PrivateRefCountArrayKeys[idx] != InvalidBuffer) + return &PrivateRefCountArray[idx]; + } + + /* Then iterate through the hash if there are overflowed entries */ + if (!iter->in_hash) + { + iter->in_hash = true; + if (PrivateRefCountOverflowed > 0) + { + iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS)); + hash_seq_init(iter->hash_status, PrivateRefCountHash); + } + } + + if (iter->hash_status != NULL) + { + PrivateRefCountEntry *res; + + res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status); + if (res != NULL) + return res; + + pfree(iter->hash_status); + iter->hash_status = NULL; + } + + return NULL; +} + +/* + * Free an iterator from InitPrivateRefCountIterator. + */ +void +FreePrivateRefCountIterator(PrivateRefCountIterator *iter) +{ + if (iter->hash_status != NULL) + { + hash_seq_term(iter->hash_status); + pfree(iter->hash_status); + } + pfree(iter); +} + + +/* + * Return the maximum number of buffers that a backend should try to pin once, + * to avoid exceeding its fair share. This is the highest value that + * GetAdditionalPinLimit() could ever return. Note that it may be zero on a + * system with a very small buffer pool relative to max_connections. + */ + uint32 + GetPinLimit(void) + { + return MaxProportionalPins; + } + + /* + * Return the maximum number of additional buffers that this backend should + * pin if it wants to stay under the per-backend limit, considering the number + * of buffers it has already pinned. Unlike LimitAdditionalPins(), the limit + * return by this function can be zero. + */ + uint32 + GetAdditionalPinLimit(void) + { + uint32 estimated_pins_held; + + /* + * We get the number of "overflowed" pins for free, but don't know the + * number of pins in PrivateRefCountArray. The cost of calculating that + * exactly doesn't seem worth it, so just assume the max. + */ + estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES; + + /* Is this backend already holding more than its fair share? */ + if (estimated_pins_held > MaxProportionalPins) + return 0; + + return MaxProportionalPins - estimated_pins_held; + } + + /* + * Limit the number of pins a batch operation may additionally acquire, to + * avoid running out of pinnable buffers. + * + * One additional pin is always allowed, on the assumption that the operation + * requires at least one to make progress. + */ + void + LimitAdditionalPins(uint32 *additional_pins) + { + uint32 limit; + + if (*additional_pins <= 1) + return; + + limit = GetAdditionalPinLimit(); + limit = Max(limit, 1); + if (limit < *additional_pins) + *additional_pins = limit; + } diff --git a/src/backend/storage/buffer/bufmgr_refcnt.h b/src/backend/storage/buffer/bufmgr_refcnt.h new file mode 100644 index 0000000000..f2a6f45d84 --- /dev/null +++ b/src/backend/storage/buffer/bufmgr_refcnt.h @@ -0,0 +1,59 @@ +/*------------------------------------------------------------------------- + * + * bufmgr_refcnt.h + * Backend-private buffer refcount tracking + * + * This header provides opaque declarations for the private refcount + * tracking system. The implementation is in bufmgr_refcnt.c which is + * included at the end of bufmgr.c for inlining. + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/backend/storage/buffer/bufmgr_refcnt.h + * + *------------------------------------------------------------------------- + */ +#ifndef BUFMGR_REFCNT_H +#define BUFMGR_REFCNT_H + +#include "storage/buf.h" +#include "storage/bufmgr.h" + +/* Opaque handle to a private refcount entry */ +typedef struct PrivateRefCountEntry PrivateRefCountEntry; + +/* Opaque handle to an iterator */ +typedef struct PrivateRefCountIterator PrivateRefCountIterator; + +/* Initialization */ +extern void InitPrivateRefCount(void); + +/* + * Hot-path functions - forward declarations. + * Defined as static in bufmgr_refcnt.c which is included in bufmgr.c. + */ +static PrivateRefCountEntry *GetSharedBufferEntry(Buffer buffer); +static PrivateRefCountEntry *SharedBufferCreateRef(Buffer buffer); +static void SharedBufferRefExisting(PrivateRefCountEntry *ref); +static bool SharedBufferUnref(PrivateRefCountEntry *ref); +static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref); +static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode); +static int32 SharedBufferRefCount(PrivateRefCountEntry *ref); +static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref); + +/* Pin limiting */ +extern uint32 GetPinLimit(void); +extern uint32 GetAdditionalPinLimit(void); +extern void LimitAdditionalPins(uint32 *additional_pins); + +/* Leak checking */ +extern void CheckPrivateRefCountLeaks(void); + +/* Iterator functions */ +extern PrivateRefCountIterator *InitPrivateRefCountIterator(void); +extern PrivateRefCountEntry *GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter); +extern void FreePrivateRefCountIterator(PrivateRefCountIterator *iter); + + +#endif /* BUFMGR_REFCNT_H */ -- 2.34.1 [application/x-patch] v2-0003-Using-simplehash.patch (13.2K, 6-v2-0003-Using-simplehash.patch) download | inline diff: From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Wed, 11 Mar 2026 12:07:54 +0000 Subject: [PATCH 3/5] Using simplehash This patch replaces the HTAB implementation with a simplehash as suggested by Andres Freund [1]. The simplehash implements a templated open addressing hash, which have entry size information at compile time. The access strategy of the simplehash is very close to the plain array that was seen to be very efficient compared to the HTAB implementation, with the additional advantage of using the key value to initialize the search closer to where the key actually is, instead of always scanning the entire array. Adds one macro to the simplehash implementation enable overriding the empty position detection, to enable using buffer == InvalidBuffer instead of requiring a separate status field. [1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5mw4 --- src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++----------- src/include/lib/simplehash.h | 59 +++++++++----- 2 files changed, 84 insertions(+), 64 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c index fef8f98037..dd95d953f6 100644 --- a/src/backend/storage/buffer/bufmgr_refcnt.c +++ b/src/backend/storage/buffer/bufmgr_refcnt.c @@ -42,7 +42,10 @@ *------------------------------------------------------------------------- */ -#include "utils/hsearch.h" +#include "common/hashfn.h" +#include "storage/buf_internals.h" +#include "bufmgr_refcnt.h" +#include "storage/proc.h" /* Structure definitions - internal to this file */ typedef struct PrivateRefCountData @@ -53,15 +56,36 @@ typedef struct PrivateRefCountData struct PrivateRefCountEntry { - Buffer buffer; + Buffer buffer; /* hash key - InvalidBuffer means empty */ PrivateRefCountData data; }; +/* + * Define simplehash parameters for the overflow hash table. + * We use buffer == InvalidBuffer as the "empty" marker to avoid needing + * a separate status field. + */ +#define SH_PREFIX refcount +#define SH_ELEMENT_TYPE PrivateRefCountEntry +#define SH_KEY_TYPE Buffer +#define SH_KEY buffer +#define SH_HASH_KEY(tb, key) murmurhash32(key) +#define SH_EQUAL(tb, a, b) ((a) == (b)) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +/* Use buffer field as empty marker - no separate status needed */ +#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer) +#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer) +#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment handles this */ +#include "lib/simplehash.h" + struct PrivateRefCountIterator { int array_index; bool in_hash; - HASH_SEQ_STATUS *hash_status; + refcount_iterator hash_iter; + bool hash_iter_valid; }; /* Private refcount array and keys */ @@ -70,7 +94,7 @@ static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; /* Overflow hash table for when array is full */ -static HTAB *PrivateRefCountHash = NULL; +static refcount_hash *PrivateRefCountHash = NULL; /* Count of entries that have overflowed into the hash table */ static int32 PrivateRefCountOverflowed = 0; @@ -98,26 +122,18 @@ static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buff void InitPrivateRefCount(void) { - HASHCTL hash_ctl; - - /* * An advisory limit on the number of pins each backend should hold, based * on shared_buffers and the maximum number of connections possible. * That's very pessimistic, but outside toy-sized shared_buffers it should * allow plenty of pins. LimitAdditionalPins() and * GetAdditionalPinLimit() can be used to check the remaining balance. - */ - MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); - + */ + MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - hash_ctl.keysize = sizeof(Buffer); - hash_ctl.entrysize = sizeof(PrivateRefCountEntry); - - PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl, - HASH_ELEM | HASH_BLOBS); + PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL); } /* @@ -171,10 +187,9 @@ ReservePrivateRefCountEntry(void) Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); /* enter victim array entry into hashtable */ - hashent = hash_search(PrivateRefCountHash, - &PrivateRefCountArrayKeys[victim_slot], - HASH_ENTER, - &found); + hashent = refcount_insert(PrivateRefCountHash, + PrivateRefCountArrayKeys[victim_slot], + &found); Assert(!found); hashent->data = victim_entry->data; @@ -251,7 +266,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) if (PrivateRefCountOverflowed == 0) return NULL; - res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL); + res = refcount_lookup(PrivateRefCountHash, buffer); if (res == NULL) return NULL; @@ -262,7 +277,6 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) else { /* move buffer from hashtable into the free array slot */ - bool found; PrivateRefCountEntry *free; ReservePrivateRefCountEntry(); @@ -278,8 +292,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move) ReservedRefCountSlot = -1; - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); + refcount_delete_item(PrivateRefCountHash, res); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; @@ -366,11 +379,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref) } else { - bool found; - Buffer buffer = ref->buffer; - - hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found); - Assert(found); + refcount_delete_item(PrivateRefCountHash, ref); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; } @@ -441,10 +450,10 @@ CheckPrivateRefCountLeaks(void) /* if necessary search the hash */ if (PrivateRefCountOverflowed) { - HASH_SEQ_STATUS hstat; + refcount_iterator iter; - hash_seq_init(&hstat, PrivateRefCountHash); - while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL) + refcount_start_iterate(PrivateRefCountHash, &iter); + while ((res = refcount_iterate(PrivateRefCountHash, &iter)) != NULL) { s = DebugPrintBufferRefcount(res->buffer); elog(WARNING, "buffer refcount leak: %s", s); @@ -495,7 +504,7 @@ InitPrivateRefCountIterator(void) iter->array_index = 0; iter->in_hash = false; - iter->hash_status = NULL; + iter->hash_iter_valid = false; return iter; } @@ -521,21 +530,20 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) iter->in_hash = true; if (PrivateRefCountOverflowed > 0) { - iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS)); - hash_seq_init(iter->hash_status, PrivateRefCountHash); + refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter); + iter->hash_iter_valid = true; } } - if (iter->hash_status != NULL) + if (iter->hash_iter_valid) { PrivateRefCountEntry *res; - res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status); + res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter); if (res != NULL) return res; - pfree(iter->hash_status); - iter->hash_status = NULL; + iter->hash_iter_valid = false; } return NULL; @@ -547,11 +555,6 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) void FreePrivateRefCountIterator(PrivateRefCountIterator *iter) { - if (iter->hash_status != NULL) - { - hash_seq_term(iter->hash_status); - pfree(iter->hash_status); - } pfree(iter); } diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h index 848719232a..3c03a7e9c9 100644 --- a/src/include/lib/simplehash.h +++ b/src/include/lib/simplehash.h @@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb); #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey)) #endif +/* + * Macros to check/set entry status. Users can override these to avoid + * needing a separate status field if their key type has an "invalid" value. + */ +#ifndef SH_ENTRY_EMPTY +#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY) +#endif +#ifndef SH_MAKE_EMPTY +#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY) +#endif +#ifndef SH_MAKE_IN_USE +#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE) +#endif + /* * Wrap the following definitions in include guards, to avoid multiple * definition errors if this header is included more than once. The rest of @@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) uint32 hash; uint32 optimal; - if (oldentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(oldentry)) { startelem = i; break; @@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) { SH_ELEMENT_TYPE *oldentry = &olddata[copyelem]; - if (oldentry->status == SH_STATUS_IN_USE) + if (!SH_ENTRY_EMPTY(oldentry)) { uint32 hash; uint32 startelem2; @@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) { newentry = &newdata[curelem]; - if (newentry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(newentry)) { break; } @@ -653,14 +667,14 @@ restart: SH_ELEMENT_TYPE *entry = &data[curelem]; /* any empty bucket can directly be used */ - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) { tb->members++; entry->SH_KEY = key; #ifdef SH_STORE_HASH SH_GET_HASH(tb, entry) = hash; #endif - entry->status = SH_STATUS_IN_USE; + SH_MAKE_IN_USE(entry); *found = false; return entry; } @@ -675,7 +689,7 @@ restart: if (SH_COMPARE_KEYS(tb, hash, key, entry)) { - Assert(entry->status == SH_STATUS_IN_USE); + Assert(!SH_ENTRY_EMPTY(entry)); *found = true; return entry; } @@ -699,7 +713,7 @@ restart: emptyelem = SH_NEXT(tb, emptyelem, startelem); emptyentry = &data[emptyelem]; - if (emptyentry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(emptyentry)) { lastentry = emptyentry; break; @@ -748,7 +762,7 @@ restart: #ifdef SH_STORE_HASH SH_GET_HASH(tb, entry) = hash; #endif - entry->status = SH_STATUS_IN_USE; + SH_MAKE_IN_USE(entry); *found = false; return entry; } @@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash) { SH_ELEMENT_TYPE *entry = &tb->data[curelem]; - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) { return NULL; } - Assert(entry->status == SH_STATUS_IN_USE); + Assert(!SH_ENTRY_EMPTY(entry)); if (SH_COMPARE_KEYS(tb, hash, key, entry)) return entry; @@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) { SH_ELEMENT_TYPE *entry = &tb->data[curelem]; - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) return false; - if (entry->status == SH_STATUS_IN_USE && + if (!SH_ENTRY_EMPTY(entry) && SH_COMPARE_KEYS(tb, hash, key, entry)) { SH_ELEMENT_TYPE *lastentry = entry; @@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) curelem = SH_NEXT(tb, curelem, startelem); curentry = &tb->data[curelem]; - if (curentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(curentry)) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) /* current is at optimal position, done */ if (curoptimal == curelem) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry) curelem = SH_NEXT(tb, curelem, startelem); curentry = &tb->data[curelem]; - if (curentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(curentry)) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry) /* current is at optimal position, done */ if (curoptimal == curelem) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter) { SH_ELEMENT_TYPE *entry = &tb->data[i]; - if (entry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(entry)) { startelem = i; break; @@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter) if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask)) iter->done = true; - if (elem->status == SH_STATUS_IN_USE) + if (!SH_ENTRY_EMPTY(elem)) { return elem; } @@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb) elem = &tb->data[i]; - if (elem->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(elem)) continue; hash = SH_ENTRY_HASH(tb, elem); @@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb) #undef SH_STORE_HASH #undef SH_USE_NONDEFAULT_ALLOCATOR #undef SH_EQUAL +#undef SH_ENTRY_EMPTY +#undef SH_MAKE_EMPTY +#undef SH_MAKE_IN_USE /* undefine locally declared macros */ #undef SH_MAKE_PREFIX -- 2.34.1 [application/x-patch] v2-0001-Benchmark-buffer-pinning.patch (22.3K, 7-v2-0001-Benchmark-buffer-pinning.patch) download | inline diff: From 36b69e2028c8febb35ca2c45327384251c44b775 Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Wed, 11 Mar 2026 12:05:50 +0000 Subject: [PATCH 1/5] Benchmark buffer pinning --- src/test/modules/test_buffer_pin/Makefile | 18 + src/test/modules/test_buffer_pin/benchmark.py | 185 ++++++++++ .../modules/test_buffer_pin/requirements.txt | 9 + .../test_buffer_pin/test_buffer_pin--1.0.sql | 66 ++++ .../modules/test_buffer_pin/test_buffer_pin.c | 323 ++++++++++++++++++ .../test_buffer_pin/test_buffer_pin.control | 4 + 6 files changed, 605 insertions(+) create mode 100644 src/test/modules/test_buffer_pin/Makefile create mode 100755 src/test/modules/test_buffer_pin/benchmark.py create mode 100644 src/test/modules/test_buffer_pin/requirements.txt create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.c create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.control diff --git a/src/test/modules/test_buffer_pin/Makefile b/src/test/modules/test_buffer_pin/Makefile new file mode 100644 index 0000000000..2707f84c5b --- /dev/null +++ b/src/test/modules/test_buffer_pin/Makefile @@ -0,0 +1,18 @@ +MODULE_big = test_buffer_pin +OBJS = test_buffer_pin.o + +EXTENSION = test_buffer_pin +DATA = test_buffer_pin--1.0.sql + +PGFILEDESC = "test_buffer_pin - buffer pinning benchmarks" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/test_buffer_pin +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/src/test/modules/test_buffer_pin/benchmark.py b/src/test/modules/test_buffer_pin/benchmark.py new file mode 100755 index 0000000000..004044bc5c --- /dev/null +++ b/src/test/modules/test_buffer_pin/benchmark.py @@ -0,0 +1,185 @@ +#!/usr/bin/env python3 +""" +Buffer pinning benchmark - runs SQL benchmarks and saves results. + +Usage: + python3 benchmark.py [options] + +Examples: + python3 benchmark.py --port 5434 --name pg18-baseline --title "PostgreSQL 18 Baseline" + python3 benchmark.py --port 5434 --name patch-0005 --title "With Patch 0005" +""" + +import argparse +import numpy as np +import pandas as pd +from sqlalchemy import create_engine, text +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from pathlib import Path + + +def geomspace_int(start, stop, num): + """Generate geometrically spaced integers, pushing duplicates forward.""" + raw = np.geomspace(start, stop, num) + result = [] + for v in raw: + candidate = max(int(round(v)), result[-1] + 1 if result else 1) + result.append(candidate) + return result + + +def ensure_test_table(conn, table_name, min_blocks): + """Create a table with enough blocks for the benchmark.""" + rows_needed = min_blocks * 226 + 10000 + print(f"Creating table {table_name} with ~{min_blocks} blocks...") + conn.execute(text(f"DROP TABLE IF EXISTS {table_name}")) + conn.execute(text(f"CREATE UNLOGGED TABLE {table_name} AS SELECT generate_series(1, {rows_needed}) AS id")) + result = conn.execute(text(f"SELECT pg_relation_size('{table_name}') / 8192 as blocks")).fetchone() + assert result[0] >= min_blocks, f"Table {table_name} has {result[0]} blocks, expected at least {min_blocks}" + +def run_benchmark(conn, func_template, buffer_counts): + """Run a benchmark function across all buffer counts and patterns in one query.""" + array_str = str(buffer_counts) + sql = text(f""" + SELECT + CASE WHEN r.random THEN 'random' ELSE 'sequential' END as pattern, + b.num_buffers, + percentile_cont(0.5) WITHIN GROUP (ORDER BY bench.per_op_ns) as median_ns + FROM + unnest(ARRAY[false, true]) AS r(random), + unnest(ARRAY[{array_str}]) AS b(num_buffers), + LATERAL {func_template} AS bench + WHERE bench.iteration > 0 + GROUP BY r.random, b.num_buffers + ORDER BY r.random, b.num_buffers + """) + return [dict(row._mapping) for row in conn.execute(sql)] + + +def save_results(results_dir, name, results_dict, title): + """Save benchmark results to CSV and generate SVG plot.""" + results_dir = Path(results_dir) + results_dir.mkdir(parents=True, exist_ok=True) + + all_data = [] + for bench_name, data in results_dict.items(): + df = pd.DataFrame(data) + df['benchmark'] = bench_name + all_data.append(df) + + combined_df = pd.concat(all_data, ignore_index=True) + + csv_path = results_dir / f"{name}.csv" + combined_df.to_csv(csv_path, index=False) + print(f"Saved CSV: {csv_path}") + return combined_df + + +def plot_results(results_dir, name, results_dict, title): + """Generate SVG plot from benchmark results.""" + results_dir = Path(results_dir) + + fig, ax = plt.subplots(figsize=(10, 6)) + + benchmarks = [ + ('read', 'blue', 'o', 'ReadBuffer/ReleaseBuffer'), + ('pinning', 'green', 's', 'IncrBufferRefCount/ReleaseBuffer'), + ('locking', 'purple', 'd', 'LockBuffer/UnlockBuffer'), + ('resowner', 'red', '^', 'ResourceOwner Remember/Forget'), + ] + + for bench_name, color, marker, label in benchmarks: + if bench_name not in results_dict: + continue + df = pd.DataFrame(results_dict[bench_name]) + + seq_data = df[df['pattern'] == 'sequential'] + ax.plot(seq_data['num_buffers'], seq_data['median_ns'], + color=color, linestyle='-', marker=marker, + linewidth=2, markersize=6, label=f'{label} (seq)') + + rand_data = df[df['pattern'] == 'random'] + ax.plot(rand_data['num_buffers'], rand_data['median_ns'], + color=color, linestyle='--', marker=marker, + linewidth=2, markersize=6, label=f'{label} (rand)') + + ax.set_xlabel('Number of Buffers (sliding window)') + ax.set_ylabel('Time per operation (ns)') + ax.set_title(title) + ax.set_ylim(0, None) + ax.grid(True, alpha=0.3) + ax.set_xscale('log', base=2) + ax.legend(fontsize=8, loc='upper left') + + plt.tight_layout() + svg_path = results_dir / f"{name}.svg" + plt.savefig(svg_path, format='svg') + plt.close() + print(f"Saved SVG: {svg_path}") + + +def main(): + parser = argparse.ArgumentParser(description='Buffer pinning benchmark') + parser.add_argument('--iterations', '-n', type=int, default=10000, + help='Number of operations per sample (default: 10000)') + parser.add_argument('--samples', type=int, default=50, + help='Number of samples per data point (default: 50)') + parser.add_argument('--port', type=int, default=5432, + help='PostgreSQL port (default: 5432)') + parser.add_argument('--points', type=int, default=50, + help='Number of data points (default: 200)') + parser.add_argument('--max-dist', type=int, default=500, + help='Maximum buffer count to test (default: 512)') + parser.add_argument('--name', default='benchmark', + help='Name for output files (default: benchmark)') + parser.add_argument('--title', default='PostgreSQL Buffer Pinning Performance', + help='Title for the plot') + parser.add_argument('--results-dir', default=None, + help='Directory for results (default: ./results)') + parser.add_argument('--table', default='bench_large', + help='Table name to use for benchmarks (default: bench_large)') + args = parser.parse_args() + + if args.results_dir is None: + args.results_dir = Path(__file__).parent / 'results' + + engine = create_engine(f'postgresql://localhost:{args.port}/postgres') + buffer_counts = geomspace_int(1, args.max_dist, args.points) + + min_blocks_needed = args.max_dist + 100 + + with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn: + conn.execute(text("CREATE EXTENSION IF NOT EXISTS test_buffer_pin")) + ensure_test_table(conn, args.table, min_blocks_needed) + + + n, s = args.iterations, args.samples + print(f"\nRunning benchmarks: {n} iterations, {s} samples, max {args.max_dist} buffers") + + results = {} + + print(" Running: bench_pinning...") + results['pinning'] = run_benchmark(conn, + f"bench_pinning('{args.table}', b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + print(" Running: bench_prefetch_pipeline...") + results['read'] = run_benchmark(conn, + f"bench_prefetch_pipeline('{args.table}', b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + print(" Running: bench_resowner...") + results['resowner'] = run_benchmark(conn, + f"bench_resowner(b.num_buffers, {n}, {s}, r.random)", + buffer_counts) + + save_results(args.results_dir, args.name, results, args.title) + plot_results(args.results_dir, args.name, results, args.title) + + print(f"\nDone! Results saved to {args.results_dir}/{args.name}.*") + + +if __name__ == '__main__': + main() diff --git a/src/test/modules/test_buffer_pin/requirements.txt b/src/test/modules/test_buffer_pin/requirements.txt new file mode 100644 index 0000000000..e73ab56927 --- /dev/null +++ b/src/test/modules/test_buffer_pin/requirements.txt @@ -0,0 +1,9 @@ +# Requirements for benchmark.py +# pip install -r requirements.txt +# not enforcing versions, as it might simply +# work with your installed versions +numpy # tested with 1.24+ +pandas # tested with 2.0+ +sqlalchemy # tested with 2.0+ +matplotlib # tested with 3.7+ +psycopg2-binary # tested with 2.9+ diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql new file mode 100644 index 0000000000..4c43c41496 --- /dev/null +++ b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql @@ -0,0 +1,66 @@ +/* test_buffer_pin--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION test_buffer_pin" to load this file. \quit + +CREATE FUNCTION bench_prefetch_pipeline( + relname text, + prefetch_dist int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_prefetch_pipeline' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_prefetch_pipeline IS +'Benchmark buffer pinning with sliding window prefetch simulation. +Arguments: + relname - name of the relation to use + prefetch_dist - number of buffers to keep pinned (sliding window size) + num_ops - number of pin/unpin operations to perform + iterations - number of times to repeat the benchmark + random_access - if true, access blocks randomly; if false, sequentially +'; + +CREATE FUNCTION bench_pinning( + relname text, + num_buffers int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_pinning' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_pinning IS +'Benchmark pure pin/unpin operations without buffer lookup. +Uses IncrBufferRefCount/ReleaseBuffer on pre-pinned buffers to isolate +refcount tracking overhead from buffer table lookups and I/O. +Arguments: + relname - name of the relation to use for pinning buffers + num_buffers - number of buffers to keep as base pins + num_ops - number of pin/unpin pairs to perform + iterations - number of times to repeat the benchmark + random_access - if true, access buffers randomly; if false, sequentially'; + +CREATE FUNCTION bench_resowner( + num_buffers int, + num_ops int, + iterations int, + random_access bool +) +RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8) +AS 'MODULE_PATHNAME', 'bench_resowner' +LANGUAGE C STRICT; + +COMMENT ON FUNCTION bench_resowner IS +'Benchmark ResourceOwner remember/forget operations only. +Uses fake resource values - no actual resources are tracked. +Arguments: + num_buffers - number of fake resources to track + num_ops - number of remember/forget pairs to perform + iterations - number of times to repeat the benchmark + random_access - if true, access resources randomly; if false, sequentially'; diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.c b/src/test/modules/test_buffer_pin/test_buffer_pin.c new file mode 100644 index 0000000000..7b6c91678f --- /dev/null +++ b/src/test/modules/test_buffer_pin/test_buffer_pin.c @@ -0,0 +1,323 @@ +/* + * test_buffer_pin.c - Buffer pinning benchmark + */ + +#include "postgres.h" + +#include "access/heapam.h" +#include "catalog/namespace.h" +#include "common/pg_prng.h" +#include "fmgr.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "portability/instr_time.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "utils/builtins.h" +#include "utils/rel.h" +#include "utils/resowner.h" +#include "utils/varlena.h" + +PG_MODULE_MAGIC; + +PG_FUNCTION_INFO_V1(bench_prefetch_pipeline); +PG_FUNCTION_INFO_V1(bench_pinning); +PG_FUNCTION_INFO_V1(bench_resowner); + +/* Custom ResourceOwnerDesc for benchmark - does nothing on release */ +static void bench_release_resource(Datum res) { /* no-op */ } + +static const ResourceOwnerDesc bench_resowner_desc = { + .name = "BenchmarkResource", + .release_phase = RESOURCE_RELEASE_AFTER_LOCKS, + .release_priority = RELEASE_PRIO_FIRST, + .ReleaseResource = bench_release_resource, + .DebugPrint = NULL, +}; + +/* + * Generate an access sequence for benchmarking. + * If random_access is true, uses Fisher-Yates shuffle. + * Otherwise, generates sequential pattern modulo num_items. + */ +static void +generate_access_sequence(int *sequence, int num_operations, int num_items, bool random_access) +{ + for (int i = 0; i < num_operations; i++) + sequence[i] = i % num_items; + + if (random_access) + { + for (int i = num_operations - 1; i > 0; i--) + { + int j = pg_prng_uint64_range(&pg_global_prng_state, 0, i); + int tmp = sequence[i]; + sequence[i] = sequence[j]; + sequence[j] = tmp; + } + } +} + + +/* + * bench_pinning - benchmark pure pin/unpin operations + * + * Warms up the cache by reading all blocks in the relation. + * Precompute a block sequence in which the buffers will be read. + * Then scan times repeatedly a scan following the block sequence. + * with a fixed distance of `num_buffers` (plus ramp up and ramp down) +*/ +Datum +bench_prefetch_pipeline(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + text *relname_text = PG_GETARG_TEXT_PP(0); + int num_buffers = PG_GETARG_INT32(1); + int num_operations = PG_GETARG_INT32(2); + int iterations = PG_GETARG_INT32(3); + bool random_access = PG_GETARG_BOOL(4); + + Oid relid; + Relation rel; + BlockNumber nblocks; + Buffer *pipeline; + int *block_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false); + rel = relation_open(relid, AccessShareLock); + nblocks = RelationGetNumberOfBlocks(rel); + + if (nblocks == 0) + ereport(ERROR, (errmsg("relation has no blocks"))); + + pipeline = palloc0(num_buffers * sizeof(Buffer)); + block_sequence = palloc(num_operations * sizeof(int)); + + /* Warm up */ + for (int i = 0; i < num_operations && i < (int) nblocks; i++) + { + Buffer buf = ReadBuffer(rel, i); + ReleaseBuffer(buf); + } + generate_access_sequence(block_sequence, num_operations, nblocks, random_access); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]); + if (i < num_operations) + pipeline[i % num_buffers] = ReadBuffer(rel, block_sequence[i]); + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + pfree(pipeline); + pfree(block_sequence); + relation_close(rel, AccessShareLock); + + return (Datum) 0; +} + +/* + * bench_pinning - benchmark pure pin/unpin operations + * + * Pre-reads buffers into cache, then times IncrBufferRefCount/ReleaseBuffer + * cycles in a pipelined fashion. This is intended to separate the + * pin count dependent part of the ReadBuffer/ReleaseBuffer operations + */ +Datum +bench_pinning(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + text *relname_text = PG_GETARG_TEXT_PP(0); + int num_buffers = PG_GETARG_INT32(1); + int num_operations = PG_GETARG_INT32(2); + int iterations = PG_GETARG_INT32(3); + bool random_access = PG_GETARG_BOOL(4); + + Oid relid; + Relation rel; + BlockNumber nblocks; + Buffer *base_buffers; + Buffer *pipeline; + int *access_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false); + rel = relation_open(relid, AccessShareLock); + nblocks = RelationGetNumberOfBlocks(rel); + + if ((BlockNumber) num_buffers > nblocks) + ereport(ERROR, (errmsg("not enough blocks in relation"))); + + base_buffers = palloc(num_buffers * sizeof(Buffer)); + pipeline = palloc(num_buffers * sizeof(Buffer)); + access_sequence = palloc(num_operations * sizeof(int)); + + generate_access_sequence(access_sequence, num_operations, num_buffers, random_access); + + /* Pin the buffers as base pins (keeps them in cache) */ + for (int i = 0; i < num_buffers; i++) + base_buffers[i] = ReadBuffer(rel, i); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]); + if (i < num_operations) + { + Buffer buf = base_buffers[access_sequence[i]]; + IncrBufferRefCount(buf); + pipeline[i % num_buffers] = buf; + } + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* Release base pins */ + for (int i = 0; i < num_buffers; i++) + ReleaseBuffer(base_buffers[i]); + + pfree(access_sequence); + pfree(pipeline); + pfree(base_buffers); + relation_close(rel, AccessShareLock); + + return (Datum) 0; +} + +/* + * bench_resowner - benchmark ResourceOwner remember/forget operations + * + * Uses fake resource values to test pure ResourceOwner tracking overhead + * without any actual resource operations. Uses same pipelined structure. + */ +Datum +bench_resowner(PG_FUNCTION_ARGS) +{ + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + int num_buffers = PG_GETARG_INT32(0); + int num_operations = PG_GETARG_INT32(1); + int iterations = PG_GETARG_INT32(2); + bool random_access = PG_GETARG_BOOL(3); + + Datum *resources; + Datum *pipeline; + int *access_sequence; + int iter; + + Tuplestorestate *tupstore; + TupleDesc tupdesc; + Datum values[3]; + bool nulls[3] = {false, false, false}; + + if (num_buffers < 1 || num_operations < num_buffers) + ereport(ERROR, (errmsg("invalid parameters"))); + + InitMaterializedSRF(fcinfo, 0); + tupstore = rsinfo->setResult; + tupdesc = rsinfo->setDesc; + + resources = palloc(num_buffers * sizeof(Datum)); + pipeline = palloc(num_buffers * sizeof(Datum)); + access_sequence = palloc(num_operations * sizeof(int)); + + /* Use fake resource values (pointers to array elements for uniqueness) */ + for (int i = 0; i < num_buffers; i++) + resources[i] = PointerGetDatum(&resources[i]); + + generate_access_sequence(access_sequence, num_operations, num_buffers, random_access); + + for (iter = 0; iter < iterations; iter++) + { + instr_time start_time, end_time; + int64 elapsed_ns; + + INSTR_TIME_SET_CURRENT(start_time); + + for (int i = 0; i < num_operations + num_buffers; i++) + { + if (i >= num_buffers) + ResourceOwnerForget(CurrentResourceOwner, + pipeline[(i - num_buffers) % num_buffers], + &bench_resowner_desc); + if (i < num_operations) + { + Datum res = resources[access_sequence[i]]; + ResourceOwnerEnlarge(CurrentResourceOwner); + ResourceOwnerRemember(CurrentResourceOwner, res, &bench_resowner_desc); + pipeline[i % num_buffers] = res; + } + } + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_SUBTRACT(end_time, start_time); + elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time); + + values[0] = Int32GetDatum(iter); + values[1] = Int64GetDatum(elapsed_ns); + values[2] = Float8GetDatum((double) elapsed_ns / num_operations); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + pfree(access_sequence); + pfree(pipeline); + pfree(resources); + + return (Datum) 0; +} diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.control b/src/test/modules/test_buffer_pin/test_buffer_pin.control new file mode 100644 index 0000000000..f065941712 --- /dev/null +++ b/src/test/modules/test_buffer_pin/test_buffer_pin.control @@ -0,0 +1,4 @@ +comment = 'test_buffer_pin - buffer pinning benchmarks' +default_version = '1.0' +module_pathname = '$libdir/test_buffer_pin' +relocatable = true -- 2.34.1 [application/x-patch] v2-0005-Array-direct-mapping.patch (15.3K, 8-v2-0005-Array-direct-mapping.patch) download | inline diff: From 33138ffb0d834bc452e9d2d1819f21871c64210f Mon Sep 17 00:00:00 2001 From: bob <[email protected]> Date: Wed, 11 Mar 2026 12:08:04 +0000 Subject: [PATCH 5/5] Array direct mapping Instead of searching all array entries, each buffer is mapped to exactly one slot at index (buffer % REFCOUNT_ARRAY_ENTRIES). Collisions are resolved by evicting to the hash table. This simplifies lookup to O(1) for the array check and should improve performance for num_buffers > 8 while maintaining performance for small numbers of buffers. --- src/backend/storage/buffer/bufmgr_refcnt.c | 307 +++++++++------------ 1 file changed, 132 insertions(+), 175 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c index bf84e9f62e..726697ebf7 100644 --- a/src/backend/storage/buffer/bufmgr_refcnt.c +++ b/src/backend/storage/buffer/bufmgr_refcnt.c @@ -12,26 +12,15 @@ * than once by an individual backend. This mechanism is also used to track * whether this backend has a buffer locked, and, if so, in what mode. * - * To avoid - as we used to - requiring an array with NBuffers entries to keep - * track of local buffers, we use a small sequentially searched array - * (PrivateRefCountArrayKeys, with the corresponding data stored in - * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to - * keep track of backend local pins. + * To avoid requiring an array with NBuffers entries, we use a dynamically + * sized direct-mapped array (PrivateRefCountArray) and an overflow hash table + * (PrivateRefCountHash). Each buffer maps to exactly one array slot at + * index (buffer & mask). When a collision occurs (two different buffers + * map to the same slot), the existing entry is evicted to the hash table. + * The array grows automatically when occupation exceeds a specified threshold. * - * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all - * refcounts are kept track of in the array; after that, new array entries - * displace old ones into the hash table. That way a frequently used entry - * can't get "stuck" in the hashtable while infrequent ones clog the array. - * - * This was initially designed trying to optimize for the case where the - * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES. - * However this might not be the case with the introduction of prefetching. - * - * To enter a buffer into the refcount tracking mechanism first reserve a free - * entry using ReservePrivateRefCountEntry() and then later, if necessary, - * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing - * memory allocations in NewPrivateRefCountEntry() which can be important - * because in some scenarios it's called with a spinlock held... + * This design provides O(1) lookup for the common case where there are no + * collisions, while gracefully handling overflow via the hash table. * * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California @@ -86,33 +77,40 @@ struct PrivateRefCountIterator #define SH_DEFINE #include "lib/simplehash.h" -/* Private refcount array and keys */ -#define REFCOUNT_ARRAY_ENTRIES 8 -static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES]; -static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; +/* + * Private refcount array - direct-mapped by (buffer & mask). + * Each buffer maps to exactly one slot. Collisions evict to hash table. + * + * Consider an array with N slots, after storing n random buffers + * + * - The probability of a single buffer hitting a given slot is 1/N + * - The probability of a single buffer not hitting a given slot is 1-1/N + * - The probability of n buffers not hitting a given slot is (1-1/N)^n + * - the expected number of vacant slots is N * (1-1/N)^n ~ N * e^(-n/N) + * + * The last approximation should be not too bad for large number, + * and that gives e.g. 63% occupation for n = N, 86% ocupation for n = 2*N + */ +#define REFCOUNT_ARRAY_INITIAL_SIZE 8 +#define REFCOUNT_ARRAY_MAX_OCCUPATION 0.86 +#define REFCOUNT_ARRAY_MAX_SIZE (1 << 20) /* 1M entries max */ + +static struct PrivateRefCountEntry *PrivateRefCountArray = NULL; +static int32 PrivateRefCountArrayMask = 0; /* (size - 1) for fast modulo */ +static int32 PrivateRefCountArrayUsed = 0; /* entries in the array */ +static int32 PrivateRefCountArrayTolerated = 0; /* grow when Used exceeds this */ -/* Overflow hash table for when array is full */ +/* Overflow hash table for collisions */ static refcount_hash *PrivateRefCountHash = NULL; -/* Count of entries that have overflowed into the hash table */ +/* Count of entries in the hash table */ static int32 PrivateRefCountOverflowed = 0; -/* Clock hand for selecting victim when array is full */ -static uint32 PrivateRefCountClock = 0; - -/* Reserved slot index, or -1 if none reserved */ -static int ReservedRefCountSlot = -1; - -/* Cache for last accessed entry */ -static int PrivateRefCountEntryLast = -1; - /* Advisory limit on the number of pins each backend should hold */ static uint32 MaxProportionalPins = 0; -/* Forward declarations */ -static void ReservePrivateRefCountEntry(Buffer buffer); -static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); -static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer); +/* Forward declaration */ +static void GrowPrivateRefCountArray(void); /* * Initialize private refcount tracking for this backend. @@ -126,174 +124,135 @@ InitPrivateRefCount(void) * That's very pessimistic, but outside toy-sized shared_buffers it should * allow plenty of pins. LimitAdditionalPins() and * GetAdditionalPinLimit() can be used to check the remaining balance. - */ + */ MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS); - memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray)); - memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys)); - PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL); + /* Initialize the direct-mapped array */ + PrivateRefCountArrayMask = REFCOUNT_ARRAY_INITIAL_SIZE - 1; + PrivateRefCountArrayUsed = 0; + PrivateRefCountArrayTolerated = (int32)(REFCOUNT_ARRAY_INITIAL_SIZE * REFCOUNT_ARRAY_MAX_OCCUPATION); + PrivateRefCountArray = MemoryContextAllocZero(TopMemoryContext, + REFCOUNT_ARRAY_INITIAL_SIZE * sizeof(PrivateRefCountEntry)); + + PrivateRefCountHash = refcount_create(TopMemoryContext, 64, NULL); } /* - * Ensure that the PrivateRefCountArray has sufficient space to store one more - * entry. + * Grow the private refcount array when usage exceeds tolerated occupation. + * Doubles the size and rehashes all existing entries. */ static void -ReservePrivateRefCountEntry(Buffer buffer) +GrowPrivateRefCountArray(void) { - /* Already reserved (or freed), nothing to do */ - if (ReservedRefCountSlot != -1) + PrivateRefCountEntry *old_array = PrivateRefCountArray; + int32 old_size = PrivateRefCountArrayMask + 1; + int32 new_size = old_size * 2; + int32 new_mask = new_size - 1; + int32 i; + + /* Don't grow beyond maximum */ + if (new_size > REFCOUNT_ARRAY_MAX_SIZE) return; - /* - * First search for a free entry in the array, that'll be sufficient in - * the majority of cases. - */ - { - int i; - - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) - { - if (PrivateRefCountArrayKeys[i] == InvalidBuffer) - { - ReservedRefCountSlot = i; - return; - } - } - } - - /* - * No luck. All array entries are full. Move one array entry into the hash - * table. - */ - { - int victim_slot; - PrivateRefCountEntry *victim_entry; - PrivateRefCountEntry *hashent; - bool found; - - /* select victim slot */ - victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES; - victim_entry = &PrivateRefCountArray[victim_slot]; - ReservedRefCountSlot = victim_slot; - - /* Better be used, otherwise we shouldn't get here. */ - Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer); - Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer); - Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer); - - /* enter victim array entry into hashtable */ - hashent = refcount_insert(PrivateRefCountHash, - PrivateRefCountArrayKeys[victim_slot], - &found); - Assert(!found); - hashent->data = victim_entry->data; - - /* clear the now free array slot */ - PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer; - victim_entry->buffer = InvalidBuffer; - victim_entry->data = 0; - - PrivateRefCountOverflowed++; - } -} - -/* - * Create a new refcount entry for the given buffer. - */ -static PrivateRefCountEntry * -NewPrivateRefCountEntry(Buffer buffer) -{ - PrivateRefCountEntry *res; - - /* only allowed to be called when a reservation has been made */ - Assert(ReservedRefCountSlot != -1); - - /* use up the reserved entry */ - res = &PrivateRefCountArray[ReservedRefCountSlot]; - - /* and fill it */ - PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer; - res->buffer = buffer; - res->data = 0; - - /* update cache for the next lookup */ - PrivateRefCountEntryLast = ReservedRefCountSlot; - - ReservedRefCountSlot = -1; - - return res; -} - -/* - * Slow-path for GetSharedBufferEntry(). - */ -static pg_noinline PrivateRefCountEntry * -GetPrivateRefCountEntrySlow(Buffer buffer) -{ - PrivateRefCountEntry *res; - int i; + /* Allocate new array and update metadata */ + PrivateRefCountArray = MemoryContextAllocZero(TopMemoryContext, + new_size * sizeof(PrivateRefCountEntry)); + PrivateRefCountArrayMask = new_mask; + PrivateRefCountArrayTolerated = (int32)(new_size * REFCOUNT_ARRAY_MAX_OCCUPATION); /* - * First search for references in the array. + * Rehash entries from old array. When doubling, each old slot maps to + * two possible new slots (slot or slot + old_size), so entries from + * different old slots cannot collide. PrivateRefCountArrayUsed stays same. */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + for (i = 0; i < old_size; i++) { - if (PrivateRefCountArrayKeys[i] == buffer) + if (old_array[i].buffer != InvalidBuffer) { - PrivateRefCountEntryLast = i; - return &PrivateRefCountArray[i]; + int new_slot = old_array[i].buffer & new_mask; + + PrivateRefCountArray[new_slot] = old_array[i]; } } - /* - * Only look up the buffer in the hashtable if we've previously overflowed. - */ - if (PrivateRefCountOverflowed == 0) - return NULL; - - res = refcount_lookup(PrivateRefCountHash, buffer); - return res; + pfree(old_array); } /* * Return the PrivateRefCount entry for the passed buffer. * Returns NULL if the buffer is not currently pinned. + * + * With direct-mapped array, lookup is O(1): check the slot at + * (buffer & mask), then check hash table if needed. */ static PrivateRefCountEntry * GetSharedBufferEntry(Buffer buffer) { + int slot = buffer & PrivateRefCountArrayMask; + PrivateRefCountEntry *entry = &PrivateRefCountArray[slot]; + Assert(BufferIsValid(buffer)); Assert(!BufferIsLocal(buffer)); - /* Fast path: check one-entry cache */ - if (likely(PrivateRefCountEntryLast != -1) && - likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer)) - { - return &PrivateRefCountArray[PrivateRefCountEntryLast]; - } + /* Check the direct-mapped slot */ + if (entry->buffer == buffer) + return entry; - return GetPrivateRefCountEntrySlow(buffer); + /* Check hash table if there are overflowed entries */ + if (PrivateRefCountOverflowed > 0) + return refcount_lookup(PrivateRefCountHash, buffer); + + return NULL; } /* * Create a new refcount entry for a buffer that is known to not be pinned. - * This is a fast path that skips the cache/hash lookup. - * Returns the new entry pointer with refcount already incremented. + * Returns the new entry pointer with refcount set to 1. + * + * If the direct-mapped slot contains a different buffer, it is evicted + * to the hash table first. The array grows when overflow exceeds 20% of size. */ static PrivateRefCountEntry * SharedBufferCreateRef(Buffer buffer) { - PrivateRefCountEntry *ref; + int slot; + PrivateRefCountEntry *entry; + bool was_empty; Assert(BufferIsValid(buffer)); Assert(!BufferIsLocal(buffer)); - ReservePrivateRefCountEntry(buffer); - ref = NewPrivateRefCountEntry(buffer); - ref->data = ONE_PRIVATE_REFERENCE; + /* Grow when array usage exceeds tolerated occupation */ + if (PrivateRefCountArrayUsed > PrivateRefCountArrayTolerated) + GrowPrivateRefCountArray(); + + slot = buffer & PrivateRefCountArrayMask; + entry = &PrivateRefCountArray[slot]; + was_empty = (entry->buffer == InvalidBuffer); + + /* If slot contains a different buffer, evict it to hash table */ + if (!was_empty && entry->buffer != buffer) + { + PrivateRefCountEntry *hashent; + bool found; + + hashent = refcount_insert(PrivateRefCountHash, entry->buffer, &found); + Assert(!found); + hashent->data = entry->data; + + PrivateRefCountOverflowed++; + was_empty = true; + } + + /* Track array usage */ + if (was_empty) + PrivateRefCountArrayUsed++; + + /* Use the slot for our buffer */ + entry->buffer = buffer; + entry->data = ONE_PRIVATE_REFERENCE; - return ref; + return entry; } /* @@ -327,15 +286,15 @@ SharedBufferUnref(PrivateRefCountEntry *ref) Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK); if (ref >= &PrivateRefCountArray[0] && - ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]) + ref < &PrivateRefCountArray[PrivateRefCountArrayMask + 1]) { + /* Array entry - mark as empty */ ref->buffer = InvalidBuffer; - PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer; - ReservedRefCountSlot = ref - PrivateRefCountArray; + PrivateRefCountArrayUsed--; } else { - /* could make slightly more efficient by using the pointer */ + /* Hash table entry */ refcount_delete(PrivateRefCountHash, ref->buffer); Assert(PrivateRefCountOverflowed > 0); PrivateRefCountOverflowed--; @@ -392,9 +351,9 @@ CheckPrivateRefCountLeaks(void) char *s; /* check the array */ - for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++) + for (i = 0; i <= PrivateRefCountArrayMask; i++) { - if (PrivateRefCountArrayKeys[i] != InvalidBuffer) + if (PrivateRefCountArray[i].buffer != InvalidBuffer) { res = &PrivateRefCountArray[i]; @@ -475,11 +434,11 @@ PrivateRefCountEntry * GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter) { /* First iterate through the array */ - while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES) + while (!iter->in_hash && iter->array_index <= PrivateRefCountArrayMask) { int idx = iter->array_index++; - if (PrivateRefCountArrayKeys[idx] != InvalidBuffer) + if (PrivateRefCountArray[idx].buffer != InvalidBuffer) return &PrivateRefCountArray[idx]; } @@ -547,11 +506,9 @@ FreePrivateRefCountIterator(PrivateRefCountIterator *iter) uint32 estimated_pins_held; /* - * We get the number of "overflowed" pins for free, but don't know the - * number of pins in PrivateRefCountArray. The cost of calculating that - * exactly doesn't seem worth it, so just assume the max. + * We track array usage, so we can get an accurate count. */ - estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES; + estimated_pins_held = PrivateRefCountOverflowed + PrivateRefCountArrayUsed; /* Is this backend already holding more than its fair share? */ if (estimated_pins_held > MaxProportionalPins) -- 2.34.1 ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Addressing buffer private reference count scalability issue @ 2026-03-11 18:55 Andres Freund <[email protected]> parent: Alexandre Felipe <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Andres Freund @ 2026-03-11 18:55 UTC (permalink / raw) To: Alexandre Felipe <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]> Hi, On 2026-03-11 18:03:34 +0000, Alexandre Felipe wrote: > Hi Hackers, > > Please find attached version 2 of this patch and the results of the > ReadBuffer/ReleaseBuffer benchmark > > For those who prefer plain text here goes an excerpt > baseline patch > num_buffers pattern > 1 random 194.76 183.50 > sequential 183.75 168.02 > 2 random 204.74 187.02 > sequential 195.27 169.58 > 3 random 222.62 192.57 > sequential 212.81 169.55 > 7 random 233.96 219.64 > sequential 221.94 171.94 > 11 random 365.50 247.96 > sequential 318.24 183.96 > 19 random 390.58 267.96 > sequential 369.51 195.35 > 100 random 424.03 286.33 > sequential 420.74 280.23 > 988 random 445.75 301.37 > sequential 453.61 313.30 > 10000 random 587.13 418.91 > sequential 497.92 394.10 Nice numbers! It'd be good to also evaluate in the context of queries, as such a focused microbenchmark will have much higher L1/L2 cache hit ratios than workloads that actually look at data. > > > Here I assume that the buffer buffer sequence is independent enough from > > the > > > array size, so I use the buffer as the hash key directly, omitting a hash > > > function call. > > > > I doubt that that's good enough. The hash table code relies on the bits > > being > > well mixed, but you won't get that with buffer ids. > > > > Introduced murmurhash32, it is noticeably worse for sequential access > (by buffer index), but has some mixing. It's not surprising it's worse, I just don't see how we could get away without some mixing. It's not at all crazy to have accesss patterns that just differ in the higher bits of the buffer number (e.g. a nestloop join where the inner side always needs a certain number of buffers). We need to be somewhat resistant to that causing a lot of collisions (which will trigger a hashtable growth cycle, without that necessarily fixing the issue!). > > > I'm rather sceptical that the overhead of needing to shift and mask is > > worth > > it. I suspect we'll also want to add more state for each entry (e.g. I > > think > > it may be worth getting rid of the io-in-progress resowner). > > > > io-in-progress resowner name suggests that it runs only for reads... > so maybe it could be taken out of the way of buffer hits? It's not used for hits. But it's used for both reads and writes. > > > > REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array > > > lookup, it is worth trying to remove it completely and keep only the > > hash. > > > For small values we would be trading a few branches for a buffer % > > SIZE, > > > for the use case of prefetch where pin/unpin in a FIFO fashion, it > > will > > > save an 8-entry array lookup, and some extra data moves. > > > > I doubt that that's ok, in the vast majority of access we will have 0-2 > > buffers pinned. And even when we have pinned more buffers, it's > > *exceedingly* > > common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin), > > adding a few cycles to each of those repeated accesses will quickly show > > up. > > > > I brought back the array, but I eliminated the linear search. Why? In my benchmarks allowing vectorization helped a decent amount in real queries, because it does away with all the branch misses. > 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache. > > 2a. the dynamic array case > REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE > will grow the array when it reaches a certain level of occupation. > I have set the default occupation level to 86% so that, if enabled, for a > random input it will grow when we have about 2*size pins in total. > If we find a sequential pattern then it will grow without growing the hash > table. > For the array lookup I don't use a hash, so for small number of pins > it will be very fast. I doubt it makes sense to basically have two levels of hash tables. > 2b. the static case > REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE > will use a static array, just as we had before and will not perform the > linear > search. It still has to read the size and do mask input. > > I tested the 4 variations and the winner is with the static array without > the cache for the last entry. > I increased the array size from 8 to 32, since you suggested before that > that this could help. At that point it would have the tradeoff of a longer > linear search, so it may help even more now. Does your benchmark actually test repeated accesses to the same refcount? As mentioned, those are very very common (access sequences like Pin, (Lock, Unlock)+, Unpin are extremely common). > From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001 > From: Alexandre Felipe <[email protected]> > Date: Wed, 11 Mar 2026 12:05:50 +0000 > Subject: [PATCH 2/5] Refactoring reference counting > > This moves the private reference count logic out of > backend/storage/buffer/buffmgr.c that is still 8000+ > lines long. Preparing for the changes to come. I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as that does not *at all* communicate that this is backend private state. I'd strongly advise separating moving code from a large scale rename. I certainly won't waste time trying to see the difference between what was just moved and what was changed at the same time. > diff --git a/.gitignore b/.gitignore > index 4e911395fe..fddb7f861d 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -43,3 +43,5 @@ lib*.pc > /Release/ > /tmp_install/ > /portlock/ > + > +.* > \ No newline at end of file What? Certainly not. > From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001 > From: Alexandre Felipe <[email protected]> > Date: Wed, 11 Mar 2026 12:07:54 +0000 > Subject: [PATCH 3/5] Using simplehash > > This patch replaces the HTAB implementation with a simplehash > as suggested by Andres Freund [1]. The simplehash implements a templated > open addressing hash, which have entry size information at compile time. > > The access strategy of the simplehash is very close to the plain array > that was seen to be very efficient compared to the HTAB implementation, > with the additional advantage of using the key value to initialize the > search closer to where the key actually is, instead of always scanning > the entire array. > > Adds one macro to the simplehash implementation enable overriding the > empty position detection, to enable using buffer == InvalidBuffer > instead of requiring a separate status field. > > [1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5... > > --- > src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++----------- > src/include/lib/simplehash.h | 59 +++++++++----- > 2 files changed, 84 insertions(+), 64 deletions(-) I don't think it's a good idea to introduce new simplehash infrastructure as part of this larger change. You also haven't documented the new stuff. > From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001 > From: Alexandre Felipe <[email protected]> > Date: Wed, 11 Mar 2026 12:08:03 +0000 > Subject: [PATCH 4/5] Compact PrivateRefCountEntry > > The previous implementation used an 8-bytes (64-bit) entry to store > a uint32 count and an uint32 lock mode. That is fine when we store > the data separate from the key (buffer). But in the simplehash > {key, value} are stored together, so each entry is 12-bytes. > This is somewhat awkward as we have to either pad the entry to 16-bytes, > or the access will be an alternating aligned/misaligned addreses. > > Lock can assume only 4 values, and 2^30 is a decent limit for the > number of pins on a single buffer. So this change is packing the > {count[31:2], lock[1:0]} into a single uint32. > > Incrementing/decrementing the count continue the same, just using > 4 instead of 1, lock mode access will require one or two additional > bitwise operations. The exact count requires one shift, and is used > only for debugging. A special function is provided to check whether > count == 1. Have you actually evaluated the benefit from this? Pretty sceptical it's worth it. Greetings, Andres Freund ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Addressing buffer private reference count scalability issue @ 2026-03-11 22:41 Alexandre Felipe <[email protected]> parent: Andres Freund <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Alexandre Felipe @ 2026-03-11 22:41 UTC (permalink / raw) To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]> On Wed, Mar 11, 2026 at 6:55 PM Andres Freund <[email protected]> wrote: > Nice numbers! > > It'd be good to also evaluate in the context of queries, as such a focused > microbenchmark will have much higher L1/L2 cache hit ratios than workloads > that actually look at data > I ran with the query you first used to illustrate the issue, and it was slower. Went back and tried a strict copy of the code and it was 3% slower. passing CFLAGS += -falign-functions=64 this shows no degradation. Maybe LTO is what we need here? It's not surprising it's worse, I just don't see how we could get away > without > some mixing. I will keep that for the future. > It's not at all crazy to have accesss patterns that just differ > in the higher bits of the buffer number (e.g. a nestloop join where the > inner > side always needs a certain number of buffers). We need to be somewhat > resistant to that causing a lot of collisions (which will trigger a > hashtable > growth cycle, without that necessarily fixing the issue!). > Yes, I understand that possibility. > > I brought back the array, but I eliminated the linear search. > > Why? In my benchmarks allowing vectorization helped a decent amount in real > queries, because it does away with all the branch misses. I basically treated the array as a hash table with a weak hash function and delegated collisions to simplehash. In the worse case would do a simple array[buffer & mask], never fails with one branch, and would fail in ~3% of the cases with two branches. And would eliminate the branches necessary to check the 1-entry cache. But notice that this was the last patch, if you like everything except that it is just a matter of picking 02, 03, 04. > 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache. > > > > 2a. the dynamic array case > > REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE > > will grow the array when it reaches a certain level of occupation. > > I have set the default occupation level to 86% so that, if enabled, for a > > random input it will grow when we have about 2*size pins in total. > > If we find a sequential pattern then it will grow without growing the > hash > > table. > > For the array lookup I don't use a hash, so for small number of pins > > it will be very fast. > > I doubt it makes sense to basically have two levels of hash tables. > > > > 2b. the static case > > REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE > > will use a static array, just as we had before and will not perform the > > linear > > search. It still has to read the size and do mask input. > > > > I tested the 4 variations and the winner is with the static array without > > the cache for the last entry. > > I increased the array size from 8 to 32, since you suggested before that > > that this could help. At that point it would have the tradeoff of a > longer > > linear search, so it may help even more now. > > Does your benchmark actually test repeated accesses to the same refcount? > As > mentioned, those are very very common (access sequences like Pin, (Lock, > Unlock)+, Unpin are extremely common). Yes that is the case a FIFO with one buffer, I didn't include the lock unlock tests here. I did it a some point and it I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as > that does not *at all* communicate that this is backend private state. > I suspected that, GetEntryForPrivateReferenceCountOfSharedBuffer would be more accurate right? I probably will stick with the original names. > I'd strongly advise separating moving code from a large scale rename. I > certainly won't waste time trying to see the difference between what was > just > moved and what was changed at the same time. > Would you prefer not moving the code at all? One of the main reasons for this was the changes in data structure on the patch 04, that I will not include in the next version. > > diff --git a/.gitignore b/.gitignore > > index 4e911395fe..fddb7f861d 100644 > > --- a/.gitignore > > +++ b/.gitignore > > @@ -43,3 +43,5 @@ lib*.pc > > /Release/ > > /tmp_install/ > > /portlock/ > > + > > +.* > > \ No newline at end of file > > What? Certainly not. > Do you mean, we should certainly not exclude hidden files from git? I usually build with prefix postgresql/.build/patch-*/ Then whenever I checkout something I have to keep adding this again. I don't think it's a good idea to introduce new simplehash infrastructure as > part of this larger change. Do you think it is worth doing that as a separate patch? Then we get it out of the way on this that probably will go a few more versions? You also haven't documented the new stuff. Do you mean as source comments, or is there a separate documentation for this? > > The previous implementation used an 8-bytes (64-bit) entry to store > > a uint32 count and an uint32 lock mode. That is fine when we store > > the data separate from the key (buffer). But in the simplehash > > {key, value} are stored together, so each entry is 12-bytes. > > This is somewhat awkward as we have to either pad the entry to 16-bytes, > > or the access will be an alternating aligned/misaligned addreses. > > > > Lock can assume only 4 values, and 2^30 is a decent limit for the > > number of pins on a single buffer. So this change is packing the > > {count[31:2], lock[1:0]} into a single uint32. > > > > Incrementing/decrementing the count continue the same, just using > > 4 instead of 1, lock mode access will require one or two additional > > bitwise operations. The exact count requires one shift, and is used > > only for debugging. A special function is provided to check whether > > count == 1. > > Have you actually evaluated the benefit from this? Pretty sceptical it's > worth > it. > I tested, and I agree, not worth it from a speed perspective. At this point the only part left is the introduction of the simplehash. However what I will try is to store just the buffer number in the hash and keep another array for the entries, who knows that works better. ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Addressing buffer private reference count scalability issue @ 2026-03-15 02:37 Alexandre Felipe <[email protected]> parent: Alexandre Felipe <[email protected]> 0 siblings, 0 replies; 6+ messages in thread From: Alexandre Felipe @ 2026-03-15 02:37 UTC (permalink / raw) To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]> Hi Andres, > I don't think it's a good idea to introduce new simplehash infrastructure as > part of this larger change. I am submitting the change I did before on simplehash for empty entry detection. > You also haven't documented the new stuff. What and where I was supposed to document? 01 the change I did before, 02 I applied it to refcount in bufmgr. Exploring a bit more the code base, 03 removes status from nodeMemoize 04 remove status from pg_rewind/filemap.c 05 was a bit trickier because InvalidBlockNumber is not 0, then I had to make entries empty after allocation that uses memset 0 by default. I grepped for a list and there are 21 files in total so I will stop here. > In my benchmarks allowing vectorization helped a decent amount in real > queries, because it does away with all the branch misses. Interesting, did you compile with the default configuration? I used gcc11 (a bit old I know) and and yes, it remove some branches but still use a loop with cmove (conditional copy), in some cases it unrolls but still uses cmove for each entry (the machine I tested is quite feature rich e.g. avx512cd avx512bw avx512vl avx512_vnni. So I am wondering if the impact I see is not the same impact as you see. If we go for vectorisation we could do a vectorized loop e.g. 16 iterations on 16 x 32-bit vectors with early exit. but that would inevitably make the few entries case slightly slower. Do you have any other localised issue like this that could be worth looking into? Regards, Alexandre Attachments: [application/octet-stream] v3-0004-Use-null-pointer-as-an-empty-marker.patch (1.9K, 3-v3-0004-Use-null-pointer-as-an-empty-marker.patch) download | inline diff: From f457eb9c232f0e35143aa3ce37480b7b07148c53 Mon Sep 17 00:00:00 2001 From: Bob <[email protected]> Date: Sun, 15 Mar 2026 01:23:12 +0000 Subject: [PATCH 4/5] Use null pointer as an empty marker Here simplehash is being simply used as a Set NULL pointer is a natural empty marker. --- src/bin/pg_rewind/filemap.c | 7 ++++++- src/bin/pg_rewind/filemap.h | 2 -- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c index b79c47f925..574a282d84 100644 --- a/src/bin/pg_rewind/filemap.c +++ b/src/bin/pg_rewind/filemap.c @@ -45,6 +45,9 @@ #define SH_KEY path #define SH_HASH_KEY(tb, key) hash_string(key) #define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0) +#define SH_ENTRY_EMPTY(entry) ((entry)->path == NULL) +#define SH_MAKE_EMPTY(entry) ((entry)->path = NULL) +#define SH_MAKE_IN_USE(entry) ((void)0) #define SH_SCOPE static inline #define SH_RAW_ALLOCATOR pg_malloc0 #define SH_DECLARE @@ -68,7 +71,6 @@ static file_entry_t *lookup_filehash_entry(const char *path); typedef struct keepwal_entry { const char *path; - uint32 status; } keepwal_entry; #define SH_PREFIX keepwal @@ -77,6 +79,9 @@ typedef struct keepwal_entry #define SH_KEY path #define SH_HASH_KEY(tb, key) hash_string(key) #define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0) +#define SH_ENTRY_EMPTY(entry) ((entry)->path == NULL) +#define SH_MAKE_EMPTY(entry) ((entry)->path = NULL) +#define SH_MAKE_IN_USE(entry) ((void)0) #define SH_SCOPE static inline #define SH_RAW_ALLOCATOR pg_malloc0 #define SH_DECLARE diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h index 4c6dd8740d..b3e67be1da 100644 --- a/src/bin/pg_rewind/filemap.h +++ b/src/bin/pg_rewind/filemap.h @@ -56,8 +56,6 @@ typedef enum */ typedef struct file_entry_t { - uint32 status; /* hash status */ - const char *path; file_content_type_t content_type; -- 2.34.1 [application/octet-stream] v3-0001-Custom-simplehash-empty-value-detection.patch (6.1K, 4-v3-0001-Custom-simplehash-empty-value-detection.patch) download | inline diff: From db04956b20ba4324a0cd4d391a2d2e3e2a358831 Mon Sep 17 00:00:00 2001 From: Alexandre Felipe <[email protected]> Date: Sun, 15 Mar 2026 00:00:00 +0000 Subject: [PATCH 1/5] Custom simplehash empty value detection Changes the empty value identification in simplehash allowing custom values to be used. The default continues to use the status field. For types where the key value already has an "invalid" value, the macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY and SH_MAKE_IN_USE can be overridden to elliminate the need for a separate status field. --- src/include/lib/simplehash.h | 59 +++++++++++++++++++++++------------- 1 file changed, 38 insertions(+), 21 deletions(-) diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h index 848719232a..3c03a7e9c9 100644 --- a/src/include/lib/simplehash.h +++ b/src/include/lib/simplehash.h @@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb); #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey)) #endif +/* + * Macros to check/set entry status. Users can override these to avoid + * needing a separate status field if their key type has an "invalid" value. + */ +#ifndef SH_ENTRY_EMPTY +#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY) +#endif +#ifndef SH_MAKE_EMPTY +#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY) +#endif +#ifndef SH_MAKE_IN_USE +#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE) +#endif + /* * Wrap the following definitions in include guards, to avoid multiple * definition errors if this header is included more than once. The rest of @@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) uint32 hash; uint32 optimal; - if (oldentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(oldentry)) { startelem = i; break; @@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) { SH_ELEMENT_TYPE *oldentry = &olddata[copyelem]; - if (oldentry->status == SH_STATUS_IN_USE) + if (!SH_ENTRY_EMPTY(oldentry)) { uint32 hash; uint32 startelem2; @@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) { newentry = &newdata[curelem]; - if (newentry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(newentry)) { break; } @@ -653,14 +667,14 @@ restart: SH_ELEMENT_TYPE *entry = &data[curelem]; /* any empty bucket can directly be used */ - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) { tb->members++; entry->SH_KEY = key; #ifdef SH_STORE_HASH SH_GET_HASH(tb, entry) = hash; #endif - entry->status = SH_STATUS_IN_USE; + SH_MAKE_IN_USE(entry); *found = false; return entry; } @@ -675,7 +689,7 @@ restart: if (SH_COMPARE_KEYS(tb, hash, key, entry)) { - Assert(entry->status == SH_STATUS_IN_USE); + Assert(!SH_ENTRY_EMPTY(entry)); *found = true; return entry; } @@ -699,7 +713,7 @@ restart: emptyelem = SH_NEXT(tb, emptyelem, startelem); emptyentry = &data[emptyelem]; - if (emptyentry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(emptyentry)) { lastentry = emptyentry; break; @@ -748,7 +762,7 @@ restart: #ifdef SH_STORE_HASH SH_GET_HASH(tb, entry) = hash; #endif - entry->status = SH_STATUS_IN_USE; + SH_MAKE_IN_USE(entry); *found = false; return entry; } @@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash) { SH_ELEMENT_TYPE *entry = &tb->data[curelem]; - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) { return NULL; } - Assert(entry->status == SH_STATUS_IN_USE); + Assert(!SH_ENTRY_EMPTY(entry)); if (SH_COMPARE_KEYS(tb, hash, key, entry)) return entry; @@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) { SH_ELEMENT_TYPE *entry = &tb->data[curelem]; - if (entry->status == SH_STATUS_EMPTY) + if (SH_ENTRY_EMPTY(entry)) return false; - if (entry->status == SH_STATUS_IN_USE && + if (!SH_ENTRY_EMPTY(entry) && SH_COMPARE_KEYS(tb, hash, key, entry)) { SH_ELEMENT_TYPE *lastentry = entry; @@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) curelem = SH_NEXT(tb, curelem, startelem); curentry = &tb->data[curelem]; - if (curentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(curentry)) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key) /* current is at optimal position, done */ if (curoptimal == curelem) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry) curelem = SH_NEXT(tb, curelem, startelem); curentry = &tb->data[curelem]; - if (curentry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(curentry)) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry) /* current is at optimal position, done */ if (curoptimal == curelem) { - lastentry->status = SH_STATUS_EMPTY; + SH_MAKE_EMPTY(lastentry); break; } @@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter) { SH_ELEMENT_TYPE *entry = &tb->data[i]; - if (entry->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(entry)) { startelem = i; break; @@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter) if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask)) iter->done = true; - if (elem->status == SH_STATUS_IN_USE) + if (!SH_ENTRY_EMPTY(elem)) { return elem; } @@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb) elem = &tb->data[i]; - if (elem->status != SH_STATUS_IN_USE) + if (SH_ENTRY_EMPTY(elem)) continue; hash = SH_ENTRY_HASH(tb, elem); @@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb) #undef SH_STORE_HASH #undef SH_USE_NONDEFAULT_ALLOCATOR #undef SH_EQUAL +#undef SH_ENTRY_EMPTY +#undef SH_MAKE_EMPTY +#undef SH_MAKE_IN_USE /* undefine locally declared macros */ #undef SH_MAKE_PREFIX -- 2.34.1 [application/octet-stream] v3-0003-Use-NULL-key-as-empty-in-nodeMemoize.c.patch (1.3K, 5-v3-0003-Use-NULL-key-as-empty-in-nodeMemoize.c.patch) download | inline diff: From e091fd06217aae30cf73697156b22bc66c382445 Mon Sep 17 00:00:00 2001 From: Bob <[email protected]> Date: Sun, 15 Mar 2026 01:19:05 +0000 Subject: [PATCH 3/5] Use NULL key as empty in nodeMemoize.c MemoizeKey *key is describeed as Hash key for hash table lookups But it is not as one would expect. MemoizeHash_hash computes hash from tb->private_data ignoring the key Similarly Memoize_equal doesn't use key2, and uses probeslot from tb->private_data. At line 545 memoize_insert(mstate->hashtable, NULL, found); if not found we hold a pointer to an entry still empty in the simplehash, until we assign it a key (line 561) --- src/backend/executor/nodeMemoize.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c index fdca97d742..44154ad25a 100644 --- a/src/backend/executor/nodeMemoize.c +++ b/src/backend/executor/nodeMemoize.c @@ -143,6 +143,9 @@ static bool MemoizeHash_equal(struct memoize_hash *tb, #define SH_KEY key #define SH_HASH_KEY(tb, key) MemoizeHash_hash(tb, key) #define SH_EQUAL(tb, a, b) MemoizeHash_equal(tb, a, b) +#define SH_ENTRY_EMPTY(entry) ((entry)->key == NULL) +#define SH_MAKE_EMPTY(entry) ((entry)->key = NULL) +#define SH_MAKE_IN_USE(entry) ((void)0) #define SH_SCOPE static inline #define SH_STORE_HASH #define SH_GET_HASH(tb, a) a->hash -- 2.34.1 [application/octet-stream] v3-0005-Use-InvalidBlockNumber-as-empty-marker.patch (5.2K, 6-v3-0005-Use-InvalidBlockNumber-as-empty-marker.patch) download | inline diff: From 210916bcc236856e63b258a03573da5319b3e244 Mon Sep 17 00:00:00 2001 From: Bob <[email protected]> Date: Sun, 15 Mar 2026 01:48:09 +0000 Subject: [PATCH 5/5] Use InvalidBlockNumber as empty marker This one required an update on simplehash implementation InvalidBlockNumber defined as ((BlockNumber) 0xFFFFFFFF) in ./src/include/storage/block.h by default SimpleHash simply zeroes the memory and that makes everything empty. For this case we have to call SH_MAKE_EMPTY on each entry after allocating. Removing the status field also a the hacks old_status = entry->status modify entry, corrupts status entry->status = old_status where the status initialized by simplehash had to be saved and restored when updating the entry. --- src/backend/nodes/tidbitmap.c | 19 ++++++------------- src/include/lib/simplehash.h | 22 ++++++++++++++++++++++ 2 files changed, 28 insertions(+), 13 deletions(-) diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index f1f925cb13..207f27f0ca 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -92,7 +92,6 @@ typedef struct PagetableEntry { BlockNumber blockno; /* page number (hashtable key) */ - char status; /* hash entry status */ bool ischunk; /* T = lossy storage, F = exact */ bool recheck; /* should the tuples be rechecked? */ bitmapword words[Max(WORDS_PER_PAGE, WORDS_PER_CHUNK)]; @@ -237,6 +236,12 @@ static int tbm_shared_comparator(const void *left, const void *right, #define SH_KEY blockno #define SH_HASH_KEY(tb, key) murmurhash32(key) #define SH_EQUAL(tb, a, b) a == b +#define SH_ENTRY_EMPTY(entry) ((entry)->blockno == InvalidBlockNumber) +#define SH_MAKE_EMPTY(entry) ((entry)->blockno = InvalidBlockNumber) +#define SH_MAKE_IN_USE(entry) ((void)0) +// Since the empty marker is non-zero, we need to reset the entries +// after allocation using the custom SH_MAKE_EMPTY macro. +#define SH_NONZERO_EMPTY #define SH_SCOPE static inline #define SH_DEFINE #define SH_DECLARE @@ -291,15 +296,12 @@ tbm_create_pagetable(TIDBitmap *tbm) { PagetableEntry *page; bool found; - char oldstatus; page = pagetable_insert(tbm->pagetable, tbm->entry1.blockno, &found); Assert(!found); - oldstatus = page->status; memcpy(page, &tbm->entry1, sizeof(PagetableEntry)); - page->status = oldstatus; } tbm->status = TBM_HASH; @@ -1230,10 +1232,7 @@ tbm_get_pageentry(TIDBitmap *tbm, BlockNumber pageno) /* Initialize it if not present before */ if (!found) { - char oldstatus = page->status; - MemSet(page, 0, sizeof(PagetableEntry)); - page->status = oldstatus; page->blockno = pageno; /* must count it too */ tbm->nentries++; @@ -1317,10 +1316,7 @@ tbm_mark_page_lossy(TIDBitmap *tbm, BlockNumber pageno) /* Initialize it if not present before */ if (!found) { - char oldstatus = page->status; - MemSet(page, 0, sizeof(PagetableEntry)); - page->status = oldstatus; page->blockno = chunk_pageno; page->ischunk = true; /* must count it too */ @@ -1329,11 +1325,8 @@ tbm_mark_page_lossy(TIDBitmap *tbm, BlockNumber pageno) } else if (!page->ischunk) { - char oldstatus = page->status; - /* chunk header page was formerly non-lossy, make it lossy */ MemSet(page, 0, sizeof(PagetableEntry)); - page->status = oldstatus; page->blockno = chunk_pageno; page->ischunk = true; /* we assume it had some tuple bit(s) set, so mark it lossy */ diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h index 3c03a7e9c9..c890a09b76 100644 --- a/src/include/lib/simplehash.h +++ b/src/include/lib/simplehash.h @@ -301,6 +301,12 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb); #define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE) #endif +/* + * If the empty marker is non-zero (e.g., InvalidBlockNumber = 0xFFFFFFFF), + * define SH_NONZERO_EMPTY to explicitly initialize entries. When unset, + * zero-initialization via memset is sufficient (the default). + */ + /* * Wrap the following definitions in include guards, to avoid multiple * definition errors if this header is included more than once. The rest of @@ -481,6 +487,11 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data) tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * size); +#ifdef SH_NONZERO_EMPTY + for (uint64 i = 0; i < size; i++) + SH_MAKE_EMPTY(&tb->data[i]); +#endif + SH_UPDATE_PARAMETERS(tb, size); return tb; } @@ -497,7 +508,12 @@ SH_DESTROY(SH_TYPE * tb) SH_SCOPE void SH_RESET(SH_TYPE * tb) { +#ifdef SH_NONZERO_EMPTY + for (uint32 i = 0; i < tb->size; i++) + SH_MAKE_EMPTY(&tb->data[i]); +#else memset(tb->data, 0, sizeof(SH_ELEMENT_TYPE) * tb->size); +#endif tb->members = 0; } @@ -526,6 +542,11 @@ SH_GROW(SH_TYPE * tb, uint64 newsize) tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * newsize); +#ifdef SH_NONZERO_EMPTY + for (uint64 j = 0; j < newsize; j++) + SH_MAKE_EMPTY(&tb->data[j]); +#endif + /* * Update parameters for new table after allocation succeeds to avoid * inconsistent state on OOM. @@ -1222,6 +1243,7 @@ SH_STAT(SH_TYPE * tb) #undef SH_ENTRY_EMPTY #undef SH_MAKE_EMPTY #undef SH_MAKE_IN_USE +#undef SH_NONZERO_EMPTY /* undefine locally declared macros */ #undef SH_MAKE_PREFIX -- 2.34.1 [application/octet-stream] v3-0002-Use-InvalidBuffer-to-indicate-empty-slot.patch (1.5K, 7-v3-0002-Use-InvalidBuffer-to-indicate-empty-slot.patch) download | inline diff: From 056b123ae41d8ee8da5fdf5d4ee9cd886a52b2d9 Mon Sep 17 00:00:00 2001 From: Bob <[email protected]> Date: Sun, 15 Mar 2026 00:13:55 +0000 Subject: [PATCH 2/5] Use InvalidBuffer to indicate empty slot Simple hash requires a mechanism to distinguish empty slots. The previous implementation of refcount simplehash was using the default `status` field, adding a char to the PrivateRefCountEntr. The buffer already has an reserved value InvalidBuffer that can be used to mark an entry as empty. Making use of that removes one field from PrivateRefCountEntry and keep it 32bit aligned, without padding required. --- src/backend/storage/buffer/bufmgr.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 00bc609529..db911327f9 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -125,8 +125,6 @@ typedef struct PrivateRefCountEntry */ Buffer buffer; - char status; - PrivateRefCountData data; } PrivateRefCountEntry; @@ -136,6 +134,9 @@ typedef struct PrivateRefCountEntry #define SH_KEY buffer #define SH_HASH_KEY(tb, key) murmurhash32((uint32) (key)) #define SH_EQUAL(tb, a, b) ((a) == (b)) +#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer) +#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer) +#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment implies in use */ #define SH_SCOPE static inline #define SH_DECLARE #define SH_DEFINE -- 2.34.1 ^ permalink raw reply [nested|flat] 6+ messages in thread
end of thread, other threads:[~2026-03-15 02:37 UTC | newest] Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-03-08 16:09 Addressing buffer private reference count scalability issue Alexandre Felipe <[email protected]> 2026-03-08 17:09 ` Andres Freund <[email protected]> 2026-03-11 18:03 ` Alexandre Felipe <[email protected]> 2026-03-11 18:55 ` Andres Freund <[email protected]> 2026-03-11 22:41 ` Alexandre Felipe <[email protected]> 2026-03-15 02:37 ` Alexandre Felipe <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox