Addressing buffer private reference count scalability issue

public inbox for [email protected]  
help / color / mirror / Atom feed

Addressing buffer private reference count scalability issue
6+ messages / 2 participants
[nested] [flat]

* Addressing buffer private reference count scalability issue
@ 2026-03-08 16:09  Alexandre Felipe <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Alexandre Felipe @ 2026-03-08 16:09 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>

Hi Hackers,

This patch addresses a performance issue pointed out by Andres Freund,


   1.

   Benchmark buffer pinning: You know benchmark code, implemented a few
   functions that can be use in postgres queries, and a python script that
   runs them and produces CSV files and SVG plots for the current build.
   2.

   Refactoring reference counting: Before starting to change code and
   potentially breaking things I considered prudent to isolate it to limit the
   damage. This code was part of a 8k+ LOC file.
   3.

   Using simplehash: Simply replacing the HTAB for a simplehash, and adding
   a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE. To allow
   using the InvalidBuffer special value instead of allocating extra space for
   a validity flag. Here I assume that the buffer buffer sequence is
   independent enough from the array size, so I use the buffer as the hash key
   directly, omitting a hash function call.
   4.

   Compact PrivateRefCountEntry: The original implementation used a 4-byte
   key and 8-byte value. Reference count uses 32 bits, and it is unreasonable
   to expect one backend to pin the same buffer 1 billion times. The lock mode
   uses 32 bits but can only assume 4 values. So I packed them in one single
   uint32, giving 30 bits for count and 2 bits for lock mode. This makes the
   entries 8-byte long, on 64-bit CPUs this represents more than a 1/3
   reduction in memory. This makes the array aligned with the 64-bit words,
   copying one entry can be completed in one instruction, and every entry will
   be aligned.
   5.

   REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
   lookup, it is worth trying to remove it completely and keep only the hash.
   For small values we would be trading a few branches for a buffer % SIZE,
   for the use case of prefetch where pin/unpin in a FIFO fashion, it will
   save an 8-entry array lookup, and some extra data moves.

In addition to the patch I am including

- A bash script to apply and benchmark the patches sequentially. You might
have to adjust REPO_ROOT, in my case it gets it relative to the script
path, that is under $REPO_ROOT/.patches/pins/.
- A compare-patches.py script that can be copied to
src/test/modules/test_buffer_pin to process the benchmark CSV in figures
showing one metric for different patches instead of different metrics for
one patch as the benchmark.py produces.
- A nicely formatted post about this [2]

Regards,
Alexandre

[1]
https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5...
[2] https://afelipe.hashnode.dev/postgres-backend-buffer-pinning-algorithm


Attachments:

  [application/octet-stream] v1-0003-Using-simplehash.patch (12.8K, 3-v1-0003-Using-simplehash.patch)
  download | inline diff:
From 077520420223d3bc14c9f7b073c15021aae20388 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 16:55:43 +0000
Subject: [PATCH 3/5] Using simplehash

This patch replaces the HTAB implementation with a simplehash
as suggested by Andres Freund [1]. The simplehash implements a templated
open addressing hash, which have entry size information at compile time.

The access strategy of the simplehash is very close to the plain array
that was seen to be very efficient compared to the HTAB implementation,
with the additional advantage of using the key value to initialize the
search closer to where the key actually is, instead of always scanning
the entire array.

---
 src/backend/storage/buffer/buf_refcount.c | 86 +++++++++++------------
 src/include/lib/simplehash.h              | 59 ++++++++++------
 2 files changed, 81 insertions(+), 64 deletions(-)

diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
index 1c0bec29c93..ff37355a61e 100644
--- a/src/backend/storage/buffer/buf_refcount.c
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -40,10 +40,10 @@
  */
 #include "postgres.h"
 
+#include "common/hashfn.h"
 #include "storage/buf_internals.h"
 #include "storage/buf_refcount.h"
 #include "storage/proc.h"
-#include "utils/hsearch.h"
 
 
 
@@ -55,15 +55,36 @@ typedef struct PrivateRefCountData
 
 struct PrivateRefCountEntry
 {
-	Buffer		buffer;
+	Buffer		buffer;		/* hash key - InvalidBuffer means empty */
 	PrivateRefCountData data;
 };
 
+/*
+ * Define simplehash parameters for the overflow hash table.
+ * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
+ * a separate status field.
+ */
+#define SH_PREFIX refcount
+#define SH_ELEMENT_TYPE PrivateRefCountEntry
+#define SH_KEY_TYPE Buffer
+#define SH_KEY buffer
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+/* Use buffer field as empty marker - no separate status needed */
+#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
+#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
+#define SH_MAKE_IN_USE(entry) ((void)0)  /* key assignment handles this */
+#include "lib/simplehash.h"
+
 struct PrivateRefCountIterator
 {
 	int			array_index;
 	bool		in_hash;
-	HASH_SEQ_STATUS *hash_status;
+	refcount_iterator hash_iter;
+	bool		hash_iter_valid;
 };
 
 /* Private refcount array and keys */
@@ -72,7 +93,7 @@ static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
 static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
 
 /* Overflow hash table for when array is full */
-static HTAB *PrivateRefCountHash = NULL;
+static refcount_hash *PrivateRefCountHash = NULL;
 
 /* Count of entries that have overflowed into the hash table */
 static int32 PrivateRefCountOverflowed = 0;
@@ -100,26 +121,18 @@ static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buff
 void
 InitPrivateRefCount(void)
 {
-	HASHCTL		hash_ctl;
-
-
 	/*
 	 * An advisory limit on the number of pins each backend should hold, based
 	 * on shared_buffers and the maximum number of connections possible.
 	 * That's very pessimistic, but outside toy-sized shared_buffers it should
 	 * allow plenty of pins.  LimitAdditionalPins() and
 	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	 */
-	 MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
+	*/
+	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	hash_ctl.keysize = sizeof(Buffer);
-	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
-
-	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
-									  HASH_ELEM | HASH_BLOBS);
+	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
 }
 
 /*
@@ -173,10 +186,9 @@ ReservePrivateRefCountEntry(void)
 		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
 
 		/* enter victim array entry into hashtable */
-		hashent = hash_search(PrivateRefCountHash,
-							  &PrivateRefCountArrayKeys[victim_slot],
-							  HASH_ENTER,
-							  &found);
+		hashent = refcount_insert(PrivateRefCountHash,
+								  PrivateRefCountArrayKeys[victim_slot],
+								  &found);
 		Assert(!found);
 		hashent->data = victim_entry->data;
 
@@ -253,7 +265,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	if (PrivateRefCountOverflowed == 0)
 		return NULL;
 
-	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+	res = refcount_lookup(PrivateRefCountHash, buffer);
 
 	if (res == NULL)
 		return NULL;
@@ -264,7 +276,6 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	else
 	{
 		/* move buffer from hashtable into the free array slot */
-		bool		found;
 		PrivateRefCountEntry *free;
 
 		ReservePrivateRefCountEntry();
@@ -280,8 +291,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 
 		ReservedRefCountSlot = -1;
 
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
+		refcount_delete_item(PrivateRefCountHash, res);
 		Assert(PrivateRefCountOverflowed > 0);
 		PrivateRefCountOverflowed--;
 
@@ -384,11 +394,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		}
 		else
 		{
-			bool		found;
-			Buffer		buffer = ref->buffer;
-
-			hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-			Assert(found);
+			refcount_delete_item(PrivateRefCountHash, ref);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
 		}
@@ -456,10 +462,10 @@ CheckPrivateRefCountLeaks(void)
 	/* if necessary search the hash */
 	if (PrivateRefCountOverflowed)
 	{
-		HASH_SEQ_STATUS hstat;
+		refcount_iterator iter;
 
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+		refcount_start_iterate(PrivateRefCountHash, &iter);
+		while ((res = refcount_iterate(PrivateRefCountHash, &iter)) != NULL)
 		{
 			s = DebugPrintBufferRefcount(res->buffer);
 			elog(WARNING, "buffer refcount leak: %s", s);
@@ -482,7 +488,7 @@ InitPrivateRefCountIterator(void)
 
 	iter->array_index = 0;
 	iter->in_hash = false;
-	iter->hash_status = NULL;
+	iter->hash_iter_valid = false;
 	return iter;
 }
 
@@ -508,21 +514,20 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		iter->in_hash = true;
 		if (PrivateRefCountOverflowed > 0)
 		{
-			iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
-			hash_seq_init(iter->hash_status, PrivateRefCountHash);
+			refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
+			iter->hash_iter_valid = true;
 		}
 	}
 
-	if (iter->hash_status != NULL)
+	if (iter->hash_iter_valid)
 	{
 		PrivateRefCountEntry *res;
 
-		res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+		res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
 		if (res != NULL)
 			return res;
 
-		pfree(iter->hash_status);
-		iter->hash_status = NULL;
+		iter->hash_iter_valid = false;
 	}
 
 	return NULL;
@@ -534,11 +539,6 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 void
 FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 {
-	if (iter->hash_status != NULL)
-	{
-		hash_seq_term(iter->hash_status);
-		pfree(iter->hash_status);
-	}
 	pfree(iter);
 }
 
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 848719232a4..3c03a7e9c9b 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
 #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey))
 #endif
 
+/*
+ * Macros to check/set entry status. Users can override these to avoid
+ * needing a separate status field if their key type has an "invalid" value.
+ */
+#ifndef SH_ENTRY_EMPTY
+#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_EMPTY
+#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_IN_USE
+#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE)
+#endif
+
 /*
  * Wrap the following definitions in include guards, to avoid multiple
  * definition errors if this header is included more than once.  The rest of
@@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 		uint32		hash;
 		uint32		optimal;
 
-		if (oldentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(oldentry))
 		{
 			startelem = i;
 			break;
@@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	{
 		SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
 
-		if (oldentry->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(oldentry))
 		{
 			uint32		hash;
 			uint32		startelem2;
@@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 			{
 				newentry = &newdata[curelem];
 
-				if (newentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(newentry))
 				{
 					break;
 				}
@@ -653,14 +667,14 @@ restart:
 		SH_ELEMENT_TYPE *entry = &data[curelem];
 
 		/* any empty bucket can directly be used */
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			tb->members++;
 			entry->SH_KEY = key;
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -675,7 +689,7 @@ restart:
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
-			Assert(entry->status == SH_STATUS_IN_USE);
+			Assert(!SH_ENTRY_EMPTY(entry));
 			*found = true;
 			return entry;
 		}
@@ -699,7 +713,7 @@ restart:
 				emptyelem = SH_NEXT(tb, emptyelem, startelem);
 				emptyentry = &data[emptyelem];
 
-				if (emptyentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(emptyentry))
 				{
 					lastentry = emptyentry;
 					break;
@@ -748,7 +762,7 @@ restart:
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			return NULL;
 		}
 
-		Assert(entry->status == SH_STATUS_IN_USE);
+		Assert(!SH_ENTRY_EMPTY(entry));
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 			return entry;
@@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 			return false;
 
-		if (entry->status == SH_STATUS_IN_USE &&
+		if (!SH_ENTRY_EMPTY(entry) &&
 			SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
 			SH_ELEMENT_TYPE *lastentry = entry;
@@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				curelem = SH_NEXT(tb, curelem, startelem);
 				curentry = &tb->data[curelem];
 
-				if (curentry->status != SH_STATUS_IN_USE)
+				if (SH_ENTRY_EMPTY(curentry))
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				/* current is at optimal position, done */
 				if (curoptimal == curelem)
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		curelem = SH_NEXT(tb, curelem, startelem);
 		curentry = &tb->data[curelem];
 
-		if (curentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(curentry))
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		/* current is at optimal position, done */
 		if (curoptimal == curelem)
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[i];
 
-		if (entry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			startelem = i;
 			break;
@@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 
 		if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
 			iter->done = true;
-		if (elem->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(elem))
 		{
 			return elem;
 		}
@@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb)
 
 		elem = &tb->data[i];
 
-		if (elem->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(elem))
 			continue;
 
 		hash = SH_ENTRY_HASH(tb, elem);
@@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_STORE_HASH
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
+#undef SH_ENTRY_EMPTY
+#undef SH_MAKE_EMPTY
+#undef SH_MAKE_IN_USE
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
-- 
2.53.0



  [application/octet-stream] v1-0004-Compact-PrivateRefCountEntry.patch (10.8K, 4-v1-0004-Compact-PrivateRefCountEntry.patch)
  download | inline diff:
From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 17:15:44 +0000
Subject: [PATCH 4/5] Compact PrivateRefCountEntry

The previous implementation used an 8-bite (64-bit) entry to store
a uint32 count and an uint32 lock mode. That is fine when we store
the data separate from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12-bytes.
This is somewhat awkward as we have to either pad the entry to 16-bytes,
or the access will be an alternating aligned/misaligned addreses.

However, we are probably not going to need even 16-bits for the count
and 2-bits is enough for the lock mode. So we can easily pack these
int to a single uint32.

Incrementing/decrementing the count continue the same, just using
4 instead of 1, lock mode access will require one or two additional
bitwise operations.

No bit-shifts are required.

---
 src/backend/storage/buffer/buf_refcount.c | 167 +++++++++-------------
 1 file changed, 70 insertions(+), 97 deletions(-)

diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
index ff37355a61e..29dfb720997 100644
--- a/src/backend/storage/buffer/buf_refcount.c
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -40,53 +40,54 @@
  */
 #include "postgres.h"
 
-#include "common/hashfn.h"
 #include "storage/buf_internals.h"
 #include "storage/buf_refcount.h"
 #include "storage/proc.h"
 
 
 
-typedef struct PrivateRefCountData
+/*
+ * Implementation details - opaque to callers.
+ * Packed refcount and lockmode in a single uint32:
+ *   Bits 0-1:  lockmode (4 values: UNLOCK, SHARE, SHARE_EXCLUSIVE, EXCLUSIVE)
+ *   Bits 2-31: refcount (30 bits, max ~1 billion)
+ */
+struct PrivateRefCountEntry
 {
-	int32		refcount;
-	BufferLockMode lockmode;
-} PrivateRefCountData;
+	Buffer		buffer;
+	uint32		data;
+};
 
-struct PrivateRefCountEntry
+#define PRIVATEREFCOUNT_LOCKMODE_MASK	0x3
+#define ONE_PRIVATE_REFERENCE			4	/* 1 << 2 */
+
+#define PrivateRefCountGetLockMode(d)	((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
+#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
+#define PrivateRefCountGetRefCount(d)	((int32)((d) >> 2))
+#define PrivateRefCountIsZero(d)		((d) < ONE_PRIVATE_REFERENCE)
+
+struct PrivateRefCountIterator
 {
-	Buffer		buffer;		/* hash key - InvalidBuffer means empty */
-	PrivateRefCountData data;
+	int			array_index;
+	bool		in_hash;
+	void	   *hash_status;
 };
 
-/*
- * Define simplehash parameters for the overflow hash table.
- * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
- * a separate status field.
- */
+/* Define simplehash for private refcount overflow hash table */
 #define SH_PREFIX refcount
 #define SH_ELEMENT_TYPE PrivateRefCountEntry
 #define SH_KEY_TYPE Buffer
 #define SH_KEY buffer
-#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_HASH_KEY(tb, key) ((uint32) (key))
 #define SH_EQUAL(tb, a, b) ((a) == (b))
 #define SH_SCOPE static inline
-#define SH_DEFINE
-#define SH_DECLARE
-/* Use buffer field as empty marker - no separate status needed */
 #define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
 #define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
-#define SH_MAKE_IN_USE(entry) ((void)0)  /* key assignment handles this */
+#define SH_MAKE_IN_USE(entry) ((void) 0)
+#define SH_DECLARE
+#define SH_DEFINE
 #include "lib/simplehash.h"
 
-struct PrivateRefCountIterator
-{
-	int			array_index;
-	bool		in_hash;
-	refcount_iterator hash_iter;
-	bool		hash_iter_valid;
-};
-
 /* Private refcount array and keys */
 #define REFCOUNT_ARRAY_ENTRIES 8
 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
@@ -113,7 +114,7 @@ static uint32 MaxProportionalPins = 0;
 /* Forward declarations */
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
 
 /*
  * Initialize private refcount tracking for this backend.
@@ -132,7 +133,7 @@ InitPrivateRefCount(void)
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
+	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
 }
 
 /*
@@ -158,6 +159,11 @@ ReservePrivateRefCountEntry(void)
 			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
 			{
 				ReservedRefCountSlot = i;
+
+				/*
+				 * We could return immediately, but iterating till the end of
+				 * the array allows compiler-autovectorization.
+				 */
 			}
 		}
 
@@ -195,10 +201,7 @@ ReservePrivateRefCountEntry(void)
 		/* clear the now free array slot */
 		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
 		victim_entry->buffer = InvalidBuffer;
-
-		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
-		victim_entry->data.refcount = 0;
-		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+		victim_entry->data = 0;
 
 		PrivateRefCountOverflowed++;
 	}
@@ -221,8 +224,7 @@ NewPrivateRefCountEntry(Buffer buffer)
 	/* and fill it */
 	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
 	res->buffer = buffer;
-	res->data.refcount = 0;
-	res->data.lockmode = BUFFER_LOCK_UNLOCK;
+	res->data = 0;
 
 	/* update cache for the next lookup */
 	PrivateRefCountEntryLast = ReservedRefCountSlot;
@@ -236,7 +238,7 @@ NewPrivateRefCountEntry(Buffer buffer)
  * Slow-path for GetSharedBufferEntry().
  */
 static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+GetPrivateRefCountEntrySlow(Buffer buffer)
 {
 	PrivateRefCountEntry *res;
 	int			match = -1;
@@ -266,41 +268,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 		return NULL;
 
 	res = refcount_lookup(PrivateRefCountHash, buffer);
-
-	if (res == NULL)
-		return NULL;
-	else if (!do_move)
-	{
-		return res;
-	}
-	else
-	{
-		/* move buffer from hashtable into the free array slot */
-		PrivateRefCountEntry *free;
-
-		ReservePrivateRefCountEntry();
-
-		Assert(ReservedRefCountSlot != -1);
-		free = &PrivateRefCountArray[ReservedRefCountSlot];
-		Assert(free->buffer == InvalidBuffer);
-
-		free->buffer = buffer;
-		free->data = res->data;
-		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-		PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-		ReservedRefCountSlot = -1;
-
-		refcount_delete_item(PrivateRefCountHash, res);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-
-		return free;
-	}
+	return res;
 }
 
 /*
- * Return the PrivateRefCountEntry for the passed buffer.
+ * Return the PrivateRefCount entry for the passed buffer.
  * Returns NULL if the buffer is not currently pinned.
  */
 PrivateRefCountEntry *
@@ -316,7 +288,7 @@ GetSharedBufferEntry(Buffer buffer)
 		return &PrivateRefCountArray[PrivateRefCountEntryLast];
 	}
 
-	return GetPrivateRefCountEntrySlow(buffer, false);
+	return GetPrivateRefCountEntrySlow(buffer);
 }
 
 /*
@@ -332,25 +304,20 @@ SharedBufferRef(Buffer buffer)
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	/* Check cache first, then slow path */
-	if (likely(PrivateRefCountEntryLast != -1) &&
-		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
-	{
-		ref = &PrivateRefCountArray[PrivateRefCountEntryLast];
-	}
-	else
-	{
-		ref = GetPrivateRefCountEntrySlow(buffer, true);
-	}
+	ref = GetSharedBufferEntry(buffer);
 
 	if (ref == NULL)
 	{
 		/* New pin - create entry */
 		ReservePrivateRefCountEntry();
 		ref = NewPrivateRefCountEntry(buffer);
+		ref->data = ONE_PRIVATE_REFERENCE;
+	}
+	else
+	{
+		/* Already pinned - increment */
+		ref->data += ONE_PRIVATE_REFERENCE;
 	}
-
-	ref->data.refcount++;
 
 	return ref;
 }
@@ -363,8 +330,8 @@ void
 SharedBufferRefExisting(PrivateRefCountEntry *ref)
 {
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
-	ref->data.refcount++;
+	Assert(!PrivateRefCountIsZero(ref->data));
+	ref->data += ONE_PRIVATE_REFERENCE;
 }
 
 /*
@@ -376,14 +343,14 @@ bool
 SharedBufferUnref(PrivateRefCountEntry *ref)
 {
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
+	Assert(!PrivateRefCountIsZero(ref->data));
 
-	ref->data.refcount--;
+	ref->data -= ONE_PRIVATE_REFERENCE;
 
-	if (ref->data.refcount == 0)
+	if (PrivateRefCountIsZero(ref->data))
 	{
 		/* No more references - clean up the entry */
-		Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+		Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
 
 		if (ref >= &PrivateRefCountArray[0] &&
 			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
@@ -394,7 +361,8 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		}
 		else
 		{
-			refcount_delete_item(PrivateRefCountHash, ref);
+			/* could make slightly more efficient by using the pointer */
+			refcount_delete(PrivateRefCountHash, ref->buffer);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
 		}
@@ -406,24 +374,24 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 }
 
 /*
- * Accessors for refcount entry fields.
+ * Accessors for refcount entry fields - opaque to callers.
  */
 int32
 SharedBufferRefCount(PrivateRefCountEntry *ref)
 {
-	return ref->data.refcount;
+	return PrivateRefCountGetRefCount(ref->data);
 }
 
 BufferLockMode
 SharedBufferGetLockMode(PrivateRefCountEntry *ref)
 {
-	return ref->data.lockmode;
+	return PrivateRefCountGetLockMode(ref->data);
 }
 
 void
 SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
 {
-	ref->data.lockmode = mode;
+	PrivateRefCountSetLockMode(ref->data, mode);
 }
 
 Buffer
@@ -488,7 +456,7 @@ InitPrivateRefCountIterator(void)
 
 	iter->array_index = 0;
 	iter->in_hash = false;
-	iter->hash_iter_valid = false;
+	iter->hash_status = NULL;
 	return iter;
 }
 
@@ -514,20 +482,23 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		iter->in_hash = true;
 		if (PrivateRefCountOverflowed > 0)
 		{
-			refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
-			iter->hash_iter_valid = true;
+			refcount_iterator *hiter = palloc(sizeof(refcount_iterator));
+
+			refcount_start_iterate(PrivateRefCountHash, hiter);
+			iter->hash_status = hiter;
 		}
 	}
 
-	if (iter->hash_iter_valid)
+	if (iter->hash_status != NULL)
 	{
 		PrivateRefCountEntry *res;
 
-		res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
+		res = refcount_iterate(PrivateRefCountHash, (refcount_iterator *) iter->hash_status);
 		if (res != NULL)
 			return res;
 
-		iter->hash_iter_valid = false;
+		pfree(iter->hash_status);
+		iter->hash_status = NULL;
 	}
 
 	return NULL;
@@ -539,6 +510,8 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 void
 FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 {
+	if (iter->hash_status != NULL)
+		pfree(iter->hash_status);
 	pfree(iter);
 }
 
-- 
2.53.0



  [application/octet-stream] v1-0002-Refactoring-reference-counting.patch (50.6K, 5-v1-0002-Refactoring-reference-counting.patch)
  download | inline diff:
From c8b90725fd033465c68688f4663892ce1196a48e Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 16:31:00 +0000
Subject: [PATCH 2/5] Refactoring reference counting

This patch refactors the reference counting mechanism moving the
implementation details away from bufmgr. Unfortunately, this
comes with additional calls overhead, but I think that the ease
of maintenance will pay off. And with the next optimisations,
we will end up better than before.

---
 src/backend/storage/buffer/Makefile       |   1 +
 src/backend/storage/buffer/buf_refcount.c | 602 ++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c       | 661 +++-------------------
 src/include/storage/buf_refcount.h        |  58 ++
 4 files changed, 727 insertions(+), 595 deletions(-)
 create mode 100644 src/backend/storage/buffer/buf_refcount.c
 create mode 100644 src/include/storage/buf_refcount.h

diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index fd7c40dcb08..c81271aabf6 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = \
 	buf_init.o \
+	buf_refcount.o \
 	buf_table.o \
 	bufmgr.o \
 	freelist.o \
diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
new file mode 100644
index 00000000000..1c0bec29c93
--- /dev/null
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -0,0 +1,602 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_refcount.c
+ *	  Backend-private buffer refcount tracking
+ *
+ * Each buffer has a private refcount that keeps track of the number of
+ * times the buffer is pinned in the current process.  This is so that the
+ * shared refcount needs to be modified only once if a buffer is pinned more
+ * than once by an individual backend. This mechanism is also used to track
+ * whether this backend has a buffer locked, and, if so, in what mode.
+ *
+ * To avoid - as we used to - requiring an array with NBuffers entries to keep
+ * track of local buffers, we use a small sequentially searched array
+ * (PrivateRefCountArrayKeys, with the corresponding data stored in
+ * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
+ * keep track of backend local pins.
+ *
+ * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
+ * refcounts are kept track of in the array; after that, new array entries
+ * displace old ones into the hash table. That way a frequently used entry
+ * can't get "stuck" in the hashtable while infrequent ones clog the array.
+ *
+ * This was initially designed trying to optimize for the case where the
+ * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES.
+ * However this might not be the case with the introduction of prefetching.
+ *
+ * To enter a buffer into the refcount tracking mechanism first reserve a free
+ * entry using ReservePrivateRefCountEntry() and then later, if necessary,
+ * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
+ * memory allocations in NewPrivateRefCountEntry() which can be important
+ * because in some scenarios it's called with a spinlock held...
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/buf_refcount.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/buf_internals.h"
+#include "storage/buf_refcount.h"
+#include "storage/proc.h"
+#include "utils/hsearch.h"
+
+
+
+typedef struct PrivateRefCountData
+{
+	int32		refcount;
+	BufferLockMode lockmode;
+} PrivateRefCountData;
+
+struct PrivateRefCountEntry
+{
+	Buffer		buffer;
+	PrivateRefCountData data;
+};
+
+struct PrivateRefCountIterator
+{
+	int			array_index;
+	bool		in_hash;
+	HASH_SEQ_STATUS *hash_status;
+};
+
+/* Private refcount array and keys */
+#define REFCOUNT_ARRAY_ENTRIES 8
+static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
+static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
+
+/* Overflow hash table for when array is full */
+static HTAB *PrivateRefCountHash = NULL;
+
+/* Count of entries that have overflowed into the hash table */
+static int32 PrivateRefCountOverflowed = 0;
+
+/* Clock hand for selecting victim when array is full */
+static uint32 PrivateRefCountClock = 0;
+
+/* Reserved slot index, or -1 if none reserved */
+static int	ReservedRefCountSlot = -1;
+
+/* Cache for last accessed entry */
+static int	PrivateRefCountEntryLast = -1;
+
+/* Advisory limit on the number of pins each backend should hold */
+static uint32 MaxProportionalPins = 0;
+
+/* Forward declarations */
+static void ReservePrivateRefCountEntry(void);
+static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+
+/*
+ * Initialize private refcount tracking for this backend.
+ */
+void
+InitPrivateRefCount(void)
+{
+	HASHCTL		hash_ctl;
+
+
+	/*
+	 * An advisory limit on the number of pins each backend should hold, based
+	 * on shared_buffers and the maximum number of connections possible.
+	 * That's very pessimistic, but outside toy-sized shared_buffers it should
+	 * allow plenty of pins.  LimitAdditionalPins() and
+	 * GetAdditionalPinLimit() can be used to check the remaining balance.
+	 */
+	 MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+
+	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
+	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
+
+	hash_ctl.keysize = sizeof(Buffer);
+	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
+
+	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Ensure that the PrivateRefCountArray has sufficient space to store one more
+ * entry.
+ */
+static void
+ReservePrivateRefCountEntry(void)
+{
+	/* Already reserved (or freed), nothing to do */
+	if (ReservedRefCountSlot != -1)
+		return;
+
+	/*
+	 * First search for a free entry the array, that'll be sufficient in the
+	 * majority of cases.
+	 */
+	{
+		int			i;
+
+		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+		{
+			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
+			{
+				ReservedRefCountSlot = i;
+			}
+		}
+
+		if (ReservedRefCountSlot != -1)
+			return;
+	}
+
+	/*
+	 * No luck. All array entries are full. Move one array entry into the hash
+	 * table.
+	 */
+	{
+		int			victim_slot;
+		PrivateRefCountEntry *victim_entry;
+		PrivateRefCountEntry *hashent;
+		bool		found;
+
+		/* select victim slot */
+		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
+		victim_entry = &PrivateRefCountArray[victim_slot];
+		ReservedRefCountSlot = victim_slot;
+
+		/* Better be used, otherwise we shouldn't get here. */
+		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
+		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
+		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
+
+		/* enter victim array entry into hashtable */
+		hashent = hash_search(PrivateRefCountHash,
+							  &PrivateRefCountArrayKeys[victim_slot],
+							  HASH_ENTER,
+							  &found);
+		Assert(!found);
+		hashent->data = victim_entry->data;
+
+		/* clear the now free array slot */
+		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
+		victim_entry->buffer = InvalidBuffer;
+
+		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
+		victim_entry->data.refcount = 0;
+		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+		PrivateRefCountOverflowed++;
+	}
+}
+
+/*
+ * Create a new refcount entry for the given buffer.
+ */
+static PrivateRefCountEntry *
+NewPrivateRefCountEntry(Buffer buffer)
+{
+	PrivateRefCountEntry *res;
+
+	/* only allowed to be called when a reservation has been made */
+	Assert(ReservedRefCountSlot != -1);
+
+	/* use up the reserved entry */
+	res = &PrivateRefCountArray[ReservedRefCountSlot];
+
+	/* and fill it */
+	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+	res->buffer = buffer;
+	res->data.refcount = 0;
+	res->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+	/* update cache for the next lookup */
+	PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+	ReservedRefCountSlot = -1;
+
+	return res;
+}
+
+/*
+ * Slow-path for GetSharedBufferEntry().
+ */
+static pg_noinline PrivateRefCountEntry *
+GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+{
+	PrivateRefCountEntry *res;
+	int			match = -1;
+	int			i;
+
+	/*
+	 * First search for references in the array.
+	 */
+	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	{
+		if (PrivateRefCountArrayKeys[i] == buffer)
+		{
+			match = i;
+		}
+	}
+
+	if (likely(match != -1))
+	{
+		PrivateRefCountEntryLast = match;
+		return &PrivateRefCountArray[match];
+	}
+
+	/*
+	 * Only look up the buffer in the hashtable if we've previously overflowed.
+	 */
+	if (PrivateRefCountOverflowed == 0)
+		return NULL;
+
+	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+
+	if (res == NULL)
+		return NULL;
+	else if (!do_move)
+	{
+		return res;
+	}
+	else
+	{
+		/* move buffer from hashtable into the free array slot */
+		bool		found;
+		PrivateRefCountEntry *free;
+
+		ReservePrivateRefCountEntry();
+
+		Assert(ReservedRefCountSlot != -1);
+		free = &PrivateRefCountArray[ReservedRefCountSlot];
+		Assert(free->buffer == InvalidBuffer);
+
+		free->buffer = buffer;
+		free->data = res->data;
+		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+		PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+		ReservedRefCountSlot = -1;
+
+		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+		Assert(found);
+		Assert(PrivateRefCountOverflowed > 0);
+		PrivateRefCountOverflowed--;
+
+		return free;
+	}
+}
+
+/*
+ * Return the PrivateRefCountEntry for the passed buffer.
+ * Returns NULL if the buffer is not currently pinned.
+ */
+PrivateRefCountEntry *
+GetSharedBufferEntry(Buffer buffer)
+{
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	/* Fast path: check one-entry cache */
+	if (likely(PrivateRefCountEntryLast != -1) &&
+		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
+	{
+		return &PrivateRefCountArray[PrivateRefCountEntryLast];
+	}
+
+	return GetPrivateRefCountEntrySlow(buffer, false);
+}
+
+/*
+ * Increment the private refcount for a shared buffer.
+ * Creates a new entry if one doesn't exist.
+ * Returns the entry pointer.
+ */
+PrivateRefCountEntry *
+SharedBufferRef(Buffer buffer)
+{
+	PrivateRefCountEntry *ref;
+
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	/* Check cache first, then slow path */
+	if (likely(PrivateRefCountEntryLast != -1) &&
+		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
+	{
+		ref = &PrivateRefCountArray[PrivateRefCountEntryLast];
+	}
+	else
+	{
+		ref = GetPrivateRefCountEntrySlow(buffer, true);
+	}
+
+	if (ref == NULL)
+	{
+		/* New pin - create entry */
+		ReservePrivateRefCountEntry();
+		ref = NewPrivateRefCountEntry(buffer);
+	}
+
+	ref->data.refcount++;
+
+	return ref;
+}
+
+/*
+ * Increment the private refcount for an existing entry.
+ * Use when you already have the entry from a previous lookup.
+ */
+void
+SharedBufferRefExisting(PrivateRefCountEntry *ref)
+{
+	Assert(ref != NULL);
+	Assert(ref->data.refcount > 0);
+	ref->data.refcount++;
+}
+
+/*
+ * Decrement the private refcount for a buffer.
+ * If the refcount reaches zero, removes the entry and returns true.
+ * Returns false if the buffer still has references.
+ */
+bool
+SharedBufferUnref(PrivateRefCountEntry *ref)
+{
+	Assert(ref != NULL);
+	Assert(ref->data.refcount > 0);
+
+	ref->data.refcount--;
+
+	if (ref->data.refcount == 0)
+	{
+		/* No more references - clean up the entry */
+		Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+
+		if (ref >= &PrivateRefCountArray[0] &&
+			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
+		{
+			ref->buffer = InvalidBuffer;
+			PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
+			ReservedRefCountSlot = ref - PrivateRefCountArray;
+		}
+		else
+		{
+			bool		found;
+			Buffer		buffer = ref->buffer;
+
+			hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+			Assert(found);
+			Assert(PrivateRefCountOverflowed > 0);
+			PrivateRefCountOverflowed--;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Accessors for refcount entry fields.
+ */
+int32
+SharedBufferRefCount(PrivateRefCountEntry *ref)
+{
+	return ref->data.refcount;
+}
+
+BufferLockMode
+SharedBufferGetLockMode(PrivateRefCountEntry *ref)
+{
+	return ref->data.lockmode;
+}
+
+void
+SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
+{
+	ref->data.lockmode = mode;
+}
+
+Buffer
+SharedBufferGetBuffer(PrivateRefCountEntry *ref)
+{
+	return ref->buffer;
+}
+
+/*
+ * Check for buffer refcount leaks.
+ */
+void
+CheckPrivateRefCountLeaks(void)
+{
+#ifdef USE_ASSERT_CHECKING
+	int			RefCountErrors = 0;
+	PrivateRefCountEntry *res;
+	int			i;
+	char	   *s;
+
+	/* check the array */
+	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	{
+		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
+		{
+			res = &PrivateRefCountArray[i];
+
+			s = DebugPrintBufferRefcount(res->buffer);
+			elog(WARNING, "buffer refcount leak: %s", s);
+			pfree(s);
+
+			RefCountErrors++;
+		}
+	}
+
+	/* if necessary search the hash */
+	if (PrivateRefCountOverflowed)
+	{
+		HASH_SEQ_STATUS hstat;
+
+		hash_seq_init(&hstat, PrivateRefCountHash);
+		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+		{
+			s = DebugPrintBufferRefcount(res->buffer);
+			elog(WARNING, "buffer refcount leak: %s", s);
+			pfree(s);
+			RefCountErrors++;
+		}
+	}
+
+	Assert(RefCountErrors == 0);
+#endif
+}
+
+/*
+ * Initialize an iterator for walking all private refcount entries.
+ */
+PrivateRefCountIterator *
+InitPrivateRefCountIterator(void)
+{
+	PrivateRefCountIterator *iter = palloc(sizeof(PrivateRefCountIterator));
+
+	iter->array_index = 0;
+	iter->in_hash = false;
+	iter->hash_status = NULL;
+	return iter;
+}
+
+/*
+ * Get the next private refcount entry.
+ * Returns NULL when iteration is complete.
+ */
+PrivateRefCountEntry *
+GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
+{
+	/* First iterate through the array */
+	while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
+	{
+		int			idx = iter->array_index++;
+
+		if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
+			return &PrivateRefCountArray[idx];
+	}
+
+	/* Then iterate through the hash if there are overflowed entries */
+	if (!iter->in_hash)
+	{
+		iter->in_hash = true;
+		if (PrivateRefCountOverflowed > 0)
+		{
+			iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
+			hash_seq_init(iter->hash_status, PrivateRefCountHash);
+		}
+	}
+
+	if (iter->hash_status != NULL)
+	{
+		PrivateRefCountEntry *res;
+
+		res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+		if (res != NULL)
+			return res;
+
+		pfree(iter->hash_status);
+		iter->hash_status = NULL;
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an iterator from InitPrivateRefCountIterator.
+ */
+void
+FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
+{
+	if (iter->hash_status != NULL)
+	{
+		hash_seq_term(iter->hash_status);
+		pfree(iter->hash_status);
+	}
+	pfree(iter);
+}
+
+
+/*
+ * Return the maximum number of buffers that a backend should try to pin once,
+ * to avoid exceeding its fair share.  This is the highest value that
+ * GetAdditionalPinLimit() could ever return.  Note that it may be zero on a
+ * system with a very small buffer pool relative to max_connections.
+ */
+ uint32
+ GetPinLimit(void)
+ {
+	 return MaxProportionalPins;
+ }
+
+ /*
+  * Return the maximum number of additional buffers that this backend should
+  * pin if it wants to stay under the per-backend limit, considering the number
+  * of buffers it has already pinned.  Unlike LimitAdditionalPins(), the limit
+  * return by this function can be zero.
+  */
+ uint32
+ GetAdditionalPinLimit(void)
+ {
+	 uint32		estimated_pins_held;
+
+	 /*
+	  * We get the number of "overflowed" pins for free, but don't know the
+	  * number of pins in PrivateRefCountArray.  The cost of calculating that
+	  * exactly doesn't seem worth it, so just assume the max.
+	  */
+	 estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+	 /* Is this backend already holding more than its fair share? */
+	 if (estimated_pins_held > MaxProportionalPins)
+		 return 0;
+
+	 return MaxProportionalPins - estimated_pins_held;
+ }
+
+ /*
+  * Limit the number of pins a batch operation may additionally acquire, to
+  * avoid running out of pinnable buffers.
+  *
+  * One additional pin is always allowed, on the assumption that the operation
+  * requires at least one to make progress.
+  */
+ void
+ LimitAdditionalPins(uint32 *additional_pins)
+ {
+	 uint32		limit;
+
+	 if (*additional_pins <= 1)
+		 return;
+
+	 limit = GetAdditionalPinLimit();
+	 limit = Max(limit, 1);
+	 if (limit < *additional_pins)
+		 *additional_pins = limit;
+ }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5f3d083e938..aa99e97e286 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -54,6 +54,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/buf_refcount.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -93,43 +94,6 @@
  */
 #define BUF_DROP_FULL_SCAN_THRESHOLD		(uint64) (NBuffers / 32)
 
-/*
- * This is separated out from PrivateRefCountEntry to allow for copying all
- * the data members via struct assignment.
- */
-typedef struct PrivateRefCountData
-{
-	/*
-	 * How many times has the buffer been pinned by this backend.
-	 */
-	int32		refcount;
-
-	/*
-	 * Is the buffer locked by this backend? BUFFER_LOCK_UNLOCK indicates that
-	 * the buffer is not locked.
-	 */
-	BufferLockMode lockmode;
-} PrivateRefCountData;
-
-typedef struct PrivateRefCountEntry
-{
-	/*
-	 * Note that this needs to be same as the entry's corresponding
-	 * PrivateRefCountArrayKeys[i], if the entry is stored in the array. We
-	 * store it in both places as this is used for the hashtable key and
-	 * because it is more convenient (passing around a PrivateRefCountEntry
-	 * suffices to identify the buffer) and faster (checking the keys array is
-	 * faster when checking many entries, checking the entry is faster if just
-	 * checking a single entry).
-	 */
-	Buffer		buffer;
-
-	PrivateRefCountData data;
-} PrivateRefCountEntry;
-
-/* 64 bytes, about the size of a cache line on common systems */
-#define REFCOUNT_ARRAY_ENTRIES 8
-
 /*
  * Status of buffers to checkpoint for a particular tablespace, used
  * internally in BufferSync.
@@ -213,55 +177,6 @@ int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
-/*
- * Backend-Private refcount management:
- *
- * Each buffer also has a private refcount that keeps track of the number of
- * times the buffer is pinned in the current process.  This is so that the
- * shared refcount needs to be modified only once if a buffer is pinned more
- * than once by an individual backend.  It's also used to check that no
- * buffers are still pinned at the end of transactions and when exiting. We
- * also use this mechanism to track whether this backend has a buffer locked,
- * and, if so, in what mode.
- *
- *
- * To avoid - as we used to - requiring an array with NBuffers entries to keep
- * track of local buffers, we use a small sequentially searched array
- * (PrivateRefCountArrayKeys, with the corresponding data stored in
- * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
- * keep track of backend local pins.
- *
- * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
- * refcounts are kept track of in the array; after that, new array entries
- * displace old ones into the hash table. That way a frequently used entry
- * can't get "stuck" in the hashtable while infrequent ones clog the array.
- *
- * Note that in most scenarios the number of pinned buffers will not exceed
- * REFCOUNT_ARRAY_ENTRIES.
- *
- *
- * To enter a buffer into the refcount tracking mechanism first reserve a free
- * entry using ReservePrivateRefCountEntry() and then later, if necessary,
- * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
- * memory allocations in NewPrivateRefCountEntry() which can be important
- * because in some scenarios it's called with a spinlock held...
- */
-static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
-static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
-static HTAB *PrivateRefCountHash = NULL;
-static int32 PrivateRefCountOverflowed = 0;
-static uint32 PrivateRefCountClock = 0;
-static int	ReservedRefCountSlot = -1;
-static int	PrivateRefCountEntryLast = -1;
-
-static uint32 MaxProportionalPins;
-
-static void ReservePrivateRefCountEntry(void);
-static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
-static inline int32 GetPrivateRefCount(Buffer buffer);
-static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
-
 /* ResourceOwner callbacks to hold in-progress I/Os and buffer pins */
 static void ResOwnerReleaseBufferIO(Datum res);
 static char *ResOwnerPrintBufferIO(Datum res);
@@ -286,301 +201,6 @@ const ResourceOwnerDesc buffer_resowner_desc =
 	.DebugPrint = ResOwnerPrintBuffer
 };
 
-/*
- * Ensure that the PrivateRefCountArray has sufficient space to store one more
- * entry. This has to be called before using NewPrivateRefCountEntry() to fill
- * a new entry - but it's perfectly fine to not use a reserved entry.
- */
-static void
-ReservePrivateRefCountEntry(void)
-{
-	/* Already reserved (or freed), nothing to do */
-	if (ReservedRefCountSlot != -1)
-		return;
-
-	/*
-	 * First search for a free entry the array, that'll be sufficient in the
-	 * majority of cases.
-	 */
-	{
-		int			i;
-
-		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-		{
-			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
-			{
-				ReservedRefCountSlot = i;
-
-				/*
-				 * We could return immediately, but iterating till the end of
-				 * the array allows compiler-autovectorization.
-				 */
-			}
-		}
-
-		if (ReservedRefCountSlot != -1)
-			return;
-	}
-
-	/*
-	 * No luck. All array entries are full. Move one array entry into the hash
-	 * table.
-	 */
-	{
-		/*
-		 * Move entry from the current clock position in the array into the
-		 * hashtable. Use that slot.
-		 */
-		int			victim_slot;
-		PrivateRefCountEntry *victim_entry;
-		PrivateRefCountEntry *hashent;
-		bool		found;
-
-		/* select victim slot */
-		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
-		victim_entry = &PrivateRefCountArray[victim_slot];
-		ReservedRefCountSlot = victim_slot;
-
-		/* Better be used, otherwise we shouldn't get here. */
-		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
-		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
-		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
-
-		/* enter victim array entry into hashtable */
-		hashent = hash_search(PrivateRefCountHash,
-							  &PrivateRefCountArrayKeys[victim_slot],
-							  HASH_ENTER,
-							  &found);
-		Assert(!found);
-		/* move data from the entry in the array to the hash entry */
-		hashent->data = victim_entry->data;
-
-		/* clear the now free array slot */
-		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
-		victim_entry->buffer = InvalidBuffer;
-
-		/* clear the whole data member, just for future proofing */
-		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
-		victim_entry->data.refcount = 0;
-		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
-
-		PrivateRefCountOverflowed++;
-	}
-}
-
-/*
- * Fill a previously reserved refcount entry.
- */
-static PrivateRefCountEntry *
-NewPrivateRefCountEntry(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-
-	/* only allowed to be called when a reservation has been made */
-	Assert(ReservedRefCountSlot != -1);
-
-	/* use up the reserved entry */
-	res = &PrivateRefCountArray[ReservedRefCountSlot];
-
-	/* and fill it */
-	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-	res->buffer = buffer;
-	res->data.refcount = 0;
-	res->data.lockmode = BUFFER_LOCK_UNLOCK;
-
-	/* update cache for the next lookup */
-	PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-	ReservedRefCountSlot = -1;
-
-	return res;
-}
-
-/*
- * Slow-path for GetPrivateRefCountEntry(). This is big enough to not be worth
- * inlining. This particularly seems to be true if the compiler is capable of
- * auto-vectorizing the code, as that imposes additional stack-alignment
- * requirements etc.
- */
-static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
-{
-	PrivateRefCountEntry *res;
-	int			match = -1;
-	int			i;
-
-	/*
-	 * First search for references in the array, that'll be sufficient in the
-	 * majority of cases.
-	 */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] == buffer)
-		{
-			match = i;
-			/* see ReservePrivateRefCountEntry() for why we don't return */
-		}
-	}
-
-	if (likely(match != -1))
-	{
-		/* update cache for the next lookup */
-		PrivateRefCountEntryLast = match;
-
-		return &PrivateRefCountArray[match];
-	}
-
-	/*
-	 * By here we know that the buffer, if already pinned, isn't residing in
-	 * the array.
-	 *
-	 * Only look up the buffer in the hashtable if we've previously overflowed
-	 * into it.
-	 */
-	if (PrivateRefCountOverflowed == 0)
-		return NULL;
-
-	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
-
-	if (res == NULL)
-		return NULL;
-	else if (!do_move)
-	{
-		/* caller doesn't want us to move the hash entry into the array */
-		return res;
-	}
-	else
-	{
-		/* move buffer from hashtable into the free array slot */
-		bool		found;
-		PrivateRefCountEntry *free;
-
-		/* Ensure there's a free array slot */
-		ReservePrivateRefCountEntry();
-
-		/* Use up the reserved slot */
-		Assert(ReservedRefCountSlot != -1);
-		free = &PrivateRefCountArray[ReservedRefCountSlot];
-		Assert(PrivateRefCountArrayKeys[ReservedRefCountSlot] == free->buffer);
-		Assert(free->buffer == InvalidBuffer);
-
-		/* and fill it */
-		free->buffer = buffer;
-		free->data = res->data;
-		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-		/* update cache for the next lookup */
-		PrivateRefCountEntryLast = match;
-
-		ReservedRefCountSlot = -1;
-
-
-		/* delete from hashtable */
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-
-		return free;
-	}
-}
-
-/*
- * Return the PrivateRefCount entry for the passed buffer.
- *
- * Returns NULL if a buffer doesn't have a refcount entry. Otherwise, if
- * do_move is true, and the entry resides in the hashtable the entry is
- * optimized for frequent access by moving it to the array.
- */
-static inline PrivateRefCountEntry *
-GetPrivateRefCountEntry(Buffer buffer, bool do_move)
-{
-	Assert(BufferIsValid(buffer));
-	Assert(!BufferIsLocal(buffer));
-
-	/*
-	 * It's very common to look up the same buffer repeatedly. To make that
-	 * fast, we have a one-entry cache.
-	 *
-	 * In contrast to the loop in GetPrivateRefCountEntrySlow(), here it
-	 * faster to check PrivateRefCountArray[].buffer, as in the case of a hit
-	 * fewer addresses are computed and fewer cachelines are accessed. Whereas
-	 * in GetPrivateRefCountEntrySlow()'s case, checking
-	 * PrivateRefCountArrayKeys saves a lot of memory accesses.
-	 */
-	if (likely(PrivateRefCountEntryLast != -1) &&
-		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
-	{
-		return &PrivateRefCountArray[PrivateRefCountEntryLast];
-	}
-
-	/*
-	 * The code for the cached lookup is small enough to be worth inlining
-	 * into the caller. In the miss case however, that empirically doesn't
-	 * seem worth it.
-	 */
-	return GetPrivateRefCountEntrySlow(buffer, do_move);
-}
-
-/*
- * Returns how many times the passed buffer is pinned by this backend.
- *
- * Only works for shared memory buffers!
- */
-static inline int32
-GetPrivateRefCount(Buffer buffer)
-{
-	PrivateRefCountEntry *ref;
-
-	Assert(BufferIsValid(buffer));
-	Assert(!BufferIsLocal(buffer));
-
-	/*
-	 * Not moving the entry - that's ok for the current users, but we might
-	 * want to change this one day.
-	 */
-	ref = GetPrivateRefCountEntry(buffer, false);
-
-	if (ref == NULL)
-		return 0;
-	return ref->data.refcount;
-}
-
-/*
- * Release resources used to track the reference count of a buffer which we no
- * longer have pinned and don't want to pin again immediately.
- */
-static void
-ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
-{
-	Assert(ref->data.refcount == 0);
-	Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
-
-	if (ref >= &PrivateRefCountArray[0] &&
-		ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
-	{
-		ref->buffer = InvalidBuffer;
-		PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
-
-
-		/*
-		 * Mark the just used entry as reserved - in many scenarios that
-		 * allows us to avoid ever having to search the array/hash for free
-		 * entries.
-		 */
-		ReservedRefCountSlot = ref - PrivateRefCountArray;
-	}
-	else
-	{
-		bool		found;
-		Buffer		buffer = ref->buffer;
-
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-	}
-}
-
 /*
  * BufferIsPinned
  *		True iff the buffer is pinned (also checks for valid buffer number).
@@ -596,7 +216,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 		BufferIsLocal(bufnum) ? \
 			(LocalRefCount[-(bufnum) - 1] > 0) \
 		: \
-	(GetPrivateRefCount(bufnum) > 0) \
+	(GetSharedBufferEntry(bufnum) != NULL) \
 )
 
 
@@ -653,7 +273,6 @@ static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 										   RelFileLocator dstlocator,
 										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
-static void CheckForBufferLeaks(void);
 #ifdef USE_ASSERT_CHECKING
 static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode);
 #endif
@@ -812,7 +431,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	Assert(BufferIsValid(recent_buffer));
 
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 	InitBufferTag(&tag, &rlocator, forkNum, blockNum);
 
 	if (BufferIsLocal(recent_buffer))
@@ -2115,7 +1733,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Make sure we will have room to remember the buffer pin */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	/* create a tag so we can lookup the buffer */
 	InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -2327,7 +1944,7 @@ retry:
 		UnlockBufHdr(buf);
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
-		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
+		if (GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) != NULL)
 			elog(ERROR, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
@@ -2380,7 +1997,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
 	LWLock	   *partition_lock;
 	BufferTag	tag;
 
-	Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+	Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)) != NULL);
 
 	/* have buffer pinned, so it's safe to read tag without lock */
 	tag = buf_hdr->tag;
@@ -2461,7 +2078,6 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	 * Ensure, before we pin a victim buffer, that there's a free refcount
 	 * entry and resource owner slot for the pin.
 	 */
-	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/* we return here if a prospective victim buffer gets used concurrently */
@@ -2595,64 +2211,6 @@ again:
 	return buf;
 }
 
-/*
- * Return the maximum number of buffers that a backend should try to pin once,
- * to avoid exceeding its fair share.  This is the highest value that
- * GetAdditionalPinLimit() could ever return.  Note that it may be zero on a
- * system with a very small buffer pool relative to max_connections.
- */
-uint32
-GetPinLimit(void)
-{
-	return MaxProportionalPins;
-}
-
-/*
- * Return the maximum number of additional buffers that this backend should
- * pin if it wants to stay under the per-backend limit, considering the number
- * of buffers it has already pinned.  Unlike LimitAdditionalPins(), the limit
- * return by this function can be zero.
- */
-uint32
-GetAdditionalPinLimit(void)
-{
-	uint32		estimated_pins_held;
-
-	/*
-	 * We get the number of "overflowed" pins for free, but don't know the
-	 * number of pins in PrivateRefCountArray.  The cost of calculating that
-	 * exactly doesn't seem worth it, so just assume the max.
-	 */
-	estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
-
-	/* Is this backend already holding more than its fair share? */
-	if (estimated_pins_held > MaxProportionalPins)
-		return 0;
-
-	return MaxProportionalPins - estimated_pins_held;
-}
-
-/*
- * Limit the number of pins a batch operation may additionally acquire, to
- * avoid running out of pinnable buffers.
- *
- * One additional pin is always allowed, on the assumption that the operation
- * requires at least one to make progress.
- */
-void
-LimitAdditionalPins(uint32 *additional_pins)
-{
-	uint32		limit;
-
-	if (*additional_pins <= 1)
-		return;
-
-	limit = GetAdditionalPinLimit();
-	limit = Max(limit, 1);
-	if (limit < *additional_pins)
-		*additional_pins = limit;
-}
-
 /*
  * Logic shared between ExtendBufferedRelBy(), ExtendBufferedRelTo(). Just to
  * avoid duplicating the tracing and relpersistence related logic.
@@ -2816,7 +2374,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 		/* in case we need to pin an existing buffer below */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		InitBufferTag(&tag, &BMR_GET_SMGR(bmr)->smgr_rlocator.locator, fork,
 					  first_block + i);
@@ -3188,9 +2745,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 	PrivateRefCountEntry *ref;
 
 	Assert(!BufferIsLocal(b));
-	Assert(ReservedRefCountSlot != -1);
 
-	ref = GetPrivateRefCountEntry(b, true);
+	ref = GetSharedBufferEntry(b);
 
 	if (ref == NULL)
 	{
@@ -3260,8 +2816,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 		 */
 		result = (pg_atomic_read_u64(&buf->state) & BM_VALID) != 0;
 
-		Assert(ref->data.refcount > 0);
-		ref->data.refcount++;
+		SharedBufferRefExisting(ref);
 		ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
 	}
 
@@ -3299,7 +2854,7 @@ PinBuffer_Locked(BufferDesc *buf)
 	 * As explained, We don't expect any preexisting pins. That allows us to
 	 * manipulate the PrivateRefCount after releasing the spinlock
 	 */
-	Assert(GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf), false) == NULL);
+	Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) == NULL);
 
 	/*
 	 * Since we hold the buffer spinlock, we can update the buffer state and
@@ -3376,11 +2931,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
 	Assert(!BufferIsLocal(b));
 
 	/* not moving as we're likely deleting it soon anyway */
-	ref = GetPrivateRefCountEntry(b, false);
+	ref = GetSharedBufferEntry(b);
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
-	ref->data.refcount--;
-	if (ref->data.refcount == 0)
+
+	if (SharedBufferUnref(ref))
 	{
 		uint64		old_buf_state;
 
@@ -3405,8 +2959,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 		/* Support LockBufferForCleanup() */
 		if (old_buf_state & BM_PIN_COUNT_WAITER)
 			WakePinCountWaiter(buf);
-
-		ForgetPrivateRefCountEntry(ref);
 	}
 }
 
@@ -3417,10 +2969,7 @@ UnpinBufferNoOwner(BufferDesc *buf)
 inline void
 TrackNewBufferPin(Buffer buf)
 {
-	PrivateRefCountEntry *ref;
-
-	ref = NewPrivateRefCountEntry(buf);
-	ref->data.refcount++;
+	SharedBufferRef(buf);
 
 	ResourceOwnerRememberBuffer(CurrentResourceOwner, buf);
 
@@ -4040,7 +3589,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	BufferTag	tag;
 
 	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/*
@@ -4104,11 +3652,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 void
 AtEOXact_Buffers(bool isCommit)
 {
-	CheckForBufferLeaks();
+	CheckPrivateRefCountLeaks();
 
 	AtEOXact_LocalBuffers(isCommit);
-
-	Assert(PrivateRefCountOverflowed == 0);
 }
 
 /*
@@ -4121,25 +3667,8 @@ AtEOXact_Buffers(bool isCommit)
 void
 InitBufferManagerAccess(void)
 {
-	HASHCTL		hash_ctl;
-
-	/*
-	 * An advisory limit on the number of pins each backend should hold, based
-	 * on shared_buffers and the maximum number of connections possible.
-	 * That's very pessimistic, but outside toy-sized shared_buffers it should
-	 * allow plenty of pins.  LimitAdditionalPins() and
-	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	 */
-	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
-	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
-	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
-
-	hash_ctl.keysize = sizeof(Buffer);
-	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
 
-	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
-									  HASH_ELEM | HASH_BLOBS);
+	InitPrivateRefCount();
 
 	/*
 	 * AtProcExit_Buffers needs LWLock access, and thereby has to be called at
@@ -4158,62 +3687,12 @@ AtProcExit_Buffers(int code, Datum arg)
 {
 	UnlockBuffers();
 
-	CheckForBufferLeaks();
+	CheckPrivateRefCountLeaks();
 
 	/* localbuf.c needs a chance too */
 	AtProcExit_LocalBuffers();
 }
 
-/*
- *		CheckForBufferLeaks - ensure this backend holds no buffer pins
- *
- *		As of PostgreSQL 8.0, buffer pins should get released by the
- *		ResourceOwner mechanism.  This routine is just a debugging
- *		cross-check that no pins remain.
- */
-static void
-CheckForBufferLeaks(void)
-{
-#ifdef USE_ASSERT_CHECKING
-	int			RefCountErrors = 0;
-	PrivateRefCountEntry *res;
-	int			i;
-	char	   *s;
-
-	/* check the array */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
-		{
-			res = &PrivateRefCountArray[i];
-
-			s = DebugPrintBufferRefcount(res->buffer);
-			elog(WARNING, "buffer refcount leak: %s", s);
-			pfree(s);
-
-			RefCountErrors++;
-		}
-	}
-
-	/* if necessary search the hash */
-	if (PrivateRefCountOverflowed)
-	{
-		HASH_SEQ_STATUS hstat;
-
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
-		{
-			s = DebugPrintBufferRefcount(res->buffer);
-			elog(WARNING, "buffer refcount leak: %s", s);
-			pfree(s);
-			RefCountErrors++;
-		}
-	}
-
-	Assert(RefCountErrors == 0);
-#endif
-}
-
 #ifdef USE_ASSERT_CHECKING
 /*
  * Check for exclusive-locked catalog buffers.  This is the core of
@@ -4235,33 +3714,20 @@ CheckForBufferLeaks(void)
 void
 AssertBufferLocksPermitCatalogRead(void)
 {
+	PrivateRefCountIterator *iter;
 	PrivateRefCountEntry *res;
 
-	/* check the array */
-	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	iter = InitPrivateRefCountIterator();
+	while ((res = GetNextPrivateRefCountEntry(iter)) != NULL)
 	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
-		{
-			res = &PrivateRefCountArray[i];
-
-			if (res->buffer == InvalidBuffer)
-				continue;
-
-			AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
-		}
-	}
+		Buffer		buf = SharedBufferGetBuffer(res);
 
-	/* if necessary search the hash */
-	if (PrivateRefCountOverflowed)
-	{
-		HASH_SEQ_STATUS hstat;
+		if (buf == InvalidBuffer)
+			continue;
 
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
-		{
-			AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
-		}
+		AssertNotCatalogBufferLock(buf, SharedBufferGetLockMode(res));
 	}
+	FreePrivateRefCountIterator(iter);
 }
 
 static void
@@ -4315,8 +3781,10 @@ DebugPrintBufferRefcount(Buffer buffer)
 	}
 	else
 	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
 		buf = GetBufferDescriptor(buffer - 1);
-		loccount = GetPrivateRefCount(buffer);
+		loccount = ref ? SharedBufferRefCount(ref) : 0;
 		backend = INVALID_PROC_NUMBER;
 	}
 
@@ -5102,7 +4570,6 @@ FlushRelationBuffers(Relation rel)
 				error_context_stack = &errcallback;
 
 				/* Make sure we can handle the pin */
-				ReservePrivateRefCountEntry();
 				ResourceOwnerEnlarge(CurrentResourceOwner);
 
 				/*
@@ -5138,7 +4605,6 @@ FlushRelationBuffers(Relation rel)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5233,7 +4699,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5459,7 +4924,6 @@ FlushDatabaseBuffers(Oid dbid)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5534,17 +4998,18 @@ UnlockReleaseBuffer(Buffer buffer)
 void
 IncrBufferRefCount(Buffer buffer)
 {
-	Assert(BufferIsPinned(buffer));
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 	if (BufferIsLocal(buffer))
+	{
+		Assert(LocalRefCount[-buffer - 1] > 0);
 		LocalRefCount[-buffer - 1]++;
+	}
 	else
 	{
-		PrivateRefCountEntry *ref;
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
 
-		ref = GetPrivateRefCountEntry(buffer, true);
 		Assert(ref != NULL);
-		ref->data.refcount++;
+		SharedBufferRefExisting(ref);
 	}
 	ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
 }
@@ -5580,7 +5045,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 
 	bufHdr = GetBufferDescriptor(buffer - 1);
 
-	Assert(GetPrivateRefCount(buffer) > 0);
+	Assert(GetSharedBufferEntry(buffer) != NULL);
 	/* here, either share or exclusive lock is OK */
 	Assert(BufferIsLockedByMe(buffer));
 
@@ -5763,12 +5228,12 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	 * Get reference to the refcount entry before we hold the lock, it seems
 	 * better to do before holding the lock.
 	 */
-	entry = GetPrivateRefCountEntry(buffer, true);
+	entry = GetSharedBufferEntry(buffer);
 
 	/*
 	 * We better not already hold a lock on the buffer.
 	 */
-	Assert(entry->data.lockmode == BUFFER_LOCK_UNLOCK);
+	Assert(SharedBufferGetLockMode(entry) == BUFFER_LOCK_UNLOCK);
 
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
@@ -5857,7 +5322,7 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	}
 
 	/* Remember that we now hold this lock */
-	entry->data.lockmode = mode;
+	SharedBufferSetLockMode(entry, mode);
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -5908,7 +5373,7 @@ BufferLockUnlock(Buffer buffer, BufferDesc *buf_hdr)
 static bool
 BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 {
-	PrivateRefCountEntry *entry = GetPrivateRefCountEntry(buffer, true);
+	PrivateRefCountEntry *entry = GetSharedBufferEntry(buffer);
 	bool		mustwait;
 
 	/*
@@ -5916,7 +5381,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	 * already has locked, return false, independent of the existing and
 	 * desired lock level.
 	 */
-	if (entry->data.lockmode != BUFFER_LOCK_UNLOCK)
+	if (SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK)
 		return false;
 
 	/*
@@ -5936,7 +5401,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	}
 	else
 	{
-		entry->data.lockmode = mode;
+		SharedBufferSetLockMode(entry, mode);
 	}
 
 	return !mustwait;
@@ -6146,11 +5611,11 @@ BufferLockDisownInternal(Buffer buffer, BufferDesc *buf_hdr)
 	BufferLockMode mode;
 	PrivateRefCountEntry *ref;
 
-	ref = GetPrivateRefCountEntry(buffer, false);
+	ref = GetSharedBufferEntry(buffer);
 	if (ref == NULL)
 		elog(ERROR, "lock %d is not held", buffer);
-	mode = ref->data.lockmode;
-	ref->data.lockmode = BUFFER_LOCK_UNLOCK;
+	mode = SharedBufferGetLockMode(ref);
+	SharedBufferSetLockMode(ref, BUFFER_LOCK_UNLOCK);
 
 	return mode;
 }
@@ -6384,12 +5849,12 @@ static bool
 BufferLockHeldByMeInMode(BufferDesc *buf_hdr, BufferLockMode mode)
 {
 	PrivateRefCountEntry *entry =
-		GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+		GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
 
 	if (!entry)
 		return false;
 	else
-		return entry->data.lockmode == mode;
+		return SharedBufferGetLockMode(entry) == mode;
 }
 
 /*
@@ -6402,12 +5867,12 @@ static bool
 BufferLockHeldByMe(BufferDesc *buf_hdr)
 {
 	PrivateRefCountEntry *entry =
-		GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+		GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
 
 	if (!entry)
 		return false;
 	else
-		return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
+		return SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK;
 }
 
 /*
@@ -6503,9 +5968,13 @@ CheckBufferIsPinnedOnce(Buffer buffer)
 	}
 	else
 	{
-		if (GetPrivateRefCount(buffer) != 1)
-			elog(ERROR, "incorrect local pin count: %d",
-				 GetPrivateRefCount(buffer));
+		{
+			PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+			int32		refcount = ref ? SharedBufferRefCount(ref) : 0;
+
+			if (refcount != 1)
+				elog(ERROR, "incorrect local pin count: %d", refcount);
+		}
 	}
 }
 
@@ -6686,7 +6155,7 @@ HoldingBufferPinThatDelaysRecovery(void)
 	if (bufid < 0)
 		return false;
 
-	if (GetPrivateRefCount(bufid + 1) > 0)
+	if (GetSharedBufferEntry(bufid + 1) != NULL)
 		return true;
 
 	return false;
@@ -6721,8 +6190,12 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	}
 
 	/* There should be exactly one local pin */
-	refcount = GetPrivateRefCount(buffer);
-	Assert(refcount);
+	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+		refcount = ref ? SharedBufferRefCount(ref) : 0;
+		Assert(refcount);
+	}
 	if (refcount != 1)
 		return false;
 
@@ -6776,8 +6249,12 @@ IsBufferCleanupOK(Buffer buffer)
 	}
 
 	/* There should be exactly one local pin */
-	if (GetPrivateRefCount(buffer) != 1)
-		return false;
+	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+		if (!ref || SharedBufferRefCount(ref) != 1)
+			return false;
+	}
 
 	bufHdr = GetBufferDescriptor(buffer - 1);
 
@@ -7447,7 +6924,7 @@ ResOwnerReleaseBuffer(Datum res)
 	{
 		PrivateRefCountEntry *ref;
 
-		ref = GetPrivateRefCountEntry(buffer, false);
+		ref = GetSharedBufferEntry(buffer);
 
 		/* not having a private refcount would imply resowner corruption */
 		Assert(ref != NULL);
@@ -7456,7 +6933,7 @@ ResOwnerReleaseBuffer(Datum res)
 		 * If the buffer was locked at the time of the resowner release,
 		 * release the lock now. This should only happen after errors.
 		 */
-		if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
+		if (SharedBufferGetLockMode(ref) != BUFFER_LOCK_UNLOCK)
 		{
 			BufferDesc *buf = GetBufferDescriptor(buffer - 1);
 
@@ -7549,7 +7026,6 @@ EvictUnpinnedBuffer(Buffer buf, bool *buffer_flushed)
 
 	/* Make sure we can pin the buffer. */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	desc = GetBufferDescriptor(buf - 1);
 	LockBufHdr(desc);
@@ -7590,7 +7066,6 @@ EvictAllUnpinnedBuffers(int32 *buffers_evicted, int32 *buffers_flushed,
 			continue;
 
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		LockBufHdr(desc);
 
@@ -7644,7 +7119,6 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted,
 
 		/* Make sure we can pin the buffer. */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(desc);
 
@@ -7736,7 +7210,6 @@ MarkDirtyUnpinnedBuffer(Buffer buf, bool *buffer_already_dirty)
 
 	/* Make sure we can pin the buffer. */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	desc = GetBufferDescriptor(buf - 1);
 	LockBufHdr(desc);
@@ -7789,7 +7262,6 @@ MarkDirtyRelUnpinnedBuffers(Relation rel,
 
 		/* Make sure we can pin the buffer. */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(desc);
 
@@ -7841,7 +7313,6 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied,
 			continue;
 
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		LockBufHdr(desc);
 
diff --git a/src/include/storage/buf_refcount.h b/src/include/storage/buf_refcount.h
new file mode 100644
index 00000000000..842760ad2ee
--- /dev/null
+++ b/src/include/storage/buf_refcount.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_refcount.h
+ *	  Backend-private buffer refcount tracking
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/buf_refcount.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUF_REFCOUNT_H
+#define BUF_REFCOUNT_H
+
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+
+/* Opaque handle to a private refcount entry */
+typedef struct PrivateRefCountEntry PrivateRefCountEntry;
+
+/* Initialization */
+extern void InitPrivateRefCount(void);
+
+/* Pure lookup */
+extern PrivateRefCountEntry *GetSharedBufferEntry(Buffer buffer);
+
+/* Reference counting - complex operations */
+extern PrivateRefCountEntry *SharedBufferRef(Buffer buffer);
+extern void SharedBufferRefExisting(PrivateRefCountEntry *ref);
+extern bool SharedBufferUnref(PrivateRefCountEntry *ref);
+
+/* Accessors  */
+extern int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
+extern BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref);
+extern void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode);
+extern Buffer SharedBufferGetBuffer(PrivateRefCountEntry *ref);
+
+/* Pin limiting */
+extern uint32 GetPinLimit(void);
+extern uint32 GetAdditionalPinLimit(void);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+
+/* Leak checking */
+extern void CheckPrivateRefCountLeaks(void);
+
+/*
+ * Iterator for walking all private refcount entries.
+ * Used by assertion checking code in bufmgr.c.
+ */
+typedef struct PrivateRefCountIterator PrivateRefCountIterator;
+
+extern PrivateRefCountIterator *InitPrivateRefCountIterator(void);
+extern PrivateRefCountEntry *GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter);
+extern void FreePrivateRefCountIterator(PrivateRefCountIterator *iter);
+
+
+#endif							/* BUF_REFCOUNT_H */
-- 
2.53.0



  [application/octet-stream] v1-0001-Benchmark-buffer-pinning.patch (26.6K, 6-v1-0001-Benchmark-buffer-pinning.patch)
  download | inline diff:
From 272fa376cf33c5e3d8fefc3b39678d01925cba1f Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 4 Mar 2026 15:13:53 +0000
Subject: [PATCH] Benchmark buffer pinning

Introduces a benchmark facility that will be used to establish a baseline
and evaluate each of the subsequent patches. It includes a test module, and
a convenience python script to run and plot the results.
---
 .gitignore                                    |   3 +
 src/test/modules/test_buffer_pin/Makefile     |  18 +
 src/test/modules/test_buffer_pin/benchmark.py | 190 ++++++++
 .../modules/test_buffer_pin/requirements.txt  |   9 +
 .../test_buffer_pin/test_buffer_pin--1.0.sql  |  88 ++++
 .../modules/test_buffer_pin/test_buffer_pin.c | 421 ++++++++++++++++++
 .../test_buffer_pin/test_buffer_pin.control   |   4 +
 7 files changed, 733 insertions(+)
 create mode 100644 src/test/modules/test_buffer_pin/Makefile
 create mode 100755 src/test/modules/test_buffer_pin/benchmark.py
 create mode 100644 src/test/modules/test_buffer_pin/requirements.txt
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.c
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.control

diff --git a/.gitignore b/.gitignore
index 4e911395fe3..bc7a0314380 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,6 @@ lib*.pc
 /Release/
 /tmp_install/
 /portlock/
+
+# hidden files
+.*
\ No newline at end of file
diff --git a/src/test/modules/test_buffer_pin/Makefile b/src/test/modules/test_buffer_pin/Makefile
new file mode 100644
index 00000000000..2707f84c5b0
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/Makefile
@@ -0,0 +1,18 @@
+MODULE_big = test_buffer_pin
+OBJS = test_buffer_pin.o
+
+EXTENSION = test_buffer_pin
+DATA = test_buffer_pin--1.0.sql
+
+PGFILEDESC = "test_buffer_pin - buffer pinning benchmarks"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_buffer_pin
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_buffer_pin/benchmark.py b/src/test/modules/test_buffer_pin/benchmark.py
new file mode 100755
index 00000000000..0cf795a50c1
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/benchmark.py
@@ -0,0 +1,190 @@
+#!/usr/bin/env python3
+"""
+Buffer pinning benchmark - runs SQL benchmarks and saves results.
+
+Usage:
+    python3 benchmark.py [options]
+
+Examples:
+    python3 benchmark.py --port 5434 --name pg18-baseline --title "PostgreSQL 18 Baseline"
+    python3 benchmark.py --port 5434 --name patch-0005 --title "With Patch 0005"
+"""
+
+import argparse
+import numpy as np
+import pandas as pd
+from sqlalchemy import create_engine, text
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+
+def geomspace_int(start, stop, num):
+    """Generate geometrically spaced integers, pushing duplicates forward."""
+    raw = np.geomspace(start, stop, num)
+    result = []
+    for v in raw:
+        candidate = max(int(round(v)), result[-1] + 1 if result else 1)
+        result.append(candidate)
+    return result
+
+
+def ensure_test_table(conn, table_name, min_blocks):
+    """Create a table with enough blocks for the benchmark."""
+    rows_needed = min_blocks * 226 + 10000
+    print(f"Creating table {table_name} with ~{min_blocks} blocks...")
+    conn.execute(text(f"DROP TABLE IF EXISTS {table_name}"))
+    conn.execute(text(f"CREATE UNLOGGED TABLE {table_name} AS SELECT generate_series(1, {rows_needed}) AS id"))
+    result = conn.execute(text(f"SELECT pg_relation_size('{table_name}') / 8192 as blocks")).fetchone()
+    assert result[0] >= min_blocks, f"Table {table_name} has {result[0]} blocks, expected at least {min_blocks}"
+
+def run_benchmark(conn, func_template, buffer_counts):
+    """Run a benchmark function across all buffer counts and patterns in one query."""
+    array_str = str(buffer_counts)
+    sql = text(f"""
+        SELECT
+            CASE WHEN r.random THEN 'random' ELSE 'sequential' END as pattern,
+            b.num_buffers,
+            percentile_cont(0.5) WITHIN GROUP (ORDER BY bench.per_op_ns) as median_ns
+        FROM
+            unnest(ARRAY[false, true]) AS r(random),
+            unnest(ARRAY[{array_str}]) AS b(num_buffers),
+            LATERAL {func_template} AS bench
+        WHERE bench.iteration > 0
+        GROUP BY r.random, b.num_buffers
+        ORDER BY r.random, b.num_buffers
+    """)
+    return [dict(row._mapping) for row in conn.execute(sql)]
+
+
+def save_results(results_dir, name, results_dict, title):
+    """Save benchmark results to CSV and generate SVG plot."""
+    results_dir = Path(results_dir)
+    results_dir.mkdir(parents=True, exist_ok=True)
+
+    all_data = []
+    for bench_name, data in results_dict.items():
+        df = pd.DataFrame(data)
+        df['benchmark'] = bench_name
+        all_data.append(df)
+
+    combined_df = pd.concat(all_data, ignore_index=True)
+
+    csv_path = results_dir / f"{name}.csv"
+    combined_df.to_csv(csv_path, index=False)
+    print(f"Saved CSV: {csv_path}")
+    return combined_df
+
+
+def plot_results(results_dir, name, results_dict, title):
+    """Generate SVG plot from benchmark results."""
+    results_dir = Path(results_dir)
+
+    fig, ax = plt.subplots(figsize=(10, 6))
+
+    benchmarks = [
+        ('read', 'blue', 'o', 'ReadBuffer/ReleaseBuffer'),
+        ('pinning', 'green', 's', 'IncrBufferRefCount/ReleaseBuffer'),
+        ('locking', 'purple', 'd', 'LockBuffer/UnlockBuffer'),
+        ('resowner', 'red', '^', 'ResourceOwner Remember/Forget'),
+    ]
+
+    for bench_name, color, marker, label in benchmarks:
+        if bench_name not in results_dict:
+            continue
+        df = pd.DataFrame(results_dict[bench_name])
+
+        seq_data = df[df['pattern'] == 'sequential']
+        ax.plot(seq_data['num_buffers'], seq_data['median_ns'],
+                color=color, linestyle='-', marker=marker,
+                linewidth=2, markersize=6, label=f'{label} (seq)')
+
+        rand_data = df[df['pattern'] == 'random']
+        ax.plot(rand_data['num_buffers'], rand_data['median_ns'],
+                color=color, linestyle='--', marker=marker,
+                linewidth=2, markersize=6, label=f'{label} (rand)')
+
+    ax.set_xlabel('Number of Buffers (sliding window)')
+    ax.set_ylabel('Time per operation (ns)')
+    ax.set_title(title)
+    ax.set_ylim(0, None)
+    ax.grid(True, alpha=0.3)
+    ax.set_xscale('log', base=2)
+    ax.legend(fontsize=8, loc='upper left')
+
+    plt.tight_layout()
+    svg_path = results_dir / f"{name}.svg"
+    plt.savefig(svg_path, format='svg')
+    plt.close()
+    print(f"Saved SVG: {svg_path}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Buffer pinning benchmark')
+    parser.add_argument('--iterations', '-n', type=int, default=10000,
+                        help='Number of operations per sample (default: 10000)')
+    parser.add_argument('--samples', type=int, default=50,
+                        help='Number of samples per data point (default: 50)')
+    parser.add_argument('--port', type=int, default=5432,
+                        help='PostgreSQL port (default: 5432)')
+    parser.add_argument('--points', type=int, default=50,
+                        help='Number of data points (default: 200)')
+    parser.add_argument('--max-dist', type=int, default=500,
+                        help='Maximum buffer count to test (default: 512)')
+    parser.add_argument('--name', default='benchmark',
+                        help='Name for output files (default: benchmark)')
+    parser.add_argument('--title', default='PostgreSQL Buffer Pinning Performance',
+                        help='Title for the plot')
+    parser.add_argument('--results-dir', default=None,
+                        help='Directory for results (default: ./results)')
+    parser.add_argument('--table', default='bench_large',
+                        help='Table name to use for benchmarks (default: bench_large)')
+    args = parser.parse_args()
+
+    if args.results_dir is None:
+        args.results_dir = Path(__file__).parent / 'results'
+
+    engine = create_engine(f'postgresql://localhost:{args.port}/postgres')
+    buffer_counts = geomspace_int(1, args.max_dist, args.points)
+
+    min_blocks_needed = args.max_dist + 100
+
+    with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
+        conn.execute(text("CREATE EXTENSION IF NOT EXISTS test_buffer_pin"))
+        ensure_test_table(conn, args.table, min_blocks_needed)
+
+
+        n, s = args.iterations, args.samples
+        print(f"\nRunning benchmarks: {n} iterations, {s} samples, max {args.max_dist} buffers")
+
+        results = {}
+
+        print("  Running: bench_pinning...")
+        results['pinning'] = run_benchmark(conn,
+            f"bench_pinning('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_locking...")
+        results['locking'] = run_benchmark(conn,
+            f"bench_locking('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_prefetch_pipeline...")
+        results['read'] = run_benchmark(conn,
+            f"bench_prefetch_pipeline('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_resowner...")
+        results['resowner'] = run_benchmark(conn,
+            f"bench_resowner(b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+    save_results(args.results_dir, args.name, results, args.title)
+    plot_results(args.results_dir, args.name, results, args.title)
+
+    print(f"\nDone! Results saved to {args.results_dir}/{args.name}.*")
+
+
+if __name__ == '__main__':
+    main()
diff --git a/src/test/modules/test_buffer_pin/requirements.txt b/src/test/modules/test_buffer_pin/requirements.txt
new file mode 100644
index 00000000000..e73ab569275
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/requirements.txt
@@ -0,0 +1,9 @@
+# Requirements for benchmark.py
+# pip install -r requirements.txt
+# not enforcing versions, as it might simply
+# work with your installed versions
+numpy            # tested with 1.24+
+pandas           # tested with 2.0+
+sqlalchemy       # tested with 2.0+
+matplotlib       # tested with 3.7+
+psycopg2-binary  # tested with 2.9+
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
new file mode 100644
index 00000000000..2fb5577fac2
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
@@ -0,0 +1,88 @@
+/* test_buffer_pin--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_buffer_pin" to load this file. \quit
+
+CREATE FUNCTION bench_prefetch_pipeline(
+    relname text,
+    prefetch_dist int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_prefetch_pipeline'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_prefetch_pipeline IS
+'Benchmark buffer pinning with sliding window prefetch simulation.
+Arguments:
+  relname        - name of the relation to use
+  prefetch_dist  - number of buffers to keep pinned (sliding window size)
+  num_ops        - number of pin/unpin operations to perform
+  iterations     - number of times to repeat the benchmark
+  random_access  - if true, access blocks randomly; if false, sequentially
+';
+
+CREATE FUNCTION bench_pinning(
+    relname text,
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_pinning'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_pinning IS
+'Benchmark pure pin/unpin operations without buffer lookup.
+Uses IncrBufferRefCount/ReleaseBuffer on pre-pinned buffers to isolate
+refcount tracking overhead from buffer table lookups and I/O.
+Arguments:
+  relname       - name of the relation to use for pinning buffers
+  num_buffers   - number of buffers to keep as base pins
+  num_ops       - number of pin/unpin pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access buffers randomly; if false, sequentially';
+
+CREATE FUNCTION bench_locking(
+    relname text,
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_locking'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_locking IS
+'Benchmark buffer lock/unlock operations.
+Pre-pins all blocks then times LockBuffer/UnlockBuffer cycles over
+different buffers to measure locking overhead separately from pinning.
+Arguments:
+  relname       - name of the relation to use
+  num_buffers   - sliding window size for locks
+  num_ops       - number of lock/unlock pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access buffers randomly; if false, sequentially';
+
+CREATE FUNCTION bench_resowner(
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_resowner'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_resowner IS
+'Benchmark ResourceOwner remember/forget operations only.
+Uses fake resource values - no actual resources are tracked.
+Arguments:
+  num_buffers   - number of fake resources to track
+  num_ops       - number of remember/forget pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access resources randomly; if false, sequentially';
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.c b/src/test/modules/test_buffer_pin/test_buffer_pin.c
new file mode 100644
index 00000000000..d5e59974dcb
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.c
@@ -0,0 +1,421 @@
+/*
+ * test_buffer_pin.c - Buffer pinning benchmark
+ */
+
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/namespace.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+#include "utils/resowner.h"
+#include "utils/varlena.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(bench_prefetch_pipeline);
+PG_FUNCTION_INFO_V1(bench_pinning);
+PG_FUNCTION_INFO_V1(bench_locking);
+PG_FUNCTION_INFO_V1(bench_resowner);
+
+/* Custom ResourceOwnerDesc for benchmark - does nothing on release */
+static void bench_release_resource(Datum res) { /* no-op */ }
+
+static const ResourceOwnerDesc bench_resowner_desc = {
+	.name = "BenchmarkResource",
+	.release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+	.release_priority = RELEASE_PRIO_FIRST,
+	.ReleaseResource = bench_release_resource,
+	.DebugPrint = NULL,
+};
+
+/*
+ * Generate an access sequence for benchmarking.
+ * If random_access is true, uses Fisher-Yates shuffle.
+ * Otherwise, generates sequential pattern modulo num_items.
+ */
+static void
+generate_access_sequence(int *sequence, int num_operations, int num_items, bool random_access)
+{
+	for (int i = 0; i < num_operations; i++)
+		sequence[i] = i % num_items;
+
+	if (random_access)
+	{
+		for (int i = num_operations - 1; i > 0; i--)
+		{
+			int j = pg_prng_uint64_range(&pg_global_prng_state, 0, i);
+			int tmp = sequence[i];
+			sequence[i] = sequence[j];
+			sequence[j] = tmp;
+		}
+	}
+}
+
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Warms up the cache by reading all blocks in the relation.
+ * Precompute a block sequence in which the buffers will be read.
+ * Then scan times repeatedly a scan following the block sequence.
+ * with a fixed distance of `num_buffers` (plus ramp up and ramp down)
+*/
+Datum
+bench_prefetch_pipeline(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *pipeline;
+	int		   *block_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (nblocks == 0)
+		ereport(ERROR, (errmsg("relation has no blocks")));
+
+	pipeline = palloc0(num_buffers * sizeof(Buffer));
+	block_sequence = palloc(num_operations * sizeof(int));
+
+	/* Warm up */
+	for (int i = 0; i < num_operations && i < (int) nblocks; i++)
+	{
+		Buffer buf = ReadBuffer(rel, i);
+		ReleaseBuffer(buf);
+	}
+	generate_access_sequence(block_sequence, num_operations, nblocks, random_access);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+			if (i < num_operations)
+				pipeline[i % num_buffers] = ReadBuffer(rel, block_sequence[i]);
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(pipeline);
+	pfree(block_sequence);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Pre-reads buffers into cache, then times IncrBufferRefCount/ReleaseBuffer
+ * cycles in a pipelined fashion. This is intended to separate the
+ * pin count dependent part of the ReadBuffer/ReleaseBuffer operations
+ */
+Datum
+bench_pinning(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *base_buffers;
+	Buffer	   *pipeline;
+	int		   *access_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if ((BlockNumber) num_buffers > nblocks)
+		ereport(ERROR, (errmsg("not enough blocks in relation")));
+
+	base_buffers = palloc(num_buffers * sizeof(Buffer));
+	pipeline = palloc(num_buffers * sizeof(Buffer));
+	access_sequence = palloc(num_operations * sizeof(int));
+
+	generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+	/* Pin the buffers as base pins (keeps them in cache) */
+	for (int i = 0; i < num_buffers; i++)
+		base_buffers[i] = ReadBuffer(rel, i);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+			if (i < num_operations)
+			{
+				Buffer buf = base_buffers[access_sequence[i]];
+				IncrBufferRefCount(buf);
+				pipeline[i % num_buffers] = buf;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/* Release base pins */
+	for (int i = 0; i < num_buffers; i++)
+		ReleaseBuffer(base_buffers[i]);
+
+	pfree(access_sequence);
+	pfree(pipeline);
+	pfree(base_buffers);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_locking - benchmark buffer lock/unlock operations
+ *
+ * Pre-pins all blocks to keep them in cache, then times LockBuffer/UnlockBuffer
+ * cycles in a pipelined fashion over different buffers. This measures locking
+ * overhead separately from pinning, while accessing many different buffers.
+ */
+Datum
+bench_locking(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *base_buffers;
+	Buffer	   *pipeline;
+	int		   *block_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (nblocks == 0)
+		ereport(ERROR, (errmsg("relation has no blocks")));
+
+	base_buffers = palloc(nblocks * sizeof(Buffer));
+	pipeline = palloc(num_buffers * sizeof(Buffer));
+	block_sequence = palloc(num_operations * sizeof(int));
+
+	/* Generate access pattern over all blocks */
+	generate_access_sequence(block_sequence, num_operations, nblocks, random_access);
+
+	/* Pin ALL blocks as base pins to keep them in cache */
+	for (BlockNumber i = 0; i < nblocks; i++)
+		base_buffers[i] = ReadBuffer(rel, i);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				LockBuffer(pipeline[(i - num_buffers) % num_buffers], BUFFER_LOCK_UNLOCK);
+			if (i < num_operations)
+			{
+				Buffer		buf = base_buffers[block_sequence[i]];
+
+				LockBuffer(buf, BUFFER_LOCK_SHARE);
+				pipeline[i % num_buffers] = buf;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/* Release all base pins */
+	for (BlockNumber i = 0; i < nblocks; i++)
+		ReleaseBuffer(base_buffers[i]);
+
+	pfree(block_sequence);
+	pfree(pipeline);
+	pfree(base_buffers);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_resowner - benchmark ResourceOwner remember/forget operations
+ *
+ * Uses fake resource values to test pure ResourceOwner tracking overhead
+ * without any actual resource operations. Uses same pipelined structure.
+ */
+Datum
+bench_resowner(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	int			num_buffers = PG_GETARG_INT32(0);
+	int			num_operations = PG_GETARG_INT32(1);
+	int			iterations = PG_GETARG_INT32(2);
+	bool		random_access = PG_GETARG_BOOL(3);
+
+	Datum	   *resources;
+	Datum	   *pipeline;
+	int		   *access_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	resources = palloc(num_buffers * sizeof(Datum));
+	pipeline = palloc(num_buffers * sizeof(Datum));
+	access_sequence = palloc(num_operations * sizeof(int));
+
+	/* Use fake resource values (pointers to array elements for uniqueness) */
+	for (int i = 0; i < num_buffers; i++)
+		resources[i] = PointerGetDatum(&resources[i]);
+
+	generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ResourceOwnerForget(CurrentResourceOwner,
+									pipeline[(i - num_buffers) % num_buffers],
+									&bench_resowner_desc);
+			if (i < num_operations)
+			{
+				Datum		res = resources[access_sequence[i]];
+				ResourceOwnerEnlarge(CurrentResourceOwner);
+				ResourceOwnerRemember(CurrentResourceOwner, res, &bench_resowner_desc);
+				pipeline[i % num_buffers] = res;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(access_sequence);
+	pfree(pipeline);
+	pfree(resources);
+
+	return (Datum) 0;
+}
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.control b/src/test/modules/test_buffer_pin/test_buffer_pin.control
new file mode 100644
index 00000000000..f0659417127
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.control
@@ -0,0 +1,4 @@
+comment = 'test_buffer_pin - buffer pinning benchmarks'
+default_version = '1.0'
+module_pathname = '$libdir/test_buffer_pin'
+relocatable = true
-- 
2.53.0



  [application/octet-stream] v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch (6.5K, 7-v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch)
  download | inline diff:
From 85abc829474a4b0f40461af169dd0803e2e22c88 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 17:46:38 +0000
Subject: [PATCH 5/5] REFCOUNT_ARRAY_ENTRIES=0

The simple hash performance is is fairly close to the direct array.
For few pins we are trading one small for loop for an index calculation
(buffer % N), this could be a (buffer & (N-1)) if we restrict the
simple array to use powers of 2 sizes.

For more than REFCOUNT_ARRAY_ENTRIES on distinct buffers pinned/unpinned in
a FIFO fashion, this is a strict improvement as every pin requires a hash
table operation.

---
 src/backend/storage/buffer/buf_refcount.c     |   60 +-
 src/include/storage/buf_refcount.h            |    3 +
 2 files changed

diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
index 29dfb720997..90cb42edbb5 100644
--- a/src/backend/storage/buffer/buf_refcount.c
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -88,17 +88,18 @@ struct PrivateRefCountIterator
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
-/* Private refcount array and keys */
-#define REFCOUNT_ARRAY_ENTRIES 8
+
+/*
+ * Private refcount array and keys
+ * If set to 0, all the code handling the transfers between the array
+ * and the hash table is disabled at compilation time.
+*/
+#define REFCOUNT_ARRAY_ENTRIES 0
+
+#if REFCOUNT_ARRAY_ENTRIES > 0
 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
 static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
 
-/* Overflow hash table for when array is full */
-static refcount_hash *PrivateRefCountHash = NULL;
-
-/* Count of entries that have overflowed into the hash table */
-static int32 PrivateRefCountOverflowed = 0;
-
 /* Clock hand for selecting victim when array is full */
 static uint32 PrivateRefCountClock = 0;
 
@@ -107,14 +108,23 @@ static int	ReservedRefCountSlot = -1;
 
 /* Cache for last accessed entry */
 static int	PrivateRefCountEntryLast = -1;
+#endif
 
 /* Advisory limit on the number of pins each backend should hold */
 static uint32 MaxProportionalPins = 0;
 
+/* Hash table (overflow when array used, primary when hash-only) */
+static refcount_hash *PrivateRefCountHash = NULL;
+
+/* Count of entries in the hash table */
+static int32 PrivateRefCountOverflowed = 0;
+
+#if REFCOUNT_ARRAY_ENTRIES > 0
 /* Forward declarations */
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
+#endif
 
 /*
  * Initialize private refcount tracking for this backend.
@@ -130,12 +140,15 @@ InitPrivateRefCount(void)
 	 * GetAdditionalPinLimit() can be used to check the remaining balance.
 	*/
 	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+#if REFCOUNT_ARRAY_ENTRIES > 0
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
+#endif
 
 	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
 }
 
+#if REFCOUNT_ARRAY_ENTRIES > 0
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
  * entry.
@@ -233,7 +246,9 @@ NewPrivateRefCountEntry(Buffer buffer)
 
 	return res;
 }
+#endif /* REFCOUNT_ARRAY_ENTRIES > 0 */
 
+#if REFCOUNT_ARRAY_ENTRIES > 0
 /*
  * Slow-path for GetSharedBufferEntry().
  */
@@ -270,6 +285,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer)
 	res = refcount_lookup(PrivateRefCountHash, buffer);
 	return res;
 }
+#endif /* REFCOUNT_ARRAY_ENTRIES > 0 */
 
 /*
  * Return the PrivateRefCount entry for the passed buffer.
@@ -281,6 +297,7 @@ GetSharedBufferEntry(Buffer buffer)
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
+#if REFCOUNT_ARRAY_ENTRIES > 0
 	/* Fast path: check one-entry cache */
 	if (likely(PrivateRefCountEntryLast != -1) &&
 		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
@@ -289,6 +306,10 @@ GetSharedBufferEntry(Buffer buffer)
 	}
 
 	return GetPrivateRefCountEntrySlow(buffer);
+#else
+	/* Hash-only mode: direct lookup */
+	return refcount_lookup(PrivateRefCountHash, buffer);
+#endif
 }
 
 /*
@@ -308,10 +329,20 @@ SharedBufferRef(Buffer buffer)
 
 	if (ref == NULL)
 	{
-		/* New pin - create entry */
+#if REFCOUNT_ARRAY_ENTRIES > 0
+		/* New pin - create entry in array */
 		ReservePrivateRefCountEntry();
 		ref = NewPrivateRefCountEntry(buffer);
 		ref->data = ONE_PRIVATE_REFERENCE;
+#else
+		/* Hash-only mode: insert directly */
+		bool		found;
+
+		ref = refcount_insert(PrivateRefCountHash, buffer, &found);
+		Assert(!found);
+		ref->data = ONE_PRIVATE_REFERENCE;
+		PrivateRefCountOverflowed++;
+#endif
 	}
 	else
 	{
@@ -352,6 +383,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		/* No more references - clean up the entry */
 		Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
 
+#if REFCOUNT_ARRAY_ENTRIES > 0
 		if (ref >= &PrivateRefCountArray[0] &&
 			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
 		{
@@ -360,6 +392,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 			ReservedRefCountSlot = ref - PrivateRefCountArray;
 		}
 		else
+#endif
 		{
 			/* could make slightly more efficient by using the pointer */
 			refcount_delete(PrivateRefCountHash, ref->buffer);
@@ -409,11 +442,11 @@ CheckPrivateRefCountLeaks(void)
 #ifdef USE_ASSERT_CHECKING
 	int			RefCountErrors = 0;
 	PrivateRefCountEntry *res;
-	int			i;
 	char	   *s;
 
+#if REFCOUNT_ARRAY_ENTRIES > 0
 	/* check the array */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
 	{
 		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
 		{
@@ -426,8 +459,9 @@ CheckPrivateRefCountLeaks(void)
 			RefCountErrors++;
 		}
 	}
+#endif
 
-	/* if necessary search the hash */
+	/* search the hash */
 	if (PrivateRefCountOverflowed)
 	{
 		refcount_iterator iter;
@@ -467,6 +501,7 @@ InitPrivateRefCountIterator(void)
 PrivateRefCountEntry *
 GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 {
+#if REFCOUNT_ARRAY_ENTRIES > 0
 	/* First iterate through the array */
 	while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
 	{
@@ -475,8 +510,9 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
 			return &PrivateRefCountArray[idx];
 	}
+#endif
 
-	/* Then iterate through the hash if there are overflowed entries */
+	/* Then iterate through the hash if there are entries */
 	if (!iter->in_hash)
 	{
 		iter->in_hash = true;
-- 
2.53.0



  [text/x-sh] run-all.sh (2.7K, 8-run-all.sh)
  download | inline:
#!/bin/bash
set -e

SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
PATCH_DIR="$SCRIPT_DIR/"
REPO_ROOT=$(cd "$SCRIPT_DIR/../.." && pwd)
BUILDS_DIR="$REPO_ROOT/.builds"
DBDATA_DIR="$REPO_ROOT/.dbdata"
RESULTS_DIR="$REPO_ROOT/src/test/modules/test_buffer_pin/results"
BENCHMARK_DIR="$REPO_ROOT/src/test/modules/test_buffer_pin"
mkdir -p $BENCHMARK_DIR/results/
PORT=5434

cd "$REPO_ROOT"

#AM_OPTS="--whitespace=nowarn --3way"
AM_OPTS=""

function clean_repo() {
    git reset --hard
#    git clean -fd
}

function build_and_test() {
    local NAME="$1"
    local TITLE="$2"
    local PREFIX="$BUILDS_DIR/$NAME"
    
    echo ""
    echo "============================================================"
    echo "Building: $NAME"
    echo "============================================================"
    
    # Configure and build
    make distclean >/dev/null 2>&1 || true
    ./configure --prefix="$PREFIX" --without-icu --without-readline --without-zlib >/dev/null
    make -j8 -s
    make install -s
    
    # Build and install extension
    cd "$BENCHMARK_DIR"
    make clean -s 2>/dev/null || true
    make PG_CONFIG="$PREFIX/bin/pg_config" USE_PGXS=1 install -s
    cd "$REPO_ROOT"
    
    echo ""
    echo "============================================================"
    echo "Testing: $NAME"
    echo "============================================================"
    
    # Stop any running server
    "$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" stop 2>/dev/null || true
    sleep 1
    
    # Initialize fresh database
    rm -rf "$DBDATA_DIR"
    "$PREFIX/bin/initdb" -D "$DBDATA_DIR" >/dev/null
    
    # Start server
    "$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" -l "$REPO_ROOT/logfile_$NAME" -o "-p $PORT" start
    sleep 2
    
    # Run benchmark
    mkdir -p "$RESULTS_DIR"
    cd "$BENCHMARK_DIR"
    python3 benchmark.py --port $PORT --name "$NAME" --title "$TITLE" \
        --max-dist 10000 --points 200 --samples 50 --iterations 20000
    cd "$REPO_ROOT"
    
    # Stop server
    "$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" stop
    
    echo "Results saved to: $RESULTS_DIR/$NAME.{svg,csv,txt}"
}

# # Master with patches applied incrementally
git checkout -B pins origin/master
# clean_repo

# Then apply each numbered patch and benchmark after each
for patch in "$PATCH_DIR"/*.patch; do
    [ -f "$patch" ] || continue
    
    # Extract patch number (e.g., 0001 -> 1)
    patchnum=$(basename "$patch" | cut -c1-4 | sed 's/^0*//')
    patchname=$(basename "$patch" .patch)
    
    git am $AM_OPTS "$patch"
    build_and_test "patch-$patchnum" "Patch $patchnum: $patchname"
done

echo ""
echo "============================================================"
echo "All benchmarks complete!"
echo "Results in: $RESULTS_DIR/"
echo "============================================================"

  [text/x-python-script] compare-patches.py (3.4K, 9-compare-patches.py)
  download | inline:
#!/usr/bin/env python3
"""Generate comparison charts for step*.csv benchmark results."""

import pandas as pd
import matplotlib.pyplot as plt
import glob
import os

RESULTS_DIR = os.path.dirname(os.path.abspath(__file__)) + '/results'

# Find all step*.csv files
csv_files = sorted(glob.glob(f'{RESULTS_DIR}/*.csv'))

# Filter to main steps only
main_steps = [f'patch-{i}' for i in [1, 2, 3, 4, 5]]
csv_files = [f for f in csv_files if any(s in f for s in main_steps)]

print(f"Found {len(csv_files)} step CSV files:")
for f in csv_files:
    print(f"  - {os.path.basename(f)}")

# Load all data as pivoted tables
all_data = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    step_name = os.path.basename(csv_file).replace('.csv', '')
    
    pivot = df.pivot_table(index=['pattern', 'num_buffers'], 
                           columns='benchmark', values='median_ns').reset_index()
    pivot['step'] = step_name
    
    # Compute derived metrics
    if 'read' in pivot.columns and 'resowner' in pivot.columns:
        pivot['read-resowner'] = pivot['read'] - pivot['resowner']
    
    all_data.append(pivot)

data = pd.concat(all_data, ignore_index=True)

# Define nice labels
step_labels = {
    'v4-pg18-baseline': 'PostgreSQL 18 Baseline',
    'patch-1': 'Patch 1: Pg19 Baseline',
    'patch-2': 'Patch 2: Refactoring',
    'patch-3': 'Patch 3: Simple hash',
    'patch-4': 'Patch 4: Comopact entry',
    'patch-5': 'Patch 5: No array',
}

# Color scheme
colors = {k: c for k, c in zip(
    step_labels.keys(), 
    ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
)}

def plot_benchmark(benchmark_name, title, output_name):
    """Plot comparison for a specific benchmark."""
    if benchmark_name not in data.columns:
        print(f"Skipping {benchmark_name}: column not found")
        return
    
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    fig.suptitle(title, fontsize=14, fontweight='bold')
    
    for idx, pattern in enumerate(['sequential', 'random']):
        ax = axes[idx]
        ax.set_title(f'{pattern.capitalize()} Access Pattern')
        ax.set_xlabel('Number of Buffers')
        ax.set_ylabel('Time (ns)')
        ax.set_xscale('log')
        ax.grid(True, alpha=0.3)
        
        for step in main_steps:
            subset = data[(data['pattern'] == pattern) & (data['step'] == step)]
            if len(subset) > 0 and benchmark_name in subset.columns:
                ax.plot(subset['num_buffers'], subset[benchmark_name], 
                       label=step_labels.get(step, step),
                       color=colors.get(step, 'gray'),
                       linewidth=2, marker='o', markersize=3)
        
        ax.legend(loc='upper left', fontsize=9)
    
    plt.tight_layout()
    output_path = f'{RESULTS_DIR}/{output_name}.svg'
    plt.savefig(output_path, format='svg', bbox_inches='tight')
    plt.savefig(output_path.replace('.svg', '.png'), format='png', dpi=150, bbox_inches='tight')
    print(f"Saved: {output_path}")
    plt.close()

# Generate three comparison figures
plot_benchmark('read-resowner', 'Read/Release excluding Resowner', 'compare-read-resowner')
plot_benchmark('read', 'Read/Release Performance Comparison', 'compare-read')
plot_benchmark('resowner', 'ResourceOwner Performance Comparison', 'compare-resowner')
plot_benchmark('pinning', 'Pin/Unpin Performance Comparison', 'compare-pinning')
plot_benchmark('locking', 'Lock/Unlock Performance Comparison', 'compare-locking')

print("\nDone! Generated comparison charts.")

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Addressing buffer private reference count scalability issue
@ 2026-03-08 17:09  Andres Freund <[email protected]>
  parent: Alexandre Felipe <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Andres Freund @ 2026-03-08 17:09 UTC (permalink / raw)
  To: Alexandre Felipe <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Hi,

On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote:
>    2.
> 
>    Refactoring reference counting: Before starting to change code and
>    potentially breaking things I considered prudent to isolate it to limit the
>    damage. This code was part of a 8k+ LOC file.

Not allowing, at least the fast paths, to be inlined will make the most common
cases measurably slower, I've tried that.


>    3.
> 
>    Using simplehash: Simply replacing the HTAB for a simplehash, and adding
>    a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE.

Yea, we'll need to that. Peter Geoghegan has a patch for it as well.


> Here I assume that the buffer buffer sequence is independent enough from the
> array size, so I use the buffer as the hash key directly, omitting a hash
> function call.

I doubt that that's good enough. The hash table code relies on the bits being
well mixed, but you won't get that with buffer ids.


>    4.
> 
>    Compact PrivateRefCountEntry: The original implementation used a 4-byte
>    key and 8-byte value. Reference count uses 32 bits, and it is unreasonable
>    to expect one backend to pin the same buffer 1 billion times. The lock mode
>    uses 32 bits but can only assume 4 values. So I packed them in one single
>    uint32, giving 30 bits for count and 2 bits for lock mode. This makes the
>    entries 8-byte long, on 64-bit CPUs this represents more than a 1/3
>    reduction in memory. This makes the array aligned with the 64-bit words,
>    copying one entry can be completed in one instruction, and every entry will
>    be aligned.
>    5.

I'm rather sceptical that the overhead of needing to shift and mask is worth
it. I suspect we'll also want to add more state for each entry (e.g. I think
it may be worth getting rid of the io-in-progress resowner).



>    REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
>    lookup, it is worth trying to remove it completely and keep only the hash.
>    For small values we would be trading a few branches for a buffer % SIZE,
>    for the use case of prefetch where pin/unpin in a FIFO fashion, it will
>    save an 8-entry array lookup, and some extra data moves.

I doubt that that's ok, in the vast majority of access we will have 0-2
buffers pinned. And even when we have pinned more buffers, it's *exceedingly*
common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin),
adding a few cycles to each of those repeated accesses will quickly show up.


> From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Fri, 6 Mar 2026 17:15:44 +0000
> Subject: [PATCH 4/5] Compact PrivateRefCountEntry
> 
> The previous implementation used an 8-bite (64-bit) entry to store
> a uint32 count and an uint32 lock mode. That is fine when we store
> the data separate from the key (buffer). But in the simplehash
> {key, value} are stored together, so each entry is 12-bytes.
> This is somewhat awkward as we have to either pad the entry to 16-bytes,
> or the access will be an alternating aligned/misaligned addreses.
> 
> However, we are probably not going to need even 16-bits for the count
> and 2-bits is enough for the lock mode. So we can easily pack these
> int to a single uint32.

I wouldn't want to rely on a 16bit pin counter anyway.


> Incrementing/decrementing the count continue the same, just using
> 4 instead of 1, lock mode access will require one or two additional
> bitwise operations.
> 
> No bit-shifts are required.

I don't know how that last sentence can be true, given that:

> -struct PrivateRefCountEntry
> +#define PRIVATEREFCOUNT_LOCKMODE_MASK	0x3
> +#define ONE_PRIVATE_REFERENCE			4	/* 1 << 2 */
> +
> +#define PrivateRefCountGetLockMode(d)	((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
> +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
> +#define PrivateRefCountGetRefCount(d)	((int32)((d) >> 2))
> +#define PrivateRefCountIsZero(d)		((d) < ONE_PRIVATE_REFERENCE)

Involves shifts at least in PrivateRefCountGetRefCount()..


Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Addressing buffer private reference count scalability issue
@ 2026-03-11 18:03  Alexandre Felipe <[email protected]>
  parent: Andres Freund <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Alexandre Felipe @ 2026-03-11 18:03 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Hi Hackers,

Please find attached version 2 of this patch and the results of the
ReadBuffer/ReleaseBuffer benchmark

For those who prefer plain text here goes an excerpt
                        baseline   patch
num_buffers pattern
1           random        194.76  183.50
            sequential    183.75  168.02
2           random        204.74  187.02
            sequential    195.27  169.58
3           random        222.62  192.57
            sequential    212.81  169.55
7           random        233.96  219.64
            sequential    221.94  171.94
11          random        365.50  247.96
            sequential    318.24  183.96
19          random        390.58  267.96
            sequential    369.51  195.35
100         random        424.03  286.33
            sequential    420.74  280.23
988         random        445.75  301.37
            sequential    453.61  313.30
10000       random        587.13  418.91
            sequential    497.92  394.10

On Sun, Mar 8, 2026 at 5:09 PM Andres Freund <[email protected]> wrote:

> Hi,
>
> On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote:
> >    2.
> >
> >    Refactoring reference counting: Before starting to change code and
> >    potentially breaking things I considered prudent to isolate it to
> limit the
> >    damage. This code was part of a 8k+ LOC file.
>
> Not allowing, at least the fast paths, to be inlined will make the most
> common
> cases measurably slower, I've tried that.
>

I made it inline in this time. I placed in a separate file, I am including
the
.h at the top and the .c


> > Here I assume that the buffer buffer sequence is independent enough from
> the
> > array size, so I use the buffer as the hash key directly, omitting a hash
> > function call.
>
> I doubt that that's good enough. The hash table code relies on the bits
> being
> well mixed, but you won't get that with buffer ids.
>

Introduced murmurhash32, it is noticeably worse for sequential access
(by buffer index), but has some mixing.


> I'm rather sceptical that the overhead of needing to shift and mask is
> worth
> it. I suspect we'll also want to add more state for each entry (e.g. I
> think
> it may be worth getting rid of the io-in-progress resowner).
>

io-in-progress resowner name suggests that it runs only for reads...
so maybe it could be taken out of the way of buffer hits?


> >    REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
> >    lookup, it is worth trying to remove it completely and keep only the
> hash.
> >    For small values we would be trading a few branches for a buffer %
> SIZE,
> >    for the use case of prefetch where pin/unpin in a FIFO fashion, it
> will
> >    save an 8-entry array lookup, and some extra data moves.
>
> I doubt that that's ok, in the vast majority of access we will have 0-2
> buffers pinned. And even when we have pinned more buffers, it's
> *exceedingly*
> common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin),
> adding a few cycles to each of those repeated accesses will quickly show
> up.
>

I brought back the array, but I eliminated the linear search.

1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.

2a. the dynamic array case
REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
will grow the array when it reaches a certain level of occupation.
I have set the default occupation level to 86% so that, if enabled, for a
random input it will grow when we have about 2*size pins in total.
If we find a sequential pattern then it will grow without growing the hash
table.
For the array lookup I don't use a hash, so for small number of pins
it will be very fast.

2b. the static case
REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
will use a static array, just as we had before and will not perform the
linear
search. It still has to read the size and do mask input.

I tested the 4 variations and the winner is with the static array without
the cache for the last entry.
I increased the array size from 8 to 32, since you suggested before that
that this could help. At that point it would have the tradeoff of a longer
linear search, so it may help even more now.

> Incrementing/decrementing the count continue the same, just using
> > 4 instead of 1, lock mode access will require one or two additional
> > bitwise operations.
> >
> > No bit-shifts are required.
>
> I don't know how that last sentence can be true, given that:
>
> > -struct PrivateRefCountEntry
> > +#define PRIVATEREFCOUNT_LOCKMODE_MASK        0x3
> > +#define ONE_PRIVATE_REFERENCE                        4       /* 1 << 2
> */
> > +
> > +#define PrivateRefCountGetLockMode(d)        ((BufferLockMode)((d) &
> PRIVATEREFCOUNT_LOCKMODE_MASK))
> > +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) &
> ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
> > +#define PrivateRefCountGetRefCount(d)        ((int32)((d) >> 2))
> > +#define PrivateRefCountIsZero(d)             ((d) <
> ONE_PRIVATE_REFERENCE)
>
> Involves shifts at least in PrivateRefCountGetRefCount()..
>

Yes, there was some unintended code there.
I am adding SharedBufferHasSingleRef to replace the (...) == 1 checks.
The actual count is used only for debugging, so the shift there shouldn't
be a problem.

Regards,
Alexandre


Attachments:

  [image/png] compare-read.png (156.2K, 3-compare-read.png)
  download | view image

  [application/x-patch] v2-0004-Compact-PrivateRefCountEntry.patch (12.8K, 4-v2-0004-Compact-PrivateRefCountEntry.patch)
  download | inline diff:
From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:08:03 +0000
Subject: [PATCH 4/5] Compact PrivateRefCountEntry

The previous implementation used an 8-bytes (64-bit) entry to store
a uint32 count and an uint32 lock mode. That is fine when we store
the data separate from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12-bytes.
This is somewhat awkward as we have to either pad the entry to 16-bytes,
or the access will be an alternating aligned/misaligned addreses.

Lock can assume only 4 values, and 2^30 is a decent limit for the
number of pins on a single buffer. So this change is packing the
{count[31:2], lock[1:0]} into a single uint32.

Incrementing/decrementing the count continue the same, just using
4 instead of 1, lock mode access will require one or two additional
bitwise operations. The exact count requires one shift, and is used
only for debugging. A special function is provided to check whether
count == 1.


No bit-shifts are required.
---
 src/backend/storage/buffer/bufmgr_refcnt.c | 186 +++++++++------------
 src/backend/storage/buffer/bufmgr_refcnt.h |   2 +-
 2 files changed, 76 insertions(+), 112 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
index dd95d953f6..bf84e9f62e 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.c
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -47,47 +47,45 @@
 #include "bufmgr_refcnt.h"
 #include "storage/proc.h"
 
-/* Structure definitions - internal to this file */
-typedef struct PrivateRefCountData
+/*
+ * Implementation details - opaque to callers.
+ * Packed refcount and lockmode in a single uint32:
+ *   Bits 0-1:  lockmode (4 values: UNLOCK, SHARE, SHARE_EXCLUSIVE, EXCLUSIVE)
+ *   Bits 2-31: refcount (30 bits, max ~1 billion)
+ */
+struct PrivateRefCountEntry
 {
-	int32		refcount;
-	BufferLockMode lockmode;
-} PrivateRefCountData;
+	Buffer		buffer;
+	uint32		data;
+};
 
-struct PrivateRefCountEntry
+#define PRIVATEREFCOUNT_LOCKMODE_MASK	0x3
+#define ONE_PRIVATE_REFERENCE			4	/* 1 << 2 */
+
+#define PrivateRefCountIsZero(d)		((d) < ONE_PRIVATE_REFERENCE)
+
+struct PrivateRefCountIterator
 {
-	Buffer		buffer;		/* hash key - InvalidBuffer means empty */
-	PrivateRefCountData data;
+	int			array_index;
+	bool		in_hash;
+	void	   *hash_status;
 };
 
-/*
- * Define simplehash parameters for the overflow hash table.
- * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
- * a separate status field.
- */
+/* Define simplehash for private refcount overflow hash table */
 #define SH_PREFIX refcount
 #define SH_ELEMENT_TYPE PrivateRefCountEntry
 #define SH_KEY_TYPE Buffer
 #define SH_KEY buffer
-#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_HASH_KEY(tb, key) ((uint32) murmurhash32(key))
 #define SH_EQUAL(tb, a, b) ((a) == (b))
 #define SH_SCOPE static inline
-#define SH_DEFINE
-#define SH_DECLARE
-/* Use buffer field as empty marker - no separate status needed */
 #define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
 #define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
-#define SH_MAKE_IN_USE(entry) ((void)0)  /* key assignment handles this */
+#define SH_MAKE_IN_USE(entry) ((void) 0)
+#define SH_DECLARE
+#define SH_DEFINE
 #include "lib/simplehash.h"
 
-struct PrivateRefCountIterator
-{
-	int			array_index;
-	bool		in_hash;
-	refcount_iterator hash_iter;
-	bool		hash_iter_valid;
-};
-
 /* Private refcount array and keys */
 #define REFCOUNT_ARRAY_ENTRIES 8
 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
@@ -112,9 +110,9 @@ static int	PrivateRefCountEntryLast = -1;
 static uint32 MaxProportionalPins = 0;
 
 /* Forward declarations */
-static void ReservePrivateRefCountEntry(void);
+static void ReservePrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
 
 /*
  * Initialize private refcount tracking for this backend.
@@ -133,7 +131,7 @@ InitPrivateRefCount(void)
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
+	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
 }
 
 /*
@@ -141,15 +139,15 @@ InitPrivateRefCount(void)
  * entry.
  */
 static void
-ReservePrivateRefCountEntry(void)
+ReservePrivateRefCountEntry(Buffer buffer)
 {
 	/* Already reserved (or freed), nothing to do */
 	if (ReservedRefCountSlot != -1)
 		return;
 
 	/*
-	 * First search for a free entry the array, that'll be sufficient in the
-	 * majority of cases.
+	 * First search for a free entry in the array, that'll be sufficient in
+	 * the majority of cases.
 	 */
 	{
 		int			i;
@@ -159,11 +157,9 @@ ReservePrivateRefCountEntry(void)
 			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
 			{
 				ReservedRefCountSlot = i;
+				return;
 			}
 		}
-
-		if (ReservedRefCountSlot != -1)
-			return;
 	}
 
 	/*
@@ -196,10 +192,7 @@ ReservePrivateRefCountEntry(void)
 		/* clear the now free array slot */
 		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
 		victim_entry->buffer = InvalidBuffer;
-
-		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
-		victim_entry->data.refcount = 0;
-		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+		victim_entry->data = 0;
 
 		PrivateRefCountOverflowed++;
 	}
@@ -222,8 +215,7 @@ NewPrivateRefCountEntry(Buffer buffer)
 	/* and fill it */
 	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
 	res->buffer = buffer;
-	res->data.refcount = 0;
-	res->data.lockmode = BUFFER_LOCK_UNLOCK;
+	res->data = 0;
 
 	/* update cache for the next lookup */
 	PrivateRefCountEntryLast = ReservedRefCountSlot;
@@ -237,10 +229,9 @@ NewPrivateRefCountEntry(Buffer buffer)
  * Slow-path for GetSharedBufferEntry().
  */
 static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+GetPrivateRefCountEntrySlow(Buffer buffer)
 {
 	PrivateRefCountEntry *res;
-	int			match = -1;
 	int			i;
 
 	/*
@@ -250,16 +241,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	{
 		if (PrivateRefCountArrayKeys[i] == buffer)
 		{
-			match = i;
+			PrivateRefCountEntryLast = i;
+			return &PrivateRefCountArray[i];
 		}
 	}
 
-	if (likely(match != -1))
-	{
-		PrivateRefCountEntryLast = match;
-		return &PrivateRefCountArray[match];
-	}
-
 	/*
 	 * Only look up the buffer in the hashtable if we've previously overflowed.
 	 */
@@ -267,41 +253,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 		return NULL;
 
 	res = refcount_lookup(PrivateRefCountHash, buffer);
-
-	if (res == NULL)
-		return NULL;
-	else if (!do_move)
-	{
-		return res;
-	}
-	else
-	{
-		/* move buffer from hashtable into the free array slot */
-		PrivateRefCountEntry *free;
-
-		ReservePrivateRefCountEntry();
-
-		Assert(ReservedRefCountSlot != -1);
-		free = &PrivateRefCountArray[ReservedRefCountSlot];
-		Assert(free->buffer == InvalidBuffer);
-
-		free->buffer = buffer;
-		free->data = res->data;
-		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-		PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-		ReservedRefCountSlot = -1;
-
-		refcount_delete_item(PrivateRefCountHash, res);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-
-		return free;
-	}
+	return res;
 }
 
 /*
- * Return the PrivateRefCountEntry for the passed buffer.
+ * Return the PrivateRefCount entry for the passed buffer.
  * Returns NULL if the buffer is not currently pinned.
  */
 static PrivateRefCountEntry *
@@ -317,7 +273,7 @@ GetSharedBufferEntry(Buffer buffer)
 		return &PrivateRefCountArray[PrivateRefCountEntryLast];
 	}
 
-	return GetPrivateRefCountEntrySlow(buffer, false);
+	return GetPrivateRefCountEntrySlow(buffer);
 }
 
 /*
@@ -333,9 +289,9 @@ SharedBufferCreateRef(Buffer buffer)
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	ReservePrivateRefCountEntry();
+	ReservePrivateRefCountEntry(buffer);
 	ref = NewPrivateRefCountEntry(buffer);
-	ref->data.refcount++;
+	ref->data = ONE_PRIVATE_REFERENCE;
 
 	return ref;
 }
@@ -348,8 +304,8 @@ static void
 SharedBufferRefExisting(PrivateRefCountEntry *ref)
 {
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
-	ref->data.refcount++;
+	Assert(!PrivateRefCountIsZero(ref->data));
+	ref->data += ONE_PRIVATE_REFERENCE;
 }
 
 /*
@@ -361,14 +317,14 @@ static bool
 SharedBufferUnref(PrivateRefCountEntry *ref)
 {
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
+	Assert(!PrivateRefCountIsZero(ref->data));
 
-	ref->data.refcount--;
+	ref->data -= ONE_PRIVATE_REFERENCE;
 
-	if (ref->data.refcount == 0)
+	if (PrivateRefCountIsZero(ref->data))
 	{
 		/* No more references - clean up the entry */
-		Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+		Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
 
 		if (ref >= &PrivateRefCountArray[0] &&
 			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
@@ -379,7 +335,8 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		}
 		else
 		{
-			refcount_delete_item(PrivateRefCountHash, ref);
+			/* could make slightly more efficient by using the pointer */
+			refcount_delete(PrivateRefCountHash, ref->buffer);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
 		}
@@ -390,25 +347,16 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 	return false;
 }
 
-/*
- * Accessors for refcount entry fields.
- */
-static int32
-SharedBufferRefCount(PrivateRefCountEntry *ref)
-{
-	return ref->data.refcount;
-}
-
 static BufferLockMode
 SharedBufferGetLockMode(PrivateRefCountEntry *ref)
 {
-	return ref->data.lockmode;
+	return (BufferLockMode)(ref->data & PRIVATEREFCOUNT_LOCKMODE_MASK);
 }
 
 static void
 SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
 {
-	ref->data.lockmode = mode;
+	ref->data = (ref->data & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | mode;
 }
 
 /*
@@ -417,7 +365,18 @@ SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
 static bool
 SharedBufferHasSingleRef(PrivateRefCountEntry *ref)
 {
-	return ref != NULL && ref->data.refcount == 1;
+	return ref != NULL &&
+		   ref->data >= ONE_PRIVATE_REFERENCE &&
+		   ref->data < (2 * ONE_PRIVATE_REFERENCE);
+}
+
+/*
+ * Get the refcount for a buffer entry (for debug output only).
+ */
+static int32
+SharedBufferRefCount(PrivateRefCountEntry *ref)
+{
+	return (int32)(ref->data >> 2);
 }
 
 /*
@@ -488,7 +447,7 @@ AssertBufferLocksPermitCatalogRead(void)
 		if (buf == InvalidBuffer)
 			continue;
 
-		AssertNotCatalogBufferLock(buf, res->data.lockmode);
+		AssertNotCatalogBufferLock(buf, SharedBufferGetLockMode(res));
 	}
 	FreePrivateRefCountIterator(iter);
 }
@@ -504,7 +463,7 @@ InitPrivateRefCountIterator(void)
 
 	iter->array_index = 0;
 	iter->in_hash = false;
-	iter->hash_iter_valid = false;
+	iter->hash_status = NULL;
 	return iter;
 }
 
@@ -530,20 +489,23 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		iter->in_hash = true;
 		if (PrivateRefCountOverflowed > 0)
 		{
-			refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
-			iter->hash_iter_valid = true;
+			refcount_iterator *hiter = palloc(sizeof(refcount_iterator));
+
+			refcount_start_iterate(PrivateRefCountHash, hiter);
+			iter->hash_status = hiter;
 		}
 	}
 
-	if (iter->hash_iter_valid)
+	if (iter->hash_status != NULL)
 	{
 		PrivateRefCountEntry *res;
 
-		res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
+		res = refcount_iterate(PrivateRefCountHash, (refcount_iterator *) iter->hash_status);
 		if (res != NULL)
 			return res;
 
-		iter->hash_iter_valid = false;
+		pfree(iter->hash_status);
+		iter->hash_status = NULL;
 	}
 
 	return NULL;
@@ -555,6 +517,8 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 void
 FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 {
+	if (iter->hash_status != NULL)
+		pfree(iter->hash_status);
 	pfree(iter);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr_refcnt.h b/src/backend/storage/buffer/bufmgr_refcnt.h
index f2a6f45d84..8d31f6ee6f 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.h
+++ b/src/backend/storage/buffer/bufmgr_refcnt.h
@@ -39,8 +39,8 @@ static void SharedBufferRefExisting(PrivateRefCountEntry *ref);
 static bool SharedBufferUnref(PrivateRefCountEntry *ref);
 static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref);
 static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode);
-static int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
 static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref);
+static int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
 
 /* Pin limiting */
 extern uint32 GetPinLimit(void);
-- 
2.34.1



  [application/x-patch] v2-0002-Refactoring-reference-counting.patch (54.0K, 5-v2-0002-Refactoring-reference-counting.patch)
  download | inline diff:
From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:05:50 +0000
Subject: [PATCH 2/5] Refactoring reference counting

This moves the private reference count logic out of
backend/storage/buffer/buffmgr.c that is still 8000+
lines long. Preparing for the changes to come.

---
 .gitignore                                 |   2 +
 src/backend/storage/buffer/Makefile        |   3 +
 src/backend/storage/buffer/bufmgr.c        | 704 +++------------------
 src/backend/storage/buffer/bufmgr_refcnt.c | 615 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr_refcnt.h |  59 ++
 5 files changed, 751 insertions(+), 632 deletions(-)
 create mode 100644 src/backend/storage/buffer/bufmgr_refcnt.c
 create mode 100644 src/backend/storage/buffer/bufmgr_refcnt.h

diff --git a/.gitignore b/.gitignore
index 4e911395fe..fddb7f861d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,5 @@ lib*.pc
 /Release/
 /tmp_install/
 /portlock/
+
+.*
\ No newline at end of file
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index fd7c40dcb0..6f45674cae 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -20,3 +20,6 @@ OBJS = \
 	localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
+
+# bufmgr.c includes bufmgr_refcnt.c for inlining
+bufmgr.o: bufmgr_refcnt.c bufmgr_refcnt.h
\ No newline at end of file
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0546ee0193..82af282f70 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -54,6 +54,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "bufmgr_refcnt.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -70,7 +71,6 @@
 #include "utils/timestamp.h"
 #include "utils/wait_event.h"
 
-
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
 #define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -93,43 +93,6 @@
  */
 #define BUF_DROP_FULL_SCAN_THRESHOLD		(uint64) (NBuffers / 32)
 
-/*
- * This is separated out from PrivateRefCountEntry to allow for copying all
- * the data members via struct assignment.
- */
-typedef struct PrivateRefCountData
-{
-	/*
-	 * How many times has the buffer been pinned by this backend.
-	 */
-	int32		refcount;
-
-	/*
-	 * Is the buffer locked by this backend? BUFFER_LOCK_UNLOCK indicates that
-	 * the buffer is not locked.
-	 */
-	BufferLockMode lockmode;
-} PrivateRefCountData;
-
-typedef struct PrivateRefCountEntry
-{
-	/*
-	 * Note that this needs to be same as the entry's corresponding
-	 * PrivateRefCountArrayKeys[i], if the entry is stored in the array. We
-	 * store it in both places as this is used for the hashtable key and
-	 * because it is more convenient (passing around a PrivateRefCountEntry
-	 * suffices to identify the buffer) and faster (checking the keys array is
-	 * faster when checking many entries, checking the entry is faster if just
-	 * checking a single entry).
-	 */
-	Buffer		buffer;
-
-	PrivateRefCountData data;
-} PrivateRefCountEntry;
-
-/* 64 bytes, about the size of a cache line on common systems */
-#define REFCOUNT_ARRAY_ENTRIES 8
-
 /*
  * Status of buffers to checkpoint for a particular tablespace, used
  * internally in BufferSync.
@@ -213,55 +176,6 @@ int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
-/*
- * Backend-Private refcount management:
- *
- * Each buffer also has a private refcount that keeps track of the number of
- * times the buffer is pinned in the current process.  This is so that the
- * shared refcount needs to be modified only once if a buffer is pinned more
- * than once by an individual backend.  It's also used to check that no
- * buffers are still pinned at the end of transactions and when exiting. We
- * also use this mechanism to track whether this backend has a buffer locked,
- * and, if so, in what mode.
- *
- *
- * To avoid - as we used to - requiring an array with NBuffers entries to keep
- * track of local buffers, we use a small sequentially searched array
- * (PrivateRefCountArrayKeys, with the corresponding data stored in
- * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
- * keep track of backend local pins.
- *
- * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
- * refcounts are kept track of in the array; after that, new array entries
- * displace old ones into the hash table. That way a frequently used entry
- * can't get "stuck" in the hashtable while infrequent ones clog the array.
- *
- * Note that in most scenarios the number of pinned buffers will not exceed
- * REFCOUNT_ARRAY_ENTRIES.
- *
- *
- * To enter a buffer into the refcount tracking mechanism first reserve a free
- * entry using ReservePrivateRefCountEntry() and then later, if necessary,
- * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
- * memory allocations in NewPrivateRefCountEntry() which can be important
- * because in some scenarios it's called with a spinlock held...
- */
-static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
-static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
-static HTAB *PrivateRefCountHash = NULL;
-static int32 PrivateRefCountOverflowed = 0;
-static uint32 PrivateRefCountClock = 0;
-static int	ReservedRefCountSlot = -1;
-static int	PrivateRefCountEntryLast = -1;
-
-static uint32 MaxProportionalPins;
-
-static void ReservePrivateRefCountEntry(void);
-static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
-static inline int32 GetPrivateRefCount(Buffer buffer);
-static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
-
 /* ResourceOwner callbacks to hold in-progress I/Os and buffer pins */
 static void ResOwnerReleaseBufferIO(Datum res);
 static char *ResOwnerPrintBufferIO(Datum res);
@@ -286,301 +200,6 @@ const ResourceOwnerDesc buffer_resowner_desc =
 	.DebugPrint = ResOwnerPrintBuffer
 };
 
-/*
- * Ensure that the PrivateRefCountArray has sufficient space to store one more
- * entry. This has to be called before using NewPrivateRefCountEntry() to fill
- * a new entry - but it's perfectly fine to not use a reserved entry.
- */
-static void
-ReservePrivateRefCountEntry(void)
-{
-	/* Already reserved (or freed), nothing to do */
-	if (ReservedRefCountSlot != -1)
-		return;
-
-	/*
-	 * First search for a free entry the array, that'll be sufficient in the
-	 * majority of cases.
-	 */
-	{
-		int			i;
-
-		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-		{
-			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
-			{
-				ReservedRefCountSlot = i;
-
-				/*
-				 * We could return immediately, but iterating till the end of
-				 * the array allows compiler-autovectorization.
-				 */
-			}
-		}
-
-		if (ReservedRefCountSlot != -1)
-			return;
-	}
-
-	/*
-	 * No luck. All array entries are full. Move one array entry into the hash
-	 * table.
-	 */
-	{
-		/*
-		 * Move entry from the current clock position in the array into the
-		 * hashtable. Use that slot.
-		 */
-		int			victim_slot;
-		PrivateRefCountEntry *victim_entry;
-		PrivateRefCountEntry *hashent;
-		bool		found;
-
-		/* select victim slot */
-		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
-		victim_entry = &PrivateRefCountArray[victim_slot];
-		ReservedRefCountSlot = victim_slot;
-
-		/* Better be used, otherwise we shouldn't get here. */
-		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
-		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
-		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
-
-		/* enter victim array entry into hashtable */
-		hashent = hash_search(PrivateRefCountHash,
-							  &PrivateRefCountArrayKeys[victim_slot],
-							  HASH_ENTER,
-							  &found);
-		Assert(!found);
-		/* move data from the entry in the array to the hash entry */
-		hashent->data = victim_entry->data;
-
-		/* clear the now free array slot */
-		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
-		victim_entry->buffer = InvalidBuffer;
-
-		/* clear the whole data member, just for future proofing */
-		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
-		victim_entry->data.refcount = 0;
-		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
-
-		PrivateRefCountOverflowed++;
-	}
-}
-
-/*
- * Fill a previously reserved refcount entry.
- */
-static PrivateRefCountEntry *
-NewPrivateRefCountEntry(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-
-	/* only allowed to be called when a reservation has been made */
-	Assert(ReservedRefCountSlot != -1);
-
-	/* use up the reserved entry */
-	res = &PrivateRefCountArray[ReservedRefCountSlot];
-
-	/* and fill it */
-	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-	res->buffer = buffer;
-	res->data.refcount = 0;
-	res->data.lockmode = BUFFER_LOCK_UNLOCK;
-
-	/* update cache for the next lookup */
-	PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-	ReservedRefCountSlot = -1;
-
-	return res;
-}
-
-/*
- * Slow-path for GetPrivateRefCountEntry(). This is big enough to not be worth
- * inlining. This particularly seems to be true if the compiler is capable of
- * auto-vectorizing the code, as that imposes additional stack-alignment
- * requirements etc.
- */
-static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
-{
-	PrivateRefCountEntry *res;
-	int			match = -1;
-	int			i;
-
-	/*
-	 * First search for references in the array, that'll be sufficient in the
-	 * majority of cases.
-	 */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] == buffer)
-		{
-			match = i;
-			/* see ReservePrivateRefCountEntry() for why we don't return */
-		}
-	}
-
-	if (likely(match != -1))
-	{
-		/* update cache for the next lookup */
-		PrivateRefCountEntryLast = match;
-
-		return &PrivateRefCountArray[match];
-	}
-
-	/*
-	 * By here we know that the buffer, if already pinned, isn't residing in
-	 * the array.
-	 *
-	 * Only look up the buffer in the hashtable if we've previously overflowed
-	 * into it.
-	 */
-	if (PrivateRefCountOverflowed == 0)
-		return NULL;
-
-	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
-
-	if (res == NULL)
-		return NULL;
-	else if (!do_move)
-	{
-		/* caller doesn't want us to move the hash entry into the array */
-		return res;
-	}
-	else
-	{
-		/* move buffer from hashtable into the free array slot */
-		bool		found;
-		PrivateRefCountEntry *free;
-
-		/* Ensure there's a free array slot */
-		ReservePrivateRefCountEntry();
-
-		/* Use up the reserved slot */
-		Assert(ReservedRefCountSlot != -1);
-		free = &PrivateRefCountArray[ReservedRefCountSlot];
-		Assert(PrivateRefCountArrayKeys[ReservedRefCountSlot] == free->buffer);
-		Assert(free->buffer == InvalidBuffer);
-
-		/* and fill it */
-		free->buffer = buffer;
-		free->data = res->data;
-		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-		/* update cache for the next lookup */
-		PrivateRefCountEntryLast = match;
-
-		ReservedRefCountSlot = -1;
-
-
-		/* delete from hashtable */
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-
-		return free;
-	}
-}
-
-/*
- * Return the PrivateRefCount entry for the passed buffer.
- *
- * Returns NULL if a buffer doesn't have a refcount entry. Otherwise, if
- * do_move is true, and the entry resides in the hashtable the entry is
- * optimized for frequent access by moving it to the array.
- */
-static inline PrivateRefCountEntry *
-GetPrivateRefCountEntry(Buffer buffer, bool do_move)
-{
-	Assert(BufferIsValid(buffer));
-	Assert(!BufferIsLocal(buffer));
-
-	/*
-	 * It's very common to look up the same buffer repeatedly. To make that
-	 * fast, we have a one-entry cache.
-	 *
-	 * In contrast to the loop in GetPrivateRefCountEntrySlow(), here it
-	 * faster to check PrivateRefCountArray[].buffer, as in the case of a hit
-	 * fewer addresses are computed and fewer cachelines are accessed. Whereas
-	 * in GetPrivateRefCountEntrySlow()'s case, checking
-	 * PrivateRefCountArrayKeys saves a lot of memory accesses.
-	 */
-	if (likely(PrivateRefCountEntryLast != -1) &&
-		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
-	{
-		return &PrivateRefCountArray[PrivateRefCountEntryLast];
-	}
-
-	/*
-	 * The code for the cached lookup is small enough to be worth inlining
-	 * into the caller. In the miss case however, that empirically doesn't
-	 * seem worth it.
-	 */
-	return GetPrivateRefCountEntrySlow(buffer, do_move);
-}
-
-/*
- * Returns how many times the passed buffer is pinned by this backend.
- *
- * Only works for shared memory buffers!
- */
-static inline int32
-GetPrivateRefCount(Buffer buffer)
-{
-	PrivateRefCountEntry *ref;
-
-	Assert(BufferIsValid(buffer));
-	Assert(!BufferIsLocal(buffer));
-
-	/*
-	 * Not moving the entry - that's ok for the current users, but we might
-	 * want to change this one day.
-	 */
-	ref = GetPrivateRefCountEntry(buffer, false);
-
-	if (ref == NULL)
-		return 0;
-	return ref->data.refcount;
-}
-
-/*
- * Release resources used to track the reference count of a buffer which we no
- * longer have pinned and don't want to pin again immediately.
- */
-static void
-ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
-{
-	Assert(ref->data.refcount == 0);
-	Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
-
-	if (ref >= &PrivateRefCountArray[0] &&
-		ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
-	{
-		ref->buffer = InvalidBuffer;
-		PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
-
-
-		/*
-		 * Mark the just used entry as reserved - in many scenarios that
-		 * allows us to avoid ever having to search the array/hash for free
-		 * entries.
-		 */
-		ReservedRefCountSlot = ref - PrivateRefCountArray;
-	}
-	else
-	{
-		bool		found;
-		Buffer		buffer = ref->buffer;
-
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-	}
-}
-
 /*
  * BufferIsPinned
  *		True iff the buffer is pinned (also checks for valid buffer number).
@@ -596,7 +215,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 		BufferIsLocal(bufnum) ? \
 			(LocalRefCount[-(bufnum) - 1] > 0) \
 		: \
-	(GetPrivateRefCount(bufnum) > 0) \
+	(GetSharedBufferEntry(bufnum) != NULL) \
 )
 
 
@@ -653,7 +272,6 @@ static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 										   RelFileLocator dstlocator,
 										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
-static void CheckForBufferLeaks(void);
 #ifdef USE_ASSERT_CHECKING
 static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode);
 #endif
@@ -812,7 +430,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	Assert(BufferIsValid(recent_buffer));
 
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 	InitBufferTag(&tag, &rlocator, forkNum, blockNum);
 
 	if (BufferIsLocal(recent_buffer))
@@ -2115,7 +1732,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Make sure we will have room to remember the buffer pin */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	/* create a tag so we can lookup the buffer */
 	InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -2327,7 +1943,7 @@ retry:
 		UnlockBufHdr(buf);
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
-		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
+		if (GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) != NULL)
 			elog(ERROR, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
@@ -2380,7 +1996,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
 	LWLock	   *partition_lock;
 	BufferTag	tag;
 
-	Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+	Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)) != NULL);
 
 	/* have buffer pinned, so it's safe to read tag without lock */
 	tag = buf_hdr->tag;
@@ -2461,7 +2077,6 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	 * Ensure, before we pin a victim buffer, that there's a free refcount
 	 * entry and resource owner slot for the pin.
 	 */
-	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/* we return here if a prospective victim buffer gets used concurrently */
@@ -2591,64 +2206,6 @@ again:
 	return buf;
 }
 
-/*
- * Return the maximum number of buffers that a backend should try to pin once,
- * to avoid exceeding its fair share.  This is the highest value that
- * GetAdditionalPinLimit() could ever return.  Note that it may be zero on a
- * system with a very small buffer pool relative to max_connections.
- */
-uint32
-GetPinLimit(void)
-{
-	return MaxProportionalPins;
-}
-
-/*
- * Return the maximum number of additional buffers that this backend should
- * pin if it wants to stay under the per-backend limit, considering the number
- * of buffers it has already pinned.  Unlike LimitAdditionalPins(), the limit
- * return by this function can be zero.
- */
-uint32
-GetAdditionalPinLimit(void)
-{
-	uint32		estimated_pins_held;
-
-	/*
-	 * We get the number of "overflowed" pins for free, but don't know the
-	 * number of pins in PrivateRefCountArray.  The cost of calculating that
-	 * exactly doesn't seem worth it, so just assume the max.
-	 */
-	estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
-
-	/* Is this backend already holding more than its fair share? */
-	if (estimated_pins_held > MaxProportionalPins)
-		return 0;
-
-	return MaxProportionalPins - estimated_pins_held;
-}
-
-/*
- * Limit the number of pins a batch operation may additionally acquire, to
- * avoid running out of pinnable buffers.
- *
- * One additional pin is always allowed, on the assumption that the operation
- * requires at least one to make progress.
- */
-void
-LimitAdditionalPins(uint32 *additional_pins)
-{
-	uint32		limit;
-
-	if (*additional_pins <= 1)
-		return;
-
-	limit = GetAdditionalPinLimit();
-	limit = Max(limit, 1);
-	if (limit < *additional_pins)
-		*additional_pins = limit;
-}
-
 /*
  * Logic shared between ExtendBufferedRelBy(), ExtendBufferedRelTo(). Just to
  * avoid duplicating the tracing and relpersistence related logic.
@@ -2812,7 +2369,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 		/* in case we need to pin an existing buffer below */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		InitBufferTag(&tag, &BMR_GET_SMGR(bmr)->smgr_rlocator.locator, fork,
 					  first_block + i);
@@ -3185,9 +2741,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 	PrivateRefCountEntry *ref;
 
 	Assert(!BufferIsLocal(b));
-	Assert(ReservedRefCountSlot != -1);
 
-	ref = GetPrivateRefCountEntry(b, true);
+	ref = GetSharedBufferEntry(b);
 
 	if (ref == NULL)
 	{
@@ -3257,8 +2812,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 		 */
 		result = (pg_atomic_read_u64(&buf->state) & BM_VALID) != 0;
 
-		Assert(ref->data.refcount > 0);
-		ref->data.refcount++;
+		SharedBufferRefExisting(ref);
 		ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
 	}
 
@@ -3296,7 +2850,7 @@ PinBuffer_Locked(BufferDesc *buf)
 	 * As explained, We don't expect any preexisting pins. That allows us to
 	 * manipulate the PrivateRefCount after releasing the spinlock
 	 */
-	Assert(GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf), false) == NULL);
+	Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) == NULL);
 
 	/*
 	 * Since we hold the buffer spinlock, we can update the buffer state and
@@ -3373,11 +2927,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
 	Assert(!BufferIsLocal(b));
 
 	/* not moving as we're likely deleting it soon anyway */
-	ref = GetPrivateRefCountEntry(b, false);
+	ref = GetSharedBufferEntry(b);
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
-	ref->data.refcount--;
-	if (ref->data.refcount == 0)
+
+	if (SharedBufferUnref(ref))
 	{
 		uint64		old_buf_state;
 
@@ -3402,8 +2955,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 		/* Support LockBufferForCleanup() */
 		if (old_buf_state & BM_PIN_COUNT_WAITER)
 			WakePinCountWaiter(buf);
-
-		ForgetPrivateRefCountEntry(ref);
 	}
 }
 
@@ -3414,10 +2965,7 @@ UnpinBufferNoOwner(BufferDesc *buf)
 inline void
 TrackNewBufferPin(Buffer buf)
 {
-	PrivateRefCountEntry *ref;
-
-	ref = NewPrivateRefCountEntry(buf);
-	ref->data.refcount++;
+	SharedBufferCreateRef(buf);
 
 	ResourceOwnerRememberBuffer(CurrentResourceOwner, buf);
 
@@ -4037,7 +3585,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	BufferTag	tag;
 
 	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/*
@@ -4101,11 +3648,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 void
 AtEOXact_Buffers(bool isCommit)
 {
-	CheckForBufferLeaks();
+	CheckPrivateRefCountLeaks();
 
 	AtEOXact_LocalBuffers(isCommit);
-
-	Assert(PrivateRefCountOverflowed == 0);
 }
 
 /*
@@ -4118,25 +3663,8 @@ AtEOXact_Buffers(bool isCommit)
 void
 InitBufferManagerAccess(void)
 {
-	HASHCTL		hash_ctl;
-
-	/*
-	 * An advisory limit on the number of pins each backend should hold, based
-	 * on shared_buffers and the maximum number of connections possible.
-	 * That's very pessimistic, but outside toy-sized shared_buffers it should
-	 * allow plenty of pins.  LimitAdditionalPins() and
-	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	 */
-	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
-	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
-	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
-
-	hash_ctl.keysize = sizeof(Buffer);
-	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
 
-	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
-									  HASH_ELEM | HASH_BLOBS);
+	InitPrivateRefCount();
 
 	/*
 	 * AtProcExit_Buffers needs LWLock access, and thereby has to be called at
@@ -4155,112 +3683,13 @@ AtProcExit_Buffers(int code, Datum arg)
 {
 	UnlockBuffers();
 
-	CheckForBufferLeaks();
+	CheckPrivateRefCountLeaks();
 
 	/* localbuf.c needs a chance too */
 	AtProcExit_LocalBuffers();
 }
 
-/*
- *		CheckForBufferLeaks - ensure this backend holds no buffer pins
- *
- *		As of PostgreSQL 8.0, buffer pins should get released by the
- *		ResourceOwner mechanism.  This routine is just a debugging
- *		cross-check that no pins remain.
- */
-static void
-CheckForBufferLeaks(void)
-{
 #ifdef USE_ASSERT_CHECKING
-	int			RefCountErrors = 0;
-	PrivateRefCountEntry *res;
-	int			i;
-	char	   *s;
-
-	/* check the array */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
-		{
-			res = &PrivateRefCountArray[i];
-
-			s = DebugPrintBufferRefcount(res->buffer);
-			elog(WARNING, "buffer refcount leak: %s", s);
-			pfree(s);
-
-			RefCountErrors++;
-		}
-	}
-
-	/* if necessary search the hash */
-	if (PrivateRefCountOverflowed)
-	{
-		HASH_SEQ_STATUS hstat;
-
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
-		{
-			s = DebugPrintBufferRefcount(res->buffer);
-			elog(WARNING, "buffer refcount leak: %s", s);
-			pfree(s);
-			RefCountErrors++;
-		}
-	}
-
-	Assert(RefCountErrors == 0);
-#endif
-}
-
-#ifdef USE_ASSERT_CHECKING
-/*
- * Check for exclusive-locked catalog buffers.  This is the core of
- * AssertCouldGetRelation().
- *
- * A backend would self-deadlock on the content lock if the catalog scan read
- * the exclusive-locked buffer.  The main threat is exclusive-locked buffers
- * of catalogs used in relcache, because a catcache search on any catalog may
- * build that catalog's relcache entry.  We don't have an inventory of
- * catalogs relcache uses, so just check buffers of most catalogs.
- *
- * It's better to minimize waits while holding an exclusive buffer lock, so it
- * would be nice to broaden this check not to be catalog-specific.  However,
- * bttextcmp() accesses pg_collation, and non-core opclasses might similarly
- * read tables.  That is deadlock-free as long as there's no loop in the
- * dependency graph: modifying table A may cause an opclass to read table B,
- * but it must not cause a read of table A.
- */
-void
-AssertBufferLocksPermitCatalogRead(void)
-{
-	PrivateRefCountEntry *res;
-
-	/* check the array */
-	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
-		{
-			res = &PrivateRefCountArray[i];
-
-			if (res->buffer == InvalidBuffer)
-				continue;
-
-			AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
-		}
-	}
-
-	/* if necessary search the hash */
-	if (PrivateRefCountOverflowed)
-	{
-		HASH_SEQ_STATUS hstat;
-
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
-		{
-			AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
-		}
-	}
-}
-
 static void
 AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode)
 {
@@ -4312,8 +3741,10 @@ DebugPrintBufferRefcount(Buffer buffer)
 	}
 	else
 	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
 		buf = GetBufferDescriptor(buffer - 1);
-		loccount = GetPrivateRefCount(buffer);
+		loccount = ref ? SharedBufferRefCount(ref) : 0;
 		backend = INVALID_PROC_NUMBER;
 	}
 
@@ -5100,7 +4531,6 @@ FlushRelationBuffers(Relation rel)
 				error_context_stack = &errcallback;
 
 				/* Make sure we can handle the pin */
-				ReservePrivateRefCountEntry();
 				ResourceOwnerEnlarge(CurrentResourceOwner);
 
 				/*
@@ -5136,7 +4566,6 @@ FlushRelationBuffers(Relation rel)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5231,7 +4660,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5457,7 +4885,6 @@ FlushDatabaseBuffers(Oid dbid)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5532,17 +4959,18 @@ UnlockReleaseBuffer(Buffer buffer)
 void
 IncrBufferRefCount(Buffer buffer)
 {
-	Assert(BufferIsPinned(buffer));
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 	if (BufferIsLocal(buffer))
+	{
+		Assert(LocalRefCount[-buffer - 1] > 0);
 		LocalRefCount[-buffer - 1]++;
+	}
 	else
 	{
-		PrivateRefCountEntry *ref;
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
 
-		ref = GetPrivateRefCountEntry(buffer, true);
 		Assert(ref != NULL);
-		ref->data.refcount++;
+		SharedBufferRefExisting(ref);
 	}
 	ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
 }
@@ -5561,11 +4989,9 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
 {
 	Page		page = BufferGetPage(buffer);
 
-	Assert(GetPrivateRefCount(buffer) > 0);
-
+	Assert(GetSharedBufferEntry(buffer) != NULL);
 	/* here, either share-exclusive or exclusive lock is OK */
-	Assert(BufferLockHeldByMeInMode(bufHdr, BUFFER_LOCK_EXCLUSIVE) ||
-		   BufferLockHeldByMeInMode(bufHdr, BUFFER_LOCK_SHARE_EXCLUSIVE));
+	Assert(BufferIsLockedByMe(buffer));
 
 	/*
 	 * This routine might get called many times on the same page, if we are
@@ -5777,12 +5203,12 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	 * Get reference to the refcount entry before we hold the lock, it seems
 	 * better to do before holding the lock.
 	 */
-	entry = GetPrivateRefCountEntry(buffer, true);
+	entry = GetSharedBufferEntry(buffer);
 
 	/*
 	 * We better not already hold a lock on the buffer.
 	 */
-	Assert(entry->data.lockmode == BUFFER_LOCK_UNLOCK);
+	Assert(SharedBufferGetLockMode(entry) == BUFFER_LOCK_UNLOCK);
 
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
@@ -5871,7 +5297,7 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	}
 
 	/* Remember that we now hold this lock */
-	entry->data.lockmode = mode;
+	SharedBufferSetLockMode(entry, mode);
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -5922,7 +5348,7 @@ BufferLockUnlock(Buffer buffer, BufferDesc *buf_hdr)
 static bool
 BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 {
-	PrivateRefCountEntry *entry = GetPrivateRefCountEntry(buffer, true);
+	PrivateRefCountEntry *entry = GetSharedBufferEntry(buffer);
 	bool		mustwait;
 
 	/*
@@ -5930,7 +5356,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	 * already has locked, return false, independent of the existing and
 	 * desired lock level.
 	 */
-	if (entry->data.lockmode != BUFFER_LOCK_UNLOCK)
+	if (SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK)
 		return false;
 
 	/*
@@ -5950,7 +5376,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	}
 	else
 	{
-		entry->data.lockmode = mode;
+		SharedBufferSetLockMode(entry, mode);
 	}
 
 	return !mustwait;
@@ -6160,11 +5586,11 @@ BufferLockDisownInternal(Buffer buffer, BufferDesc *buf_hdr)
 	BufferLockMode mode;
 	PrivateRefCountEntry *ref;
 
-	ref = GetPrivateRefCountEntry(buffer, false);
+	ref = GetSharedBufferEntry(buffer);
 	if (ref == NULL)
 		elog(ERROR, "lock %d is not held", buffer);
-	mode = ref->data.lockmode;
-	ref->data.lockmode = BUFFER_LOCK_UNLOCK;
+	mode = SharedBufferGetLockMode(ref);
+	SharedBufferSetLockMode(ref, BUFFER_LOCK_UNLOCK);
 
 	return mode;
 }
@@ -6398,12 +5824,12 @@ static bool
 BufferLockHeldByMeInMode(BufferDesc *buf_hdr, BufferLockMode mode)
 {
 	PrivateRefCountEntry *entry =
-		GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+		GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
 
 	if (!entry)
 		return false;
 	else
-		return entry->data.lockmode == mode;
+		return SharedBufferGetLockMode(entry) == mode;
 }
 
 /*
@@ -6416,12 +5842,12 @@ static bool
 BufferLockHeldByMe(BufferDesc *buf_hdr)
 {
 	PrivateRefCountEntry *entry =
-		GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+		GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
 
 	if (!entry)
 		return false;
 	else
-		return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
+		return SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK;
 }
 
 /*
@@ -6517,9 +5943,13 @@ CheckBufferIsPinnedOnce(Buffer buffer)
 	}
 	else
 	{
-		if (GetPrivateRefCount(buffer) != 1)
-			elog(ERROR, "incorrect local pin count: %d",
-				 GetPrivateRefCount(buffer));
+		{
+			PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+			if (!ref || !SharedBufferHasSingleRef(ref))
+				elog(ERROR, "incorrect pin count: %d",
+					 ref ? SharedBufferRefCount(ref) : 0);
+		}
 	}
 }
 
@@ -6700,7 +6130,7 @@ HoldingBufferPinThatDelaysRecovery(void)
 	if (bufid < 0)
 		return false;
 
-	if (GetPrivateRefCount(bufid + 1) > 0)
+	if (GetSharedBufferEntry(bufid + 1) != NULL)
 		return true;
 
 	return false;
@@ -6735,10 +6165,13 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	}
 
 	/* There should be exactly one local pin */
-	refcount = GetPrivateRefCount(buffer);
-	Assert(refcount);
-	if (refcount != 1)
-		return false;
+	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+		Assert(ref != NULL);
+		if (!SharedBufferHasSingleRef(ref))
+			return false;
+	}
 
 	/* Try to acquire lock */
 	if (!ConditionalLockBuffer(buffer))
@@ -6790,8 +6223,12 @@ IsBufferCleanupOK(Buffer buffer)
 	}
 
 	/* There should be exactly one local pin */
-	if (GetPrivateRefCount(buffer) != 1)
-		return false;
+	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+		if (!SharedBufferHasSingleRef(ref))
+			return false;
+	}
 
 	bufHdr = GetBufferDescriptor(buffer - 1);
 
@@ -6827,12 +6264,12 @@ SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta
 	PrivateRefCountEntry *ref;
 	BufferLockMode mode;
 
-	ref = GetPrivateRefCountEntry(buffer, true);
+	ref = GetSharedBufferEntry(buffer);
 
 	if (ref == NULL)
 		elog(ERROR, "buffer is not pinned");
 
-	mode = ref->data.lockmode;
+	mode = SharedBufferGetLockMode(ref);
 	if (mode == BUFFER_LOCK_UNLOCK)
 		elog(ERROR, "buffer is not locked");
 
@@ -6874,7 +6311,7 @@ SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta
 		if (likely(pg_atomic_compare_exchange_u64(&buf_hdr->state,
 												  &old_state, desired_state)))
 		{
-			ref->data.lockmode = BUFFER_LOCK_SHARE_EXCLUSIVE;
+			SharedBufferSetLockMode(ref, BUFFER_LOCK_SHARE_EXCLUSIVE);
 			*lockstate = desired_state;
 
 			return true;
@@ -7647,7 +7084,7 @@ ResOwnerReleaseBuffer(Datum res)
 	{
 		PrivateRefCountEntry *ref;
 
-		ref = GetPrivateRefCountEntry(buffer, false);
+		ref = GetSharedBufferEntry(buffer);
 
 		/* not having a private refcount would imply resowner corruption */
 		Assert(ref != NULL);
@@ -7656,7 +7093,7 @@ ResOwnerReleaseBuffer(Datum res)
 		 * If the buffer was locked at the time of the resowner release,
 		 * release the lock now. This should only happen after errors.
 		 */
-		if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
+		if (SharedBufferGetLockMode(ref) != BUFFER_LOCK_UNLOCK)
 		{
 			BufferDesc *buf = GetBufferDescriptor(buffer - 1);
 
@@ -7749,7 +7186,6 @@ EvictUnpinnedBuffer(Buffer buf, bool *buffer_flushed)
 
 	/* Make sure we can pin the buffer. */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	desc = GetBufferDescriptor(buf - 1);
 	LockBufHdr(desc);
@@ -7790,7 +7226,6 @@ EvictAllUnpinnedBuffers(int32 *buffers_evicted, int32 *buffers_flushed,
 			continue;
 
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		LockBufHdr(desc);
 
@@ -7844,7 +7279,6 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted,
 
 		/* Make sure we can pin the buffer. */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(desc);
 
@@ -7936,7 +7370,6 @@ MarkDirtyUnpinnedBuffer(Buffer buf, bool *buffer_already_dirty)
 
 	/* Make sure we can pin the buffer. */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	desc = GetBufferDescriptor(buf - 1);
 	LockBufHdr(desc);
@@ -7989,7 +7422,6 @@ MarkDirtyRelUnpinnedBuffers(Relation rel,
 
 		/* Make sure we can pin the buffer. */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(desc);
 
@@ -8041,7 +7473,6 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied,
 			continue;
 
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		LockBufHdr(desc);
 
@@ -8740,3 +8171,12 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+
+/*
+ * Intentionally included at the bottom so that the compiler can inline 
+ * functions but developers are forced to use accessors.
+ * bufmgr.c should not be concerned with bufmgr_refcnt.c implementation
+ * details.
+ */
+ #include "bufmgr_refcnt.c"
diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
new file mode 100644
index 0000000000..fef8f98037
--- /dev/null
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -0,0 +1,615 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmgr_refcnt.c
+ *	  Backend-private buffer refcount tracking
+ *
+ * This file is included at the end of bufmgr.c to allow the compiler
+ * to inline functions. Do not compile this file separately.
+ *
+ * Each buffer has a private refcount that keeps track of the number of
+ * times the buffer is pinned in the current process.  This is so that the
+ * shared refcount needs to be modified only once if a buffer is pinned more
+ * than once by an individual backend. This mechanism is also used to track
+ * whether this backend has a buffer locked, and, if so, in what mode.
+ *
+ * To avoid - as we used to - requiring an array with NBuffers entries to keep
+ * track of local buffers, we use a small sequentially searched array
+ * (PrivateRefCountArrayKeys, with the corresponding data stored in
+ * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
+ * keep track of backend local pins.
+ *
+ * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
+ * refcounts are kept track of in the array; after that, new array entries
+ * displace old ones into the hash table. That way a frequently used entry
+ * can't get "stuck" in the hashtable while infrequent ones clog the array.
+ *
+ * This was initially designed trying to optimize for the case where the
+ * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES.
+ * However this might not be the case with the introduction of prefetching.
+ *
+ * To enter a buffer into the refcount tracking mechanism first reserve a free
+ * entry using ReservePrivateRefCountEntry() and then later, if necessary,
+ * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
+ * memory allocations in NewPrivateRefCountEntry() which can be important
+ * because in some scenarios it's called with a spinlock held...
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmgr_refcnt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "utils/hsearch.h"
+
+/* Structure definitions - internal to this file */
+typedef struct PrivateRefCountData
+{
+	int32		refcount;
+	BufferLockMode lockmode;
+} PrivateRefCountData;
+
+struct PrivateRefCountEntry
+{
+	Buffer		buffer;
+	PrivateRefCountData data;
+};
+
+struct PrivateRefCountIterator
+{
+	int			array_index;
+	bool		in_hash;
+	HASH_SEQ_STATUS *hash_status;
+};
+
+/* Private refcount array and keys */
+#define REFCOUNT_ARRAY_ENTRIES 8
+static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
+static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
+
+/* Overflow hash table for when array is full */
+static HTAB *PrivateRefCountHash = NULL;
+
+/* Count of entries that have overflowed into the hash table */
+static int32 PrivateRefCountOverflowed = 0;
+
+/* Clock hand for selecting victim when array is full */
+static uint32 PrivateRefCountClock = 0;
+
+/* Reserved slot index, or -1 if none reserved */
+static int	ReservedRefCountSlot = -1;
+
+/* Cache for last accessed entry */
+static int	PrivateRefCountEntryLast = -1;
+
+/* Advisory limit on the number of pins each backend should hold */
+static uint32 MaxProportionalPins = 0;
+
+/* Forward declarations */
+static void ReservePrivateRefCountEntry(void);
+static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+
+/*
+ * Initialize private refcount tracking for this backend.
+ */
+void
+InitPrivateRefCount(void)
+{
+	HASHCTL		hash_ctl;
+
+
+	/*
+	 * An advisory limit on the number of pins each backend should hold, based
+	 * on shared_buffers and the maximum number of connections possible.
+	 * That's very pessimistic, but outside toy-sized shared_buffers it should
+	 * allow plenty of pins.  LimitAdditionalPins() and
+	 * GetAdditionalPinLimit() can be used to check the remaining balance.
+	 */
+	 MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+
+	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
+	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
+
+	hash_ctl.keysize = sizeof(Buffer);
+	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
+
+	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Ensure that the PrivateRefCountArray has sufficient space to store one more
+ * entry.
+ */
+static void
+ReservePrivateRefCountEntry(void)
+{
+	/* Already reserved (or freed), nothing to do */
+	if (ReservedRefCountSlot != -1)
+		return;
+
+	/*
+	 * First search for a free entry the array, that'll be sufficient in the
+	 * majority of cases.
+	 */
+	{
+		int			i;
+
+		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+		{
+			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
+			{
+				ReservedRefCountSlot = i;
+			}
+		}
+
+		if (ReservedRefCountSlot != -1)
+			return;
+	}
+
+	/*
+	 * No luck. All array entries are full. Move one array entry into the hash
+	 * table.
+	 */
+	{
+		int			victim_slot;
+		PrivateRefCountEntry *victim_entry;
+		PrivateRefCountEntry *hashent;
+		bool		found;
+
+		/* select victim slot */
+		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
+		victim_entry = &PrivateRefCountArray[victim_slot];
+		ReservedRefCountSlot = victim_slot;
+
+		/* Better be used, otherwise we shouldn't get here. */
+		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
+		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
+		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
+
+		/* enter victim array entry into hashtable */
+		hashent = hash_search(PrivateRefCountHash,
+							  &PrivateRefCountArrayKeys[victim_slot],
+							  HASH_ENTER,
+							  &found);
+		Assert(!found);
+		hashent->data = victim_entry->data;
+
+		/* clear the now free array slot */
+		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
+		victim_entry->buffer = InvalidBuffer;
+
+		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
+		victim_entry->data.refcount = 0;
+		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+		PrivateRefCountOverflowed++;
+	}
+}
+
+/*
+ * Create a new refcount entry for the given buffer.
+ */
+static PrivateRefCountEntry *
+NewPrivateRefCountEntry(Buffer buffer)
+{
+	PrivateRefCountEntry *res;
+
+	/* only allowed to be called when a reservation has been made */
+	Assert(ReservedRefCountSlot != -1);
+
+	/* use up the reserved entry */
+	res = &PrivateRefCountArray[ReservedRefCountSlot];
+
+	/* and fill it */
+	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+	res->buffer = buffer;
+	res->data.refcount = 0;
+	res->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+	/* update cache for the next lookup */
+	PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+	ReservedRefCountSlot = -1;
+
+	return res;
+}
+
+/*
+ * Slow-path for GetSharedBufferEntry().
+ */
+static pg_noinline PrivateRefCountEntry *
+GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+{
+	PrivateRefCountEntry *res;
+	int			match = -1;
+	int			i;
+
+	/*
+	 * First search for references in the array.
+	 */
+	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	{
+		if (PrivateRefCountArrayKeys[i] == buffer)
+		{
+			match = i;
+		}
+	}
+
+	if (likely(match != -1))
+	{
+		PrivateRefCountEntryLast = match;
+		return &PrivateRefCountArray[match];
+	}
+
+	/*
+	 * Only look up the buffer in the hashtable if we've previously overflowed.
+	 */
+	if (PrivateRefCountOverflowed == 0)
+		return NULL;
+
+	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+
+	if (res == NULL)
+		return NULL;
+	else if (!do_move)
+	{
+		return res;
+	}
+	else
+	{
+		/* move buffer from hashtable into the free array slot */
+		bool		found;
+		PrivateRefCountEntry *free;
+
+		ReservePrivateRefCountEntry();
+
+		Assert(ReservedRefCountSlot != -1);
+		free = &PrivateRefCountArray[ReservedRefCountSlot];
+		Assert(free->buffer == InvalidBuffer);
+
+		free->buffer = buffer;
+		free->data = res->data;
+		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+		PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+		ReservedRefCountSlot = -1;
+
+		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+		Assert(found);
+		Assert(PrivateRefCountOverflowed > 0);
+		PrivateRefCountOverflowed--;
+
+		return free;
+	}
+}
+
+/*
+ * Return the PrivateRefCountEntry for the passed buffer.
+ * Returns NULL if the buffer is not currently pinned.
+ */
+static PrivateRefCountEntry *
+GetSharedBufferEntry(Buffer buffer)
+{
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	/* Fast path: check one-entry cache */
+	if (likely(PrivateRefCountEntryLast != -1) &&
+		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
+	{
+		return &PrivateRefCountArray[PrivateRefCountEntryLast];
+	}
+
+	return GetPrivateRefCountEntrySlow(buffer, false);
+}
+
+/*
+ * Create a new refcount entry for a buffer that is known to not be pinned.
+ * This is a fast path that skips the cache/hash lookup.
+ * Returns the new entry pointer with refcount already incremented.
+ */
+static PrivateRefCountEntry *
+SharedBufferCreateRef(Buffer buffer)
+{
+	PrivateRefCountEntry *ref;
+
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	ReservePrivateRefCountEntry();
+	ref = NewPrivateRefCountEntry(buffer);
+	ref->data.refcount++;
+
+	return ref;
+}
+
+/*
+ * Increment the private refcount for an existing entry.
+ * Use when you already have the entry from a previous lookup.
+ */
+static void
+SharedBufferRefExisting(PrivateRefCountEntry *ref)
+{
+	Assert(ref != NULL);
+	Assert(ref->data.refcount > 0);
+	ref->data.refcount++;
+}
+
+/*
+ * Decrement the private refcount for a buffer.
+ * If the refcount reaches zero, removes the entry and returns true.
+ * Returns false if the buffer still has references.
+ */
+static bool
+SharedBufferUnref(PrivateRefCountEntry *ref)
+{
+	Assert(ref != NULL);
+	Assert(ref->data.refcount > 0);
+
+	ref->data.refcount--;
+
+	if (ref->data.refcount == 0)
+	{
+		/* No more references - clean up the entry */
+		Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+
+		if (ref >= &PrivateRefCountArray[0] &&
+			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
+		{
+			ref->buffer = InvalidBuffer;
+			PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
+			ReservedRefCountSlot = ref - PrivateRefCountArray;
+		}
+		else
+		{
+			bool		found;
+			Buffer		buffer = ref->buffer;
+
+			hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+			Assert(found);
+			Assert(PrivateRefCountOverflowed > 0);
+			PrivateRefCountOverflowed--;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Accessors for refcount entry fields.
+ */
+static int32
+SharedBufferRefCount(PrivateRefCountEntry *ref)
+{
+	return ref->data.refcount;
+}
+
+static BufferLockMode
+SharedBufferGetLockMode(PrivateRefCountEntry *ref)
+{
+	return ref->data.lockmode;
+}
+
+static void
+SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
+{
+	ref->data.lockmode = mode;
+}
+
+/*
+ * Check if the entry has exactly one reference.
+ */
+static bool
+SharedBufferHasSingleRef(PrivateRefCountEntry *ref)
+{
+	return ref != NULL && ref->data.refcount == 1;
+}
+
+/*
+ * Check for buffer refcount leaks.
+ */
+void
+CheckPrivateRefCountLeaks(void)
+{
+#ifdef USE_ASSERT_CHECKING
+	int			RefCountErrors = 0;
+	PrivateRefCountEntry *res;
+	int			i;
+	char	   *s;
+
+	/* check the array */
+	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	{
+		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
+		{
+			res = &PrivateRefCountArray[i];
+
+			s = DebugPrintBufferRefcount(res->buffer);
+			elog(WARNING, "buffer refcount leak: %s", s);
+			pfree(s);
+
+			RefCountErrors++;
+		}
+	}
+
+	/* if necessary search the hash */
+	if (PrivateRefCountOverflowed)
+	{
+		HASH_SEQ_STATUS hstat;
+
+		hash_seq_init(&hstat, PrivateRefCountHash);
+		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+		{
+			s = DebugPrintBufferRefcount(res->buffer);
+			elog(WARNING, "buffer refcount leak: %s", s);
+			pfree(s);
+			RefCountErrors++;
+		}
+	}
+
+	Assert(RefCountErrors == 0);
+#endif
+}
+
+#ifdef USE_ASSERT_CHECKING
+/* Forward declaration - defined in bufmgr.c */
+static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode);
+
+/*
+ * Check for exclusive-locked catalog buffers.  This is the core of
+ * AssertCouldGetRelation().
+ */
+void
+AssertBufferLocksPermitCatalogRead(void)
+{
+	PrivateRefCountIterator *iter;
+	PrivateRefCountEntry *res;
+
+	iter = InitPrivateRefCountIterator();
+	while ((res = GetNextPrivateRefCountEntry(iter)) != NULL)
+	{
+		Buffer		buf = res->buffer;
+
+		if (buf == InvalidBuffer)
+			continue;
+
+		AssertNotCatalogBufferLock(buf, res->data.lockmode);
+	}
+	FreePrivateRefCountIterator(iter);
+}
+#endif
+
+/*
+ * Initialize an iterator for walking all private refcount entries.
+ */
+PrivateRefCountIterator *
+InitPrivateRefCountIterator(void)
+{
+	PrivateRefCountIterator *iter = palloc(sizeof(PrivateRefCountIterator));
+
+	iter->array_index = 0;
+	iter->in_hash = false;
+	iter->hash_status = NULL;
+	return iter;
+}
+
+/*
+ * Get the next private refcount entry.
+ * Returns NULL when iteration is complete.
+ */
+PrivateRefCountEntry *
+GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
+{
+	/* First iterate through the array */
+	while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
+	{
+		int			idx = iter->array_index++;
+
+		if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
+			return &PrivateRefCountArray[idx];
+	}
+
+	/* Then iterate through the hash if there are overflowed entries */
+	if (!iter->in_hash)
+	{
+		iter->in_hash = true;
+		if (PrivateRefCountOverflowed > 0)
+		{
+			iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
+			hash_seq_init(iter->hash_status, PrivateRefCountHash);
+		}
+	}
+
+	if (iter->hash_status != NULL)
+	{
+		PrivateRefCountEntry *res;
+
+		res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+		if (res != NULL)
+			return res;
+
+		pfree(iter->hash_status);
+		iter->hash_status = NULL;
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an iterator from InitPrivateRefCountIterator.
+ */
+void
+FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
+{
+	if (iter->hash_status != NULL)
+	{
+		hash_seq_term(iter->hash_status);
+		pfree(iter->hash_status);
+	}
+	pfree(iter);
+}
+
+
+/*
+ * Return the maximum number of buffers that a backend should try to pin once,
+ * to avoid exceeding its fair share.  This is the highest value that
+ * GetAdditionalPinLimit() could ever return.  Note that it may be zero on a
+ * system with a very small buffer pool relative to max_connections.
+ */
+ uint32
+ GetPinLimit(void)
+ {
+	 return MaxProportionalPins;
+ }
+
+ /*
+  * Return the maximum number of additional buffers that this backend should
+  * pin if it wants to stay under the per-backend limit, considering the number
+  * of buffers it has already pinned.  Unlike LimitAdditionalPins(), the limit
+  * return by this function can be zero.
+  */
+ uint32
+ GetAdditionalPinLimit(void)
+ {
+	 uint32		estimated_pins_held;
+
+	 /*
+	  * We get the number of "overflowed" pins for free, but don't know the
+	  * number of pins in PrivateRefCountArray.  The cost of calculating that
+	  * exactly doesn't seem worth it, so just assume the max.
+	  */
+	 estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+	 /* Is this backend already holding more than its fair share? */
+	 if (estimated_pins_held > MaxProportionalPins)
+		 return 0;
+
+	 return MaxProportionalPins - estimated_pins_held;
+ }
+
+ /*
+  * Limit the number of pins a batch operation may additionally acquire, to
+  * avoid running out of pinnable buffers.
+  *
+  * One additional pin is always allowed, on the assumption that the operation
+  * requires at least one to make progress.
+  */
+ void
+ LimitAdditionalPins(uint32 *additional_pins)
+ {
+	 uint32		limit;
+
+	 if (*additional_pins <= 1)
+		 return;
+
+	 limit = GetAdditionalPinLimit();
+	 limit = Max(limit, 1);
+	 if (limit < *additional_pins)
+		 *additional_pins = limit;
+ }
diff --git a/src/backend/storage/buffer/bufmgr_refcnt.h b/src/backend/storage/buffer/bufmgr_refcnt.h
new file mode 100644
index 0000000000..f2a6f45d84
--- /dev/null
+++ b/src/backend/storage/buffer/bufmgr_refcnt.h
@@ -0,0 +1,59 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmgr_refcnt.h
+ *	  Backend-private buffer refcount tracking
+ *
+ * This header provides opaque declarations for the private refcount
+ * tracking system. The implementation is in bufmgr_refcnt.c which is
+ * included at the end of bufmgr.c for inlining.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/storage/buffer/bufmgr_refcnt.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFMGR_REFCNT_H
+#define BUFMGR_REFCNT_H
+
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+
+/* Opaque handle to a private refcount entry */
+typedef struct PrivateRefCountEntry PrivateRefCountEntry;
+
+/* Opaque handle to an iterator */
+typedef struct PrivateRefCountIterator PrivateRefCountIterator;
+
+/* Initialization */
+extern void InitPrivateRefCount(void);
+
+/*
+ * Hot-path functions - forward declarations.
+ * Defined as static in bufmgr_refcnt.c which is included in bufmgr.c.
+ */
+static PrivateRefCountEntry *GetSharedBufferEntry(Buffer buffer);
+static PrivateRefCountEntry *SharedBufferCreateRef(Buffer buffer);
+static void SharedBufferRefExisting(PrivateRefCountEntry *ref);
+static bool SharedBufferUnref(PrivateRefCountEntry *ref);
+static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref);
+static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode);
+static int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
+static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref);
+
+/* Pin limiting */
+extern uint32 GetPinLimit(void);
+extern uint32 GetAdditionalPinLimit(void);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+
+/* Leak checking */
+extern void CheckPrivateRefCountLeaks(void);
+
+/* Iterator functions */
+extern PrivateRefCountIterator *InitPrivateRefCountIterator(void);
+extern PrivateRefCountEntry *GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter);
+extern void FreePrivateRefCountIterator(PrivateRefCountIterator *iter);
+
+
+#endif							/* BUFMGR_REFCNT_H */
-- 
2.34.1



  [application/x-patch] v2-0003-Using-simplehash.patch (13.2K, 6-v2-0003-Using-simplehash.patch)
  download | inline diff:
From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:07:54 +0000
Subject: [PATCH 3/5] Using simplehash

This patch replaces the HTAB implementation with a simplehash
as suggested by Andres Freund [1]. The simplehash implements a templated
open addressing hash, which have entry size information at compile time.

The access strategy of the simplehash is very close to the plain array
that was seen to be very efficient compared to the HTAB implementation,
with the additional advantage of using the key value to initialize the
search closer to where the key actually is, instead of always scanning
the entire array.

Adds one macro to the simplehash implementation enable overriding the
empty position detection, to enable using buffer == InvalidBuffer
instead of requiring a separate status field.

[1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5mw4

---
 src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++-----------
 src/include/lib/simplehash.h               | 59 +++++++++-----
 2 files changed, 84 insertions(+), 64 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
index fef8f98037..dd95d953f6 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.c
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -42,7 +42,10 @@
  *-------------------------------------------------------------------------
  */
 
-#include "utils/hsearch.h"
+#include "common/hashfn.h"
+#include "storage/buf_internals.h"
+#include "bufmgr_refcnt.h"
+#include "storage/proc.h"
 
 /* Structure definitions - internal to this file */
 typedef struct PrivateRefCountData
@@ -53,15 +56,36 @@ typedef struct PrivateRefCountData
 
 struct PrivateRefCountEntry
 {
-	Buffer		buffer;
+	Buffer		buffer;		/* hash key - InvalidBuffer means empty */
 	PrivateRefCountData data;
 };
 
+/*
+ * Define simplehash parameters for the overflow hash table.
+ * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
+ * a separate status field.
+ */
+#define SH_PREFIX refcount
+#define SH_ELEMENT_TYPE PrivateRefCountEntry
+#define SH_KEY_TYPE Buffer
+#define SH_KEY buffer
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+/* Use buffer field as empty marker - no separate status needed */
+#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
+#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
+#define SH_MAKE_IN_USE(entry) ((void)0)  /* key assignment handles this */
+#include "lib/simplehash.h"
+
 struct PrivateRefCountIterator
 {
 	int			array_index;
 	bool		in_hash;
-	HASH_SEQ_STATUS *hash_status;
+	refcount_iterator hash_iter;
+	bool		hash_iter_valid;
 };
 
 /* Private refcount array and keys */
@@ -70,7 +94,7 @@ static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
 static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
 
 /* Overflow hash table for when array is full */
-static HTAB *PrivateRefCountHash = NULL;
+static refcount_hash *PrivateRefCountHash = NULL;
 
 /* Count of entries that have overflowed into the hash table */
 static int32 PrivateRefCountOverflowed = 0;
@@ -98,26 +122,18 @@ static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buff
 void
 InitPrivateRefCount(void)
 {
-	HASHCTL		hash_ctl;
-
-
 	/*
 	 * An advisory limit on the number of pins each backend should hold, based
 	 * on shared_buffers and the maximum number of connections possible.
 	 * That's very pessimistic, but outside toy-sized shared_buffers it should
 	 * allow plenty of pins.  LimitAdditionalPins() and
 	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	 */
-	 MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
+	*/
+	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	hash_ctl.keysize = sizeof(Buffer);
-	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
-
-	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
-									  HASH_ELEM | HASH_BLOBS);
+	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
 }
 
 /*
@@ -171,10 +187,9 @@ ReservePrivateRefCountEntry(void)
 		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
 
 		/* enter victim array entry into hashtable */
-		hashent = hash_search(PrivateRefCountHash,
-							  &PrivateRefCountArrayKeys[victim_slot],
-							  HASH_ENTER,
-							  &found);
+		hashent = refcount_insert(PrivateRefCountHash,
+								  PrivateRefCountArrayKeys[victim_slot],
+								  &found);
 		Assert(!found);
 		hashent->data = victim_entry->data;
 
@@ -251,7 +266,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	if (PrivateRefCountOverflowed == 0)
 		return NULL;
 
-	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+	res = refcount_lookup(PrivateRefCountHash, buffer);
 
 	if (res == NULL)
 		return NULL;
@@ -262,7 +277,6 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	else
 	{
 		/* move buffer from hashtable into the free array slot */
-		bool		found;
 		PrivateRefCountEntry *free;
 
 		ReservePrivateRefCountEntry();
@@ -278,8 +292,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 
 		ReservedRefCountSlot = -1;
 
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
+		refcount_delete_item(PrivateRefCountHash, res);
 		Assert(PrivateRefCountOverflowed > 0);
 		PrivateRefCountOverflowed--;
 
@@ -366,11 +379,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		}
 		else
 		{
-			bool		found;
-			Buffer		buffer = ref->buffer;
-
-			hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-			Assert(found);
+			refcount_delete_item(PrivateRefCountHash, ref);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
 		}
@@ -441,10 +450,10 @@ CheckPrivateRefCountLeaks(void)
 	/* if necessary search the hash */
 	if (PrivateRefCountOverflowed)
 	{
-		HASH_SEQ_STATUS hstat;
+		refcount_iterator iter;
 
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+		refcount_start_iterate(PrivateRefCountHash, &iter);
+		while ((res = refcount_iterate(PrivateRefCountHash, &iter)) != NULL)
 		{
 			s = DebugPrintBufferRefcount(res->buffer);
 			elog(WARNING, "buffer refcount leak: %s", s);
@@ -495,7 +504,7 @@ InitPrivateRefCountIterator(void)
 
 	iter->array_index = 0;
 	iter->in_hash = false;
-	iter->hash_status = NULL;
+	iter->hash_iter_valid = false;
 	return iter;
 }
 
@@ -521,21 +530,20 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		iter->in_hash = true;
 		if (PrivateRefCountOverflowed > 0)
 		{
-			iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
-			hash_seq_init(iter->hash_status, PrivateRefCountHash);
+			refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
+			iter->hash_iter_valid = true;
 		}
 	}
 
-	if (iter->hash_status != NULL)
+	if (iter->hash_iter_valid)
 	{
 		PrivateRefCountEntry *res;
 
-		res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+		res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
 		if (res != NULL)
 			return res;
 
-		pfree(iter->hash_status);
-		iter->hash_status = NULL;
+		iter->hash_iter_valid = false;
 	}
 
 	return NULL;
@@ -547,11 +555,6 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 void
 FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 {
-	if (iter->hash_status != NULL)
-	{
-		hash_seq_term(iter->hash_status);
-		pfree(iter->hash_status);
-	}
 	pfree(iter);
 }
 
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 848719232a..3c03a7e9c9 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
 #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey))
 #endif
 
+/*
+ * Macros to check/set entry status. Users can override these to avoid
+ * needing a separate status field if their key type has an "invalid" value.
+ */
+#ifndef SH_ENTRY_EMPTY
+#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_EMPTY
+#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_IN_USE
+#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE)
+#endif
+
 /*
  * Wrap the following definitions in include guards, to avoid multiple
  * definition errors if this header is included more than once.  The rest of
@@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 		uint32		hash;
 		uint32		optimal;
 
-		if (oldentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(oldentry))
 		{
 			startelem = i;
 			break;
@@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	{
 		SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
 
-		if (oldentry->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(oldentry))
 		{
 			uint32		hash;
 			uint32		startelem2;
@@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 			{
 				newentry = &newdata[curelem];
 
-				if (newentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(newentry))
 				{
 					break;
 				}
@@ -653,14 +667,14 @@ restart:
 		SH_ELEMENT_TYPE *entry = &data[curelem];
 
 		/* any empty bucket can directly be used */
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			tb->members++;
 			entry->SH_KEY = key;
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -675,7 +689,7 @@ restart:
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
-			Assert(entry->status == SH_STATUS_IN_USE);
+			Assert(!SH_ENTRY_EMPTY(entry));
 			*found = true;
 			return entry;
 		}
@@ -699,7 +713,7 @@ restart:
 				emptyelem = SH_NEXT(tb, emptyelem, startelem);
 				emptyentry = &data[emptyelem];
 
-				if (emptyentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(emptyentry))
 				{
 					lastentry = emptyentry;
 					break;
@@ -748,7 +762,7 @@ restart:
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			return NULL;
 		}
 
-		Assert(entry->status == SH_STATUS_IN_USE);
+		Assert(!SH_ENTRY_EMPTY(entry));
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 			return entry;
@@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 			return false;
 
-		if (entry->status == SH_STATUS_IN_USE &&
+		if (!SH_ENTRY_EMPTY(entry) &&
 			SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
 			SH_ELEMENT_TYPE *lastentry = entry;
@@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				curelem = SH_NEXT(tb, curelem, startelem);
 				curentry = &tb->data[curelem];
 
-				if (curentry->status != SH_STATUS_IN_USE)
+				if (SH_ENTRY_EMPTY(curentry))
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				/* current is at optimal position, done */
 				if (curoptimal == curelem)
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		curelem = SH_NEXT(tb, curelem, startelem);
 		curentry = &tb->data[curelem];
 
-		if (curentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(curentry))
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		/* current is at optimal position, done */
 		if (curoptimal == curelem)
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[i];
 
-		if (entry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			startelem = i;
 			break;
@@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 
 		if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
 			iter->done = true;
-		if (elem->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(elem))
 		{
 			return elem;
 		}
@@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb)
 
 		elem = &tb->data[i];
 
-		if (elem->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(elem))
 			continue;
 
 		hash = SH_ENTRY_HASH(tb, elem);
@@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_STORE_HASH
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
+#undef SH_ENTRY_EMPTY
+#undef SH_MAKE_EMPTY
+#undef SH_MAKE_IN_USE
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
-- 
2.34.1



  [application/x-patch] v2-0001-Benchmark-buffer-pinning.patch (22.3K, 7-v2-0001-Benchmark-buffer-pinning.patch)
  download | inline diff:
From 36b69e2028c8febb35ca2c45327384251c44b775 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:05:50 +0000
Subject: [PATCH 1/5] Benchmark buffer pinning

---
 src/test/modules/test_buffer_pin/Makefile     |  18 +
 src/test/modules/test_buffer_pin/benchmark.py | 185 ++++++++++
 .../modules/test_buffer_pin/requirements.txt  |   9 +
 .../test_buffer_pin/test_buffer_pin--1.0.sql  |  66 ++++
 .../modules/test_buffer_pin/test_buffer_pin.c | 323 ++++++++++++++++++
 .../test_buffer_pin/test_buffer_pin.control   |   4 +
 6 files changed, 605 insertions(+)
 create mode 100644 src/test/modules/test_buffer_pin/Makefile
 create mode 100755 src/test/modules/test_buffer_pin/benchmark.py
 create mode 100644 src/test/modules/test_buffer_pin/requirements.txt
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.c
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.control

diff --git a/src/test/modules/test_buffer_pin/Makefile b/src/test/modules/test_buffer_pin/Makefile
new file mode 100644
index 0000000000..2707f84c5b
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/Makefile
@@ -0,0 +1,18 @@
+MODULE_big = test_buffer_pin
+OBJS = test_buffer_pin.o
+
+EXTENSION = test_buffer_pin
+DATA = test_buffer_pin--1.0.sql
+
+PGFILEDESC = "test_buffer_pin - buffer pinning benchmarks"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_buffer_pin
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_buffer_pin/benchmark.py b/src/test/modules/test_buffer_pin/benchmark.py
new file mode 100755
index 0000000000..004044bc5c
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/benchmark.py
@@ -0,0 +1,185 @@
+#!/usr/bin/env python3
+"""
+Buffer pinning benchmark - runs SQL benchmarks and saves results.
+
+Usage:
+    python3 benchmark.py [options]
+
+Examples:
+    python3 benchmark.py --port 5434 --name pg18-baseline --title "PostgreSQL 18 Baseline"
+    python3 benchmark.py --port 5434 --name patch-0005 --title "With Patch 0005"
+"""
+
+import argparse
+import numpy as np
+import pandas as pd
+from sqlalchemy import create_engine, text
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+
+def geomspace_int(start, stop, num):
+    """Generate geometrically spaced integers, pushing duplicates forward."""
+    raw = np.geomspace(start, stop, num)
+    result = []
+    for v in raw:
+        candidate = max(int(round(v)), result[-1] + 1 if result else 1)
+        result.append(candidate)
+    return result
+
+
+def ensure_test_table(conn, table_name, min_blocks):
+    """Create a table with enough blocks for the benchmark."""
+    rows_needed = min_blocks * 226 + 10000
+    print(f"Creating table {table_name} with ~{min_blocks} blocks...")
+    conn.execute(text(f"DROP TABLE IF EXISTS {table_name}"))
+    conn.execute(text(f"CREATE UNLOGGED TABLE {table_name} AS SELECT generate_series(1, {rows_needed}) AS id"))
+    result = conn.execute(text(f"SELECT pg_relation_size('{table_name}') / 8192 as blocks")).fetchone()
+    assert result[0] >= min_blocks, f"Table {table_name} has {result[0]} blocks, expected at least {min_blocks}"
+
+def run_benchmark(conn, func_template, buffer_counts):
+    """Run a benchmark function across all buffer counts and patterns in one query."""
+    array_str = str(buffer_counts)
+    sql = text(f"""
+        SELECT
+            CASE WHEN r.random THEN 'random' ELSE 'sequential' END as pattern,
+            b.num_buffers,
+            percentile_cont(0.5) WITHIN GROUP (ORDER BY bench.per_op_ns) as median_ns
+        FROM
+            unnest(ARRAY[false, true]) AS r(random),
+            unnest(ARRAY[{array_str}]) AS b(num_buffers),
+            LATERAL {func_template} AS bench
+        WHERE bench.iteration > 0
+        GROUP BY r.random, b.num_buffers
+        ORDER BY r.random, b.num_buffers
+    """)
+    return [dict(row._mapping) for row in conn.execute(sql)]
+
+
+def save_results(results_dir, name, results_dict, title):
+    """Save benchmark results to CSV and generate SVG plot."""
+    results_dir = Path(results_dir)
+    results_dir.mkdir(parents=True, exist_ok=True)
+
+    all_data = []
+    for bench_name, data in results_dict.items():
+        df = pd.DataFrame(data)
+        df['benchmark'] = bench_name
+        all_data.append(df)
+
+    combined_df = pd.concat(all_data, ignore_index=True)
+
+    csv_path = results_dir / f"{name}.csv"
+    combined_df.to_csv(csv_path, index=False)
+    print(f"Saved CSV: {csv_path}")
+    return combined_df
+
+
+def plot_results(results_dir, name, results_dict, title):
+    """Generate SVG plot from benchmark results."""
+    results_dir = Path(results_dir)
+
+    fig, ax = plt.subplots(figsize=(10, 6))
+
+    benchmarks = [
+        ('read', 'blue', 'o', 'ReadBuffer/ReleaseBuffer'),
+        ('pinning', 'green', 's', 'IncrBufferRefCount/ReleaseBuffer'),
+        ('locking', 'purple', 'd', 'LockBuffer/UnlockBuffer'),
+        ('resowner', 'red', '^', 'ResourceOwner Remember/Forget'),
+    ]
+
+    for bench_name, color, marker, label in benchmarks:
+        if bench_name not in results_dict:
+            continue
+        df = pd.DataFrame(results_dict[bench_name])
+
+        seq_data = df[df['pattern'] == 'sequential']
+        ax.plot(seq_data['num_buffers'], seq_data['median_ns'],
+                color=color, linestyle='-', marker=marker,
+                linewidth=2, markersize=6, label=f'{label} (seq)')
+
+        rand_data = df[df['pattern'] == 'random']
+        ax.plot(rand_data['num_buffers'], rand_data['median_ns'],
+                color=color, linestyle='--', marker=marker,
+                linewidth=2, markersize=6, label=f'{label} (rand)')
+
+    ax.set_xlabel('Number of Buffers (sliding window)')
+    ax.set_ylabel('Time per operation (ns)')
+    ax.set_title(title)
+    ax.set_ylim(0, None)
+    ax.grid(True, alpha=0.3)
+    ax.set_xscale('log', base=2)
+    ax.legend(fontsize=8, loc='upper left')
+
+    plt.tight_layout()
+    svg_path = results_dir / f"{name}.svg"
+    plt.savefig(svg_path, format='svg')
+    plt.close()
+    print(f"Saved SVG: {svg_path}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Buffer pinning benchmark')
+    parser.add_argument('--iterations', '-n', type=int, default=10000,
+                        help='Number of operations per sample (default: 10000)')
+    parser.add_argument('--samples', type=int, default=50,
+                        help='Number of samples per data point (default: 50)')
+    parser.add_argument('--port', type=int, default=5432,
+                        help='PostgreSQL port (default: 5432)')
+    parser.add_argument('--points', type=int, default=50,
+                        help='Number of data points (default: 200)')
+    parser.add_argument('--max-dist', type=int, default=500,
+                        help='Maximum buffer count to test (default: 512)')
+    parser.add_argument('--name', default='benchmark',
+                        help='Name for output files (default: benchmark)')
+    parser.add_argument('--title', default='PostgreSQL Buffer Pinning Performance',
+                        help='Title for the plot')
+    parser.add_argument('--results-dir', default=None,
+                        help='Directory for results (default: ./results)')
+    parser.add_argument('--table', default='bench_large',
+                        help='Table name to use for benchmarks (default: bench_large)')
+    args = parser.parse_args()
+
+    if args.results_dir is None:
+        args.results_dir = Path(__file__).parent / 'results'
+
+    engine = create_engine(f'postgresql://localhost:{args.port}/postgres')
+    buffer_counts = geomspace_int(1, args.max_dist, args.points)
+
+    min_blocks_needed = args.max_dist + 100
+
+    with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
+        conn.execute(text("CREATE EXTENSION IF NOT EXISTS test_buffer_pin"))
+        ensure_test_table(conn, args.table, min_blocks_needed)
+
+
+        n, s = args.iterations, args.samples
+        print(f"\nRunning benchmarks: {n} iterations, {s} samples, max {args.max_dist} buffers")
+
+        results = {}
+
+        print("  Running: bench_pinning...")
+        results['pinning'] = run_benchmark(conn,
+            f"bench_pinning('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_prefetch_pipeline...")
+        results['read'] = run_benchmark(conn,
+            f"bench_prefetch_pipeline('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_resowner...")
+        results['resowner'] = run_benchmark(conn,
+            f"bench_resowner(b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+    save_results(args.results_dir, args.name, results, args.title)
+    plot_results(args.results_dir, args.name, results, args.title)
+
+    print(f"\nDone! Results saved to {args.results_dir}/{args.name}.*")
+
+
+if __name__ == '__main__':
+    main()
diff --git a/src/test/modules/test_buffer_pin/requirements.txt b/src/test/modules/test_buffer_pin/requirements.txt
new file mode 100644
index 0000000000..e73ab56927
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/requirements.txt
@@ -0,0 +1,9 @@
+# Requirements for benchmark.py
+# pip install -r requirements.txt
+# not enforcing versions, as it might simply
+# work with your installed versions
+numpy            # tested with 1.24+
+pandas           # tested with 2.0+
+sqlalchemy       # tested with 2.0+
+matplotlib       # tested with 3.7+
+psycopg2-binary  # tested with 2.9+
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
new file mode 100644
index 0000000000..4c43c41496
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
@@ -0,0 +1,66 @@
+/* test_buffer_pin--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_buffer_pin" to load this file. \quit
+
+CREATE FUNCTION bench_prefetch_pipeline(
+    relname text,
+    prefetch_dist int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_prefetch_pipeline'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_prefetch_pipeline IS
+'Benchmark buffer pinning with sliding window prefetch simulation.
+Arguments:
+  relname        - name of the relation to use
+  prefetch_dist  - number of buffers to keep pinned (sliding window size)
+  num_ops        - number of pin/unpin operations to perform
+  iterations     - number of times to repeat the benchmark
+  random_access  - if true, access blocks randomly; if false, sequentially
+';
+
+CREATE FUNCTION bench_pinning(
+    relname text,
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_pinning'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_pinning IS
+'Benchmark pure pin/unpin operations without buffer lookup.
+Uses IncrBufferRefCount/ReleaseBuffer on pre-pinned buffers to isolate
+refcount tracking overhead from buffer table lookups and I/O.
+Arguments:
+  relname       - name of the relation to use for pinning buffers
+  num_buffers   - number of buffers to keep as base pins
+  num_ops       - number of pin/unpin pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access buffers randomly; if false, sequentially';
+
+CREATE FUNCTION bench_resowner(
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_resowner'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_resowner IS
+'Benchmark ResourceOwner remember/forget operations only.
+Uses fake resource values - no actual resources are tracked.
+Arguments:
+  num_buffers   - number of fake resources to track
+  num_ops       - number of remember/forget pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access resources randomly; if false, sequentially';
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.c b/src/test/modules/test_buffer_pin/test_buffer_pin.c
new file mode 100644
index 0000000000..7b6c91678f
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.c
@@ -0,0 +1,323 @@
+/*
+ * test_buffer_pin.c - Buffer pinning benchmark
+ */
+
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/namespace.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+#include "utils/resowner.h"
+#include "utils/varlena.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(bench_prefetch_pipeline);
+PG_FUNCTION_INFO_V1(bench_pinning);
+PG_FUNCTION_INFO_V1(bench_resowner);
+
+/* Custom ResourceOwnerDesc for benchmark - does nothing on release */
+static void bench_release_resource(Datum res) { /* no-op */ }
+
+static const ResourceOwnerDesc bench_resowner_desc = {
+	.name = "BenchmarkResource",
+	.release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+	.release_priority = RELEASE_PRIO_FIRST,
+	.ReleaseResource = bench_release_resource,
+	.DebugPrint = NULL,
+};
+
+/*
+ * Generate an access sequence for benchmarking.
+ * If random_access is true, uses Fisher-Yates shuffle.
+ * Otherwise, generates sequential pattern modulo num_items.
+ */
+static void
+generate_access_sequence(int *sequence, int num_operations, int num_items, bool random_access)
+{
+	for (int i = 0; i < num_operations; i++)
+		sequence[i] = i % num_items;
+
+	if (random_access)
+	{
+		for (int i = num_operations - 1; i > 0; i--)
+		{
+			int j = pg_prng_uint64_range(&pg_global_prng_state, 0, i);
+			int tmp = sequence[i];
+			sequence[i] = sequence[j];
+			sequence[j] = tmp;
+		}
+	}
+}
+
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Warms up the cache by reading all blocks in the relation.
+ * Precompute a block sequence in which the buffers will be read.
+ * Then scan times repeatedly a scan following the block sequence.
+ * with a fixed distance of `num_buffers` (plus ramp up and ramp down)
+*/
+Datum
+bench_prefetch_pipeline(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *pipeline;
+	int		   *block_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (nblocks == 0)
+		ereport(ERROR, (errmsg("relation has no blocks")));
+
+	pipeline = palloc0(num_buffers * sizeof(Buffer));
+	block_sequence = palloc(num_operations * sizeof(int));
+
+	/* Warm up */
+	for (int i = 0; i < num_operations && i < (int) nblocks; i++)
+	{
+		Buffer buf = ReadBuffer(rel, i);
+		ReleaseBuffer(buf);
+	}
+	generate_access_sequence(block_sequence, num_operations, nblocks, random_access);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+			if (i < num_operations)
+				pipeline[i % num_buffers] = ReadBuffer(rel, block_sequence[i]);
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(pipeline);
+	pfree(block_sequence);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Pre-reads buffers into cache, then times IncrBufferRefCount/ReleaseBuffer
+ * cycles in a pipelined fashion. This is intended to separate the
+ * pin count dependent part of the ReadBuffer/ReleaseBuffer operations
+ */
+Datum
+bench_pinning(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *base_buffers;
+	Buffer	   *pipeline;
+	int		   *access_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if ((BlockNumber) num_buffers > nblocks)
+		ereport(ERROR, (errmsg("not enough blocks in relation")));
+
+	base_buffers = palloc(num_buffers * sizeof(Buffer));
+	pipeline = palloc(num_buffers * sizeof(Buffer));
+	access_sequence = palloc(num_operations * sizeof(int));
+
+	generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+	/* Pin the buffers as base pins (keeps them in cache) */
+	for (int i = 0; i < num_buffers; i++)
+		base_buffers[i] = ReadBuffer(rel, i);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+			if (i < num_operations)
+			{
+				Buffer buf = base_buffers[access_sequence[i]];
+				IncrBufferRefCount(buf);
+				pipeline[i % num_buffers] = buf;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/* Release base pins */
+	for (int i = 0; i < num_buffers; i++)
+		ReleaseBuffer(base_buffers[i]);
+
+	pfree(access_sequence);
+	pfree(pipeline);
+	pfree(base_buffers);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_resowner - benchmark ResourceOwner remember/forget operations
+ *
+ * Uses fake resource values to test pure ResourceOwner tracking overhead
+ * without any actual resource operations. Uses same pipelined structure.
+ */
+Datum
+bench_resowner(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	int			num_buffers = PG_GETARG_INT32(0);
+	int			num_operations = PG_GETARG_INT32(1);
+	int			iterations = PG_GETARG_INT32(2);
+	bool		random_access = PG_GETARG_BOOL(3);
+
+	Datum	   *resources;
+	Datum	   *pipeline;
+	int		   *access_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	resources = palloc(num_buffers * sizeof(Datum));
+	pipeline = palloc(num_buffers * sizeof(Datum));
+	access_sequence = palloc(num_operations * sizeof(int));
+
+	/* Use fake resource values (pointers to array elements for uniqueness) */
+	for (int i = 0; i < num_buffers; i++)
+		resources[i] = PointerGetDatum(&resources[i]);
+
+	generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ResourceOwnerForget(CurrentResourceOwner,
+									pipeline[(i - num_buffers) % num_buffers],
+									&bench_resowner_desc);
+			if (i < num_operations)
+			{
+				Datum		res = resources[access_sequence[i]];
+				ResourceOwnerEnlarge(CurrentResourceOwner);
+				ResourceOwnerRemember(CurrentResourceOwner, res, &bench_resowner_desc);
+				pipeline[i % num_buffers] = res;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(access_sequence);
+	pfree(pipeline);
+	pfree(resources);
+
+	return (Datum) 0;
+}
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.control b/src/test/modules/test_buffer_pin/test_buffer_pin.control
new file mode 100644
index 0000000000..f065941712
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.control
@@ -0,0 +1,4 @@
+comment = 'test_buffer_pin - buffer pinning benchmarks'
+default_version = '1.0'
+module_pathname = '$libdir/test_buffer_pin'
+relocatable = true
-- 
2.34.1



  [application/x-patch] v2-0005-Array-direct-mapping.patch (15.3K, 8-v2-0005-Array-direct-mapping.patch)
  download | inline diff:
From 33138ffb0d834bc452e9d2d1819f21871c64210f Mon Sep 17 00:00:00 2001
From: bob <[email protected]>
Date: Wed, 11 Mar 2026 12:08:04 +0000
Subject: [PATCH 5/5] Array direct mapping

Instead of searching all array entries, each buffer is mapped to exactly
one slot at index (buffer % REFCOUNT_ARRAY_ENTRIES). Collisions are
resolved by evicting to the hash table.

This simplifies lookup to O(1) for the array check and should improve
performance for num_buffers > 8 while maintaining performance for
small numbers of buffers.
---
 src/backend/storage/buffer/bufmgr_refcnt.c | 307 +++++++++------------
 1 file changed, 132 insertions(+), 175 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
index bf84e9f62e..726697ebf7 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.c
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -12,26 +12,15 @@
  * than once by an individual backend. This mechanism is also used to track
  * whether this backend has a buffer locked, and, if so, in what mode.
  *
- * To avoid - as we used to - requiring an array with NBuffers entries to keep
- * track of local buffers, we use a small sequentially searched array
- * (PrivateRefCountArrayKeys, with the corresponding data stored in
- * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
- * keep track of backend local pins.
+ * To avoid requiring an array with NBuffers entries, we use a dynamically
+ * sized direct-mapped array (PrivateRefCountArray) and an overflow hash table
+ * (PrivateRefCountHash). Each buffer maps to exactly one array slot at
+ * index (buffer & mask). When a collision occurs (two different buffers
+ * map to the same slot), the existing entry is evicted to the hash table.
+ * The array grows automatically when occupation exceeds a specified threshold.
  *
- * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
- * refcounts are kept track of in the array; after that, new array entries
- * displace old ones into the hash table. That way a frequently used entry
- * can't get "stuck" in the hashtable while infrequent ones clog the array.
- *
- * This was initially designed trying to optimize for the case where the
- * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES.
- * However this might not be the case with the introduction of prefetching.
- *
- * To enter a buffer into the refcount tracking mechanism first reserve a free
- * entry using ReservePrivateRefCountEntry() and then later, if necessary,
- * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
- * memory allocations in NewPrivateRefCountEntry() which can be important
- * because in some scenarios it's called with a spinlock held...
+ * This design provides O(1) lookup for the common case where there are no
+ * collisions, while gracefully handling overflow via the hash table.
  *
  * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -86,33 +77,40 @@ struct PrivateRefCountIterator
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
-/* Private refcount array and keys */
-#define REFCOUNT_ARRAY_ENTRIES 8
-static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
-static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
+/*
+ * Private refcount array - direct-mapped by (buffer & mask).
+ * Each buffer maps to exactly one slot. Collisions evict to hash table.
+ * 
+ * Consider an array with N slots, after storing n random buffers
+ *  
+ *  - The probability of a single buffer hitting a given slot is 1/N
+ *  - The probability of a single buffer not hitting a given slot is 1-1/N
+ *  - The probability of n buffers not hitting a given slot is (1-1/N)^n
+ *  - the expected number of vacant slots is N * (1-1/N)^n ~ N * e^(-n/N)
+ * 
+ * The last approximation should be not too bad for large number,
+ * and that gives e.g. 63% occupation for n = N, 86% ocupation for n = 2*N
+ */
+#define REFCOUNT_ARRAY_INITIAL_SIZE		8
+#define REFCOUNT_ARRAY_MAX_OCCUPATION	0.86
+#define REFCOUNT_ARRAY_MAX_SIZE			(1 << 20)	/* 1M entries max */
+
+static struct PrivateRefCountEntry *PrivateRefCountArray = NULL;
+static int32 PrivateRefCountArrayMask = 0;		/* (size - 1) for fast modulo */
+static int32 PrivateRefCountArrayUsed = 0;		/* entries in the array */
+static int32 PrivateRefCountArrayTolerated = 0;	/* grow when Used exceeds this */
 
-/* Overflow hash table for when array is full */
+/* Overflow hash table for collisions */
 static refcount_hash *PrivateRefCountHash = NULL;
 
-/* Count of entries that have overflowed into the hash table */
+/* Count of entries in the hash table */
 static int32 PrivateRefCountOverflowed = 0;
 
-/* Clock hand for selecting victim when array is full */
-static uint32 PrivateRefCountClock = 0;
-
-/* Reserved slot index, or -1 if none reserved */
-static int	ReservedRefCountSlot = -1;
-
-/* Cache for last accessed entry */
-static int	PrivateRefCountEntryLast = -1;
-
 /* Advisory limit on the number of pins each backend should hold */
 static uint32 MaxProportionalPins = 0;
 
-/* Forward declarations */
-static void ReservePrivateRefCountEntry(Buffer buffer);
-static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
+/* Forward declaration */
+static void GrowPrivateRefCountArray(void);
 
 /*
  * Initialize private refcount tracking for this backend.
@@ -126,174 +124,135 @@ InitPrivateRefCount(void)
 	 * That's very pessimistic, but outside toy-sized shared_buffers it should
 	 * allow plenty of pins.  LimitAdditionalPins() and
 	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	*/
+	 */
 	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
-	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
+	/* Initialize the direct-mapped array */
+	PrivateRefCountArrayMask = REFCOUNT_ARRAY_INITIAL_SIZE - 1;
+	PrivateRefCountArrayUsed = 0;
+	PrivateRefCountArrayTolerated = (int32)(REFCOUNT_ARRAY_INITIAL_SIZE * REFCOUNT_ARRAY_MAX_OCCUPATION);
+	PrivateRefCountArray = MemoryContextAllocZero(TopMemoryContext,
+						REFCOUNT_ARRAY_INITIAL_SIZE * sizeof(PrivateRefCountEntry));
+
+	PrivateRefCountHash = refcount_create(TopMemoryContext, 64, NULL);
 }
 
 /*
- * Ensure that the PrivateRefCountArray has sufficient space to store one more
- * entry.
+ * Grow the private refcount array when usage exceeds tolerated occupation.
+ * Doubles the size and rehashes all existing entries.
  */
 static void
-ReservePrivateRefCountEntry(Buffer buffer)
+GrowPrivateRefCountArray(void)
 {
-	/* Already reserved (or freed), nothing to do */
-	if (ReservedRefCountSlot != -1)
+	PrivateRefCountEntry *old_array = PrivateRefCountArray;
+	int32		old_size = PrivateRefCountArrayMask + 1;
+	int32		new_size = old_size * 2;
+	int32		new_mask = new_size - 1;
+	int32		i;
+
+	/* Don't grow beyond maximum */
+	if (new_size > REFCOUNT_ARRAY_MAX_SIZE)
 		return;
 
-	/*
-	 * First search for a free entry in the array, that'll be sufficient in
-	 * the majority of cases.
-	 */
-	{
-		int			i;
-
-		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-		{
-			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
-			{
-				ReservedRefCountSlot = i;
-				return;
-			}
-		}
-	}
-
-	/*
-	 * No luck. All array entries are full. Move one array entry into the hash
-	 * table.
-	 */
-	{
-		int			victim_slot;
-		PrivateRefCountEntry *victim_entry;
-		PrivateRefCountEntry *hashent;
-		bool		found;
-
-		/* select victim slot */
-		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
-		victim_entry = &PrivateRefCountArray[victim_slot];
-		ReservedRefCountSlot = victim_slot;
-
-		/* Better be used, otherwise we shouldn't get here. */
-		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
-		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
-		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
-
-		/* enter victim array entry into hashtable */
-		hashent = refcount_insert(PrivateRefCountHash,
-								  PrivateRefCountArrayKeys[victim_slot],
-								  &found);
-		Assert(!found);
-		hashent->data = victim_entry->data;
-
-		/* clear the now free array slot */
-		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
-		victim_entry->buffer = InvalidBuffer;
-		victim_entry->data = 0;
-
-		PrivateRefCountOverflowed++;
-	}
-}
-
-/*
- * Create a new refcount entry for the given buffer.
- */
-static PrivateRefCountEntry *
-NewPrivateRefCountEntry(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-
-	/* only allowed to be called when a reservation has been made */
-	Assert(ReservedRefCountSlot != -1);
-
-	/* use up the reserved entry */
-	res = &PrivateRefCountArray[ReservedRefCountSlot];
-
-	/* and fill it */
-	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-	res->buffer = buffer;
-	res->data = 0;
-
-	/* update cache for the next lookup */
-	PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-	ReservedRefCountSlot = -1;
-
-	return res;
-}
-
-/*
- * Slow-path for GetSharedBufferEntry().
- */
-static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-	int			i;
+	/* Allocate new array and update metadata */
+	PrivateRefCountArray = MemoryContextAllocZero(TopMemoryContext,
+						new_size * sizeof(PrivateRefCountEntry));
+	PrivateRefCountArrayMask = new_mask;
+	PrivateRefCountArrayTolerated = (int32)(new_size * REFCOUNT_ARRAY_MAX_OCCUPATION);
 
 	/*
-	 * First search for references in the array.
+	 * Rehash entries from old array. When doubling, each old slot maps to
+	 * two possible new slots (slot or slot + old_size), so entries from
+	 * different old slots cannot collide. PrivateRefCountArrayUsed stays same.
 	 */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	for (i = 0; i < old_size; i++)
 	{
-		if (PrivateRefCountArrayKeys[i] == buffer)
+		if (old_array[i].buffer != InvalidBuffer)
 		{
-			PrivateRefCountEntryLast = i;
-			return &PrivateRefCountArray[i];
+			int			new_slot = old_array[i].buffer & new_mask;
+
+			PrivateRefCountArray[new_slot] = old_array[i];
 		}
 	}
 
-	/*
-	 * Only look up the buffer in the hashtable if we've previously overflowed.
-	 */
-	if (PrivateRefCountOverflowed == 0)
-		return NULL;
-
-	res = refcount_lookup(PrivateRefCountHash, buffer);
-	return res;
+	pfree(old_array);
 }
 
 /*
  * Return the PrivateRefCount entry for the passed buffer.
  * Returns NULL if the buffer is not currently pinned.
+ *
+ * With direct-mapped array, lookup is O(1): check the slot at
+ * (buffer & mask), then check hash table if needed.
  */
 static PrivateRefCountEntry *
 GetSharedBufferEntry(Buffer buffer)
 {
+	int			slot = buffer & PrivateRefCountArrayMask;
+	PrivateRefCountEntry *entry = &PrivateRefCountArray[slot];
+
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	/* Fast path: check one-entry cache */
-	if (likely(PrivateRefCountEntryLast != -1) &&
-		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
-	{
-		return &PrivateRefCountArray[PrivateRefCountEntryLast];
-	}
+	/* Check the direct-mapped slot */
+	if (entry->buffer == buffer)
+		return entry;
 
-	return GetPrivateRefCountEntrySlow(buffer);
+	/* Check hash table if there are overflowed entries */
+	if (PrivateRefCountOverflowed > 0)
+		return refcount_lookup(PrivateRefCountHash, buffer);
+
+	return NULL;
 }
 
 /*
  * Create a new refcount entry for a buffer that is known to not be pinned.
- * This is a fast path that skips the cache/hash lookup.
- * Returns the new entry pointer with refcount already incremented.
+ * Returns the new entry pointer with refcount set to 1.
+ *
+ * If the direct-mapped slot contains a different buffer, it is evicted
+ * to the hash table first. The array grows when overflow exceeds 20% of size.
  */
 static PrivateRefCountEntry *
 SharedBufferCreateRef(Buffer buffer)
 {
-	PrivateRefCountEntry *ref;
+	int			slot;
+	PrivateRefCountEntry *entry;
+	bool		was_empty;
 
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	ReservePrivateRefCountEntry(buffer);
-	ref = NewPrivateRefCountEntry(buffer);
-	ref->data = ONE_PRIVATE_REFERENCE;
+	/* Grow when array usage exceeds tolerated occupation */
+	if (PrivateRefCountArrayUsed > PrivateRefCountArrayTolerated)
+		GrowPrivateRefCountArray();
+
+	slot = buffer & PrivateRefCountArrayMask;
+	entry = &PrivateRefCountArray[slot];
+	was_empty = (entry->buffer == InvalidBuffer);
+
+	/* If slot contains a different buffer, evict it to hash table */
+	if (!was_empty && entry->buffer != buffer)
+	{
+		PrivateRefCountEntry *hashent;
+		bool		found;
+
+		hashent = refcount_insert(PrivateRefCountHash, entry->buffer, &found);
+		Assert(!found);
+		hashent->data = entry->data;
+
+		PrivateRefCountOverflowed++;
+		was_empty = true;
+	}
+
+	/* Track array usage */
+	if (was_empty)
+		PrivateRefCountArrayUsed++;
+
+	/* Use the slot for our buffer */
+	entry->buffer = buffer;
+	entry->data = ONE_PRIVATE_REFERENCE;
 
-	return ref;
+	return entry;
 }
 
 /*
@@ -327,15 +286,15 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
 
 		if (ref >= &PrivateRefCountArray[0] &&
-			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
+			ref < &PrivateRefCountArray[PrivateRefCountArrayMask + 1])
 		{
+			/* Array entry - mark as empty */
 			ref->buffer = InvalidBuffer;
-			PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
-			ReservedRefCountSlot = ref - PrivateRefCountArray;
+			PrivateRefCountArrayUsed--;
 		}
 		else
 		{
-			/* could make slightly more efficient by using the pointer */
+			/* Hash table entry */
 			refcount_delete(PrivateRefCountHash, ref->buffer);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
@@ -392,9 +351,9 @@ CheckPrivateRefCountLeaks(void)
 	char	   *s;
 
 	/* check the array */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	for (i = 0; i <= PrivateRefCountArrayMask; i++)
 	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
+		if (PrivateRefCountArray[i].buffer != InvalidBuffer)
 		{
 			res = &PrivateRefCountArray[i];
 
@@ -475,11 +434,11 @@ PrivateRefCountEntry *
 GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 {
 	/* First iterate through the array */
-	while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
+	while (!iter->in_hash && iter->array_index <= PrivateRefCountArrayMask)
 	{
 		int			idx = iter->array_index++;
 
-		if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
+		if (PrivateRefCountArray[idx].buffer != InvalidBuffer)
 			return &PrivateRefCountArray[idx];
 	}
 
@@ -547,11 +506,9 @@ FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 	 uint32		estimated_pins_held;
 
 	 /*
-	  * We get the number of "overflowed" pins for free, but don't know the
-	  * number of pins in PrivateRefCountArray.  The cost of calculating that
-	  * exactly doesn't seem worth it, so just assume the max.
+	  * We track array usage, so we can get an accurate count.
 	  */
-	 estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+	 estimated_pins_held = PrivateRefCountOverflowed + PrivateRefCountArrayUsed;
 
 	 /* Is this backend already holding more than its fair share? */
 	 if (estimated_pins_held > MaxProportionalPins)
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Addressing buffer private reference count scalability issue
@ 2026-03-11 18:55  Andres Freund <[email protected]>
  parent: Alexandre Felipe <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Andres Freund @ 2026-03-11 18:55 UTC (permalink / raw)
  To: Alexandre Felipe <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Hi,

On 2026-03-11 18:03:34 +0000, Alexandre Felipe wrote:
> Hi Hackers,
> 
> Please find attached version 2 of this patch and the results of the
> ReadBuffer/ReleaseBuffer benchmark
> 
> For those who prefer plain text here goes an excerpt
>                         baseline   patch
> num_buffers pattern
> 1           random        194.76  183.50
>             sequential    183.75  168.02
> 2           random        204.74  187.02
>             sequential    195.27  169.58
> 3           random        222.62  192.57
>             sequential    212.81  169.55
> 7           random        233.96  219.64
>             sequential    221.94  171.94
> 11          random        365.50  247.96
>             sequential    318.24  183.96
> 19          random        390.58  267.96
>             sequential    369.51  195.35
> 100         random        424.03  286.33
>             sequential    420.74  280.23
> 988         random        445.75  301.37
>             sequential    453.61  313.30
> 10000       random        587.13  418.91
>             sequential    497.92  394.10

Nice numbers!

It'd be good to also evaluate in the context of queries, as such a focused
microbenchmark will have much higher L1/L2 cache hit ratios than workloads
that actually look at data.



> > > Here I assume that the buffer buffer sequence is independent enough from
> > the
> > > array size, so I use the buffer as the hash key directly, omitting a hash
> > > function call.
> >
> > I doubt that that's good enough. The hash table code relies on the bits
> > being
> > well mixed, but you won't get that with buffer ids.
> >
> 
> Introduced murmurhash32, it is noticeably worse for sequential access
> (by buffer index), but has some mixing.

It's not surprising it's worse, I just don't see how we could get away without
some mixing.  It's not at all crazy to have accesss patterns that just differ
in the higher bits of the buffer number (e.g. a nestloop join where the inner
side always needs a certain number of buffers). We need to be somewhat
resistant to that causing a lot of collisions (which will trigger a hashtable
growth cycle, without that necessarily fixing the issue!).

> 
> > I'm rather sceptical that the overhead of needing to shift and mask is
> > worth
> > it. I suspect we'll also want to add more state for each entry (e.g. I
> > think
> > it may be worth getting rid of the io-in-progress resowner).
> >
> 
> io-in-progress resowner name suggests that it runs only for reads...
> so maybe it could be taken out of the way of buffer hits?

It's not used for hits.

But it's used for both reads and writes.


> 
> > >    REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
> > >    lookup, it is worth trying to remove it completely and keep only the
> > hash.
> > >    For small values we would be trading a few branches for a buffer %
> > SIZE,
> > >    for the use case of prefetch where pin/unpin in a FIFO fashion, it
> > will
> > >    save an 8-entry array lookup, and some extra data moves.
> >
> > I doubt that that's ok, in the vast majority of access we will have 0-2
> > buffers pinned. And even when we have pinned more buffers, it's
> > *exceedingly*
> > common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin),
> > adding a few cycles to each of those repeated accesses will quickly show
> > up.
> >
> 
> I brought back the array, but I eliminated the linear search.

Why? In my benchmarks allowing vectorization helped a decent amount in real
queries, because it does away with all the branch misses.


> 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
> 
> 2a. the dynamic array case
> REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
> will grow the array when it reaches a certain level of occupation.
> I have set the default occupation level to 86% so that, if enabled, for a
> random input it will grow when we have about 2*size pins in total.
> If we find a sequential pattern then it will grow without growing the hash
> table.
> For the array lookup I don't use a hash, so for small number of pins
> it will be very fast.

I doubt it makes sense to basically have two levels of hash tables.


> 2b. the static case
> REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
> will use a static array, just as we had before and will not perform the
> linear
> search. It still has to read the size and do mask input.
>
> I tested the 4 variations and the winner is with the static array without
> the cache for the last entry.
> I increased the array size from 8 to 32, since you suggested before that
> that this could help. At that point it would have the tradeoff of a longer
> linear search, so it may help even more now.

Does your benchmark actually test repeated accesses to the same refcount? As
mentioned, those are very very common (access sequences like Pin, (Lock,
Unlock)+, Unpin are extremely common).



> From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Wed, 11 Mar 2026 12:05:50 +0000
> Subject: [PATCH 2/5] Refactoring reference counting
> 
> This moves the private reference count logic out of
> backend/storage/buffer/buffmgr.c that is still 8000+
> lines long. Preparing for the changes to come.


I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
that does not *at all* communicate that this is backend private state.

I'd strongly advise separating moving code from a large scale rename.  I
certainly won't waste time trying to see the difference between what was just
moved and what was changed at the same time.


> diff --git a/.gitignore b/.gitignore
> index 4e911395fe..fddb7f861d 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -43,3 +43,5 @@ lib*.pc
>  /Release/
>  /tmp_install/
>  /portlock/
> +
> +.*
> \ No newline at end of file

What? Certainly not.



> From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Wed, 11 Mar 2026 12:07:54 +0000
> Subject: [PATCH 3/5] Using simplehash
> 
> This patch replaces the HTAB implementation with a simplehash
> as suggested by Andres Freund [1]. The simplehash implements a templated
> open addressing hash, which have entry size information at compile time.
> 
> The access strategy of the simplehash is very close to the plain array
> that was seen to be very efficient compared to the HTAB implementation,
> with the additional advantage of using the key value to initialize the
> search closer to where the key actually is, instead of always scanning
> the entire array.
> 
> Adds one macro to the simplehash implementation enable overriding the
> empty position detection, to enable using buffer == InvalidBuffer
> instead of requiring a separate status field.
> 
> [1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5...
> 
> ---
>  src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++-----------
>  src/include/lib/simplehash.h               | 59 +++++++++-----
>  2 files changed, 84 insertions(+), 64 deletions(-)

I don't think it's a good idea to introduce new simplehash infrastructure as
part of this larger change.  You also haven't documented the new stuff.



> From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe <[email protected]>
> Date: Wed, 11 Mar 2026 12:08:03 +0000
> Subject: [PATCH 4/5] Compact PrivateRefCountEntry
> 
> The previous implementation used an 8-bytes (64-bit) entry to store
> a uint32 count and an uint32 lock mode. That is fine when we store
> the data separate from the key (buffer). But in the simplehash
> {key, value} are stored together, so each entry is 12-bytes.
> This is somewhat awkward as we have to either pad the entry to 16-bytes,
> or the access will be an alternating aligned/misaligned addreses.
> 
> Lock can assume only 4 values, and 2^30 is a decent limit for the
> number of pins on a single buffer. So this change is packing the
> {count[31:2], lock[1:0]} into a single uint32.
> 
> Incrementing/decrementing the count continue the same, just using
> 4 instead of 1, lock mode access will require one or two additional
> bitwise operations. The exact count requires one shift, and is used
> only for debugging. A special function is provided to check whether
> count == 1.

Have you actually evaluated the benefit from this? Pretty sceptical it's worth
it.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Addressing buffer private reference count scalability issue
@ 2026-03-11 22:41  Alexandre Felipe <[email protected]>
  parent: Andres Freund <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Alexandre Felipe @ 2026-03-11 22:41 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, Mar 11, 2026 at 6:55 PM Andres Freund <[email protected]> wrote:

> Nice numbers!
>
> It'd be good to also evaluate in the context of queries, as such a focused
> microbenchmark will have much higher L1/L2 cache hit ratios than workloads
> that actually look at data
>

I ran with the query you first used to illustrate the issue, and it was
slower.
Went back and tried a strict copy of the code and it was 3% slower.
passing CFLAGS += -falign-functions=64 this shows no degradation.
Maybe LTO is what we need here?

It's not surprising it's worse, I just don't see how we could get away
> without
> some mixing.

I will keep that for the future.

> It's not at all crazy to have accesss patterns that just differ
> in the higher bits of the buffer number (e.g. a nestloop join where the
> inner
> side always needs a certain number of buffers). We need to be somewhat
> resistant to that causing a lot of collisions (which will trigger a
> hashtable
> growth cycle, without that necessarily fixing the issue!).
>

Yes, I understand that possibility.

> > I brought back the array, but I eliminated the linear search.
>
> Why? In my benchmarks allowing vectorization helped a decent amount in real
> queries, because it does away with all the branch misses.

I basically treated the array as a hash table with a weak hash function and
delegated collisions to simplehash.
In the worse case would do a simple array[buffer & mask], never fails with
one branch, and would fail in ~3% of the cases with two branches.
And would eliminate the branches necessary to check the 1-entry cache.

But notice that this was the last patch, if you like everything except that
it is just a matter of picking 02, 03, 04.

> 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
> >
> > 2a. the dynamic array case
> > REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
> > will grow the array when it reaches a certain level of occupation.
> > I have set the default occupation level to 86% so that, if enabled, for a
> > random input it will grow when we have about 2*size pins in total.
> > If we find a sequential pattern then it will grow without growing the
> hash
> > table.
> > For the array lookup I don't use a hash, so for small number of pins
> > it will be very fast.
>
> I doubt it makes sense to basically have two levels of hash tables.
>
>
> > 2b. the static case
> > REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
> > will use a static array, just as we had before and will not perform the
> > linear
> > search. It still has to read the size and do mask input.
> >
> > I tested the 4 variations and the winner is with the static array without
> > the cache for the last entry.
> > I increased the array size from 8 to 32, since you suggested before that
> > that this could help. At that point it would have the tradeoff of a
> longer
> > linear search, so it may help even more now.
>
> Does your benchmark actually test repeated accesses to the same refcount?
> As
> mentioned, those are very very common (access sequences like Pin, (Lock,
> Unlock)+, Unpin are extremely common).

Yes that is the case a FIFO with one buffer, I didn't include the lock
unlock tests here.
I did it a some point and it

I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
> that does not *at all* communicate that this is backend private state.
>

I suspected that, GetEntryForPrivateReferenceCountOfSharedBuffer would be
more accurate right?
I probably will stick with the original names.

> I'd strongly advise separating moving code from a large scale rename.  I
> certainly won't waste time trying to see the difference between what was
> just
> moved and what was changed at the same time.
>

Would you prefer not moving the code at all? One of the main reasons
for this was the changes in data structure on the patch 04, that I will not
include in the next version.

> > diff --git a/.gitignore b/.gitignore
> > index 4e911395fe..fddb7f861d 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -43,3 +43,5 @@ lib*.pc
> >  /Release/
> >  /tmp_install/
> >  /portlock/
> > +
> > +.*
> > \ No newline at end of file
>
> What? Certainly not.
>

Do you mean, we should certainly not exclude hidden files from git?
I usually build with prefix postgresql/.build/patch-*/
Then whenever I checkout something I have to keep adding this again.

I don't think it's a good idea to introduce new simplehash infrastructure as
> part of this larger change.

Do you think it is worth doing that as a separate patch? Then we get it
out of the way on this that probably will go a few more versions?

You also haven't documented the new stuff.

Do you mean as source comments, or is there a separate documentation
for this?

> > The previous implementation used an 8-bytes (64-bit) entry to store
> > a uint32 count and an uint32 lock mode. That is fine when we store
> > the data separate from the key (buffer). But in the simplehash
> > {key, value} are stored together, so each entry is 12-bytes.
> > This is somewhat awkward as we have to either pad the entry to 16-bytes,
> > or the access will be an alternating aligned/misaligned addreses.
> >
> > Lock can assume only 4 values, and 2^30 is a decent limit for the
> > number of pins on a single buffer. So this change is packing the
> > {count[31:2], lock[1:0]} into a single uint32.
> >
> > Incrementing/decrementing the count continue the same, just using
> > 4 instead of 1, lock mode access will require one or two additional
> > bitwise operations. The exact count requires one shift, and is used
> > only for debugging. A special function is provided to check whether
> > count == 1.
>
> Have you actually evaluated the benefit from this? Pretty sceptical it's
> worth
> it.
>

I tested, and I agree, not worth it from a speed perspective.
At this point the only part left is the introduction of the simplehash.

However what I will try is to store just the buffer number in the hash
and keep another array for the entries, who knows that works better.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Addressing buffer private reference count scalability issue
@ 2026-03-15 02:37  Alexandre Felipe <[email protected]>
  parent: Alexandre Felipe <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Alexandre Felipe @ 2026-03-15 02:37 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Hi Andres,

> I don't think it's a good idea to introduce new simplehash infrastructure
as
> part of this larger change.

I am submitting the change I did before on simplehash for empty entry
detection.

> You also haven't documented the new stuff.

What and where I was supposed to document?

01 the change I did before, 02 I applied it to refcount in bufmgr.
Exploring a bit more the code base, 03 removes status from nodeMemoize
04 remove status from pg_rewind/filemap.c
05 was a bit trickier because InvalidBlockNumber is not 0, then I had to
make entries empty after allocation that uses memset 0 by default.

I grepped for a list and there are 21 files in total so I will stop here.


> In my benchmarks allowing vectorization helped a decent amount in real
> queries, because it does away with all the branch misses.

Interesting, did you compile with the default configuration?
I used gcc11 (a bit old I know) and and yes, it remove some branches
but still use a loop with cmove (conditional copy), in some cases it unrolls
but still uses cmove for each entry (the machine I tested  is quite feature
rich
e.g. avx512cd avx512bw avx512vl avx512_vnni.

So I am wondering if the impact I see is not the same impact as you see.

If we go for vectorisation we could do a vectorized loop
e.g. 16 iterations on 16 x 32-bit vectors with early exit.
but that would inevitably make the few entries case slightly slower.


Do you have any other localised issue like this that could be worth looking
into?

Regards,
Alexandre


Attachments:

  [application/octet-stream] v3-0004-Use-null-pointer-as-an-empty-marker.patch (1.9K, 3-v3-0004-Use-null-pointer-as-an-empty-marker.patch)
  download | inline diff:
From f457eb9c232f0e35143aa3ce37480b7b07148c53 Mon Sep 17 00:00:00 2001
From: Bob <[email protected]>
Date: Sun, 15 Mar 2026 01:23:12 +0000
Subject: [PATCH 4/5] Use null pointer as an empty marker

Here simplehash is being simply used as a Set
NULL pointer is a natural empty marker.
---
 src/bin/pg_rewind/filemap.c | 7 ++++++-
 src/bin/pg_rewind/filemap.h | 2 --
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index b79c47f925..574a282d84 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -45,6 +45,9 @@
 #define SH_KEY					path
 #define SH_HASH_KEY(tb, key)	hash_string(key)
 #define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define SH_ENTRY_EMPTY(entry)   ((entry)->path == NULL)
+#define SH_MAKE_EMPTY(entry)    ((entry)->path = NULL)
+#define SH_MAKE_IN_USE(entry)   ((void)0)
 #define SH_SCOPE				static inline
 #define SH_RAW_ALLOCATOR		pg_malloc0
 #define SH_DECLARE
@@ -68,7 +71,6 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 typedef struct keepwal_entry
 {
 	const char *path;
-	uint32		status;
 } keepwal_entry;
 
 #define SH_PREFIX				keepwal
@@ -77,6 +79,9 @@ typedef struct keepwal_entry
 #define SH_KEY					path
 #define SH_HASH_KEY(tb, key)	hash_string(key)
 #define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define SH_ENTRY_EMPTY(entry)   ((entry)->path == NULL)
+#define SH_MAKE_EMPTY(entry)    ((entry)->path = NULL)
+#define SH_MAKE_IN_USE(entry)   ((void)0)
 #define SH_SCOPE				static inline
 #define SH_RAW_ALLOCATOR		pg_malloc0
 #define SH_DECLARE
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 4c6dd8740d..b3e67be1da 100644
--- a/src/bin/pg_rewind/filemap.h
+++ b/src/bin/pg_rewind/filemap.h
@@ -56,8 +56,6 @@ typedef enum
  */
 typedef struct file_entry_t
 {
-	uint32		status;			/* hash status */
-
 	const char *path;
 	file_content_type_t content_type;
 
-- 
2.34.1



  [application/octet-stream] v3-0001-Custom-simplehash-empty-value-detection.patch (6.1K, 4-v3-0001-Custom-simplehash-empty-value-detection.patch)
  download | inline diff:
From db04956b20ba4324a0cd4d391a2d2e3e2a358831 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Sun, 15 Mar 2026 00:00:00 +0000
Subject: [PATCH 1/5] Custom simplehash empty value detection

Changes the empty value identification in simplehash allowing custom
values to be used. The default continues to use the status field.

For types where the key value already has an "invalid" value,
the macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY and SH_MAKE_IN_USE can be
overridden to elliminate the need for a separate status field.
---
 src/include/lib/simplehash.h | 59 +++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 848719232a..3c03a7e9c9 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
 #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey))
 #endif
 
+/*
+ * Macros to check/set entry status. Users can override these to avoid
+ * needing a separate status field if their key type has an "invalid" value.
+ */
+#ifndef SH_ENTRY_EMPTY
+#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_EMPTY
+#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_IN_USE
+#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE)
+#endif
+
 /*
  * Wrap the following definitions in include guards, to avoid multiple
  * definition errors if this header is included more than once.  The rest of
@@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 		uint32		hash;
 		uint32		optimal;
 
-		if (oldentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(oldentry))
 		{
 			startelem = i;
 			break;
@@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	{
 		SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
 
-		if (oldentry->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(oldentry))
 		{
 			uint32		hash;
 			uint32		startelem2;
@@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 			{
 				newentry = &newdata[curelem];
 
-				if (newentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(newentry))
 				{
 					break;
 				}
@@ -653,14 +667,14 @@ restart:
 		SH_ELEMENT_TYPE *entry = &data[curelem];
 
 		/* any empty bucket can directly be used */
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			tb->members++;
 			entry->SH_KEY = key;
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -675,7 +689,7 @@ restart:
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
-			Assert(entry->status == SH_STATUS_IN_USE);
+			Assert(!SH_ENTRY_EMPTY(entry));
 			*found = true;
 			return entry;
 		}
@@ -699,7 +713,7 @@ restart:
 				emptyelem = SH_NEXT(tb, emptyelem, startelem);
 				emptyentry = &data[emptyelem];
 
-				if (emptyentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(emptyentry))
 				{
 					lastentry = emptyentry;
 					break;
@@ -748,7 +762,7 @@ restart:
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			return NULL;
 		}
 
-		Assert(entry->status == SH_STATUS_IN_USE);
+		Assert(!SH_ENTRY_EMPTY(entry));
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 			return entry;
@@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 			return false;
 
-		if (entry->status == SH_STATUS_IN_USE &&
+		if (!SH_ENTRY_EMPTY(entry) &&
 			SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
 			SH_ELEMENT_TYPE *lastentry = entry;
@@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				curelem = SH_NEXT(tb, curelem, startelem);
 				curentry = &tb->data[curelem];
 
-				if (curentry->status != SH_STATUS_IN_USE)
+				if (SH_ENTRY_EMPTY(curentry))
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				/* current is at optimal position, done */
 				if (curoptimal == curelem)
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		curelem = SH_NEXT(tb, curelem, startelem);
 		curentry = &tb->data[curelem];
 
-		if (curentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(curentry))
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		/* current is at optimal position, done */
 		if (curoptimal == curelem)
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[i];
 
-		if (entry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			startelem = i;
 			break;
@@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 
 		if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
 			iter->done = true;
-		if (elem->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(elem))
 		{
 			return elem;
 		}
@@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb)
 
 		elem = &tb->data[i];
 
-		if (elem->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(elem))
 			continue;
 
 		hash = SH_ENTRY_HASH(tb, elem);
@@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_STORE_HASH
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
+#undef SH_ENTRY_EMPTY
+#undef SH_MAKE_EMPTY
+#undef SH_MAKE_IN_USE
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
-- 
2.34.1



  [application/octet-stream] v3-0003-Use-NULL-key-as-empty-in-nodeMemoize.c.patch (1.3K, 5-v3-0003-Use-NULL-key-as-empty-in-nodeMemoize.c.patch)
  download | inline diff:
From e091fd06217aae30cf73697156b22bc66c382445 Mon Sep 17 00:00:00 2001
From: Bob <[email protected]>
Date: Sun, 15 Mar 2026 01:19:05 +0000
Subject: [PATCH 3/5] Use NULL key as empty in nodeMemoize.c

MemoizeKey *key is describeed as Hash key for hash table lookups
But it is not as one would expect.
MemoizeHash_hash computes hash from tb->private_data ignoring the key
Similarly Memoize_equal doesn't use key2, and uses probeslot from tb->private_data.

At line 545
memoize_insert(mstate->hashtable, NULL, found);
if not found we hold a pointer to an entry still empty in
the simplehash, until we assign it a key (line 561)
---
 src/backend/executor/nodeMemoize.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/executor/nodeMemoize.c b/src/backend/executor/nodeMemoize.c
index fdca97d742..44154ad25a 100644
--- a/src/backend/executor/nodeMemoize.c
+++ b/src/backend/executor/nodeMemoize.c
@@ -143,6 +143,9 @@ static bool MemoizeHash_equal(struct memoize_hash *tb,
 #define SH_KEY key
 #define SH_HASH_KEY(tb, key) MemoizeHash_hash(tb, key)
 #define SH_EQUAL(tb, a, b) MemoizeHash_equal(tb, a, b)
+#define SH_ENTRY_EMPTY(entry) ((entry)->key == NULL)
+#define SH_MAKE_EMPTY(entry) ((entry)->key = NULL)
+#define SH_MAKE_IN_USE(entry) ((void)0)
 #define SH_SCOPE static inline
 #define SH_STORE_HASH
 #define SH_GET_HASH(tb, a) a->hash
-- 
2.34.1



  [application/octet-stream] v3-0005-Use-InvalidBlockNumber-as-empty-marker.patch (5.2K, 6-v3-0005-Use-InvalidBlockNumber-as-empty-marker.patch)
  download | inline diff:
From 210916bcc236856e63b258a03573da5319b3e244 Mon Sep 17 00:00:00 2001
From: Bob <[email protected]>
Date: Sun, 15 Mar 2026 01:48:09 +0000
Subject: [PATCH 5/5] Use InvalidBlockNumber as empty marker

This one required an update on simplehash implementation
InvalidBlockNumber defined as ((BlockNumber) 0xFFFFFFFF)
in ./src/include/storage/block.h by default SimpleHash simply
zeroes the memory and that makes everything empty.

For this case we have to call SH_MAKE_EMPTY on each
entry after allocating.

Removing the status field also a the hacks
old_status = entry->status
modify entry, corrupts status
entry->status = old_status

where the status initialized by simplehash had to be
saved and restored when updating the entry.
---
 src/backend/nodes/tidbitmap.c | 19 ++++++-------------
 src/include/lib/simplehash.h  | 22 ++++++++++++++++++++++
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index f1f925cb13..207f27f0ca 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -92,7 +92,6 @@
 typedef struct PagetableEntry
 {
 	BlockNumber blockno;		/* page number (hashtable key) */
-	char		status;			/* hash entry status */
 	bool		ischunk;		/* T = lossy storage, F = exact */
 	bool		recheck;		/* should the tuples be rechecked? */
 	bitmapword	words[Max(WORDS_PER_PAGE, WORDS_PER_CHUNK)];
@@ -237,6 +236,12 @@ static int	tbm_shared_comparator(const void *left, const void *right,
 #define SH_KEY blockno
 #define SH_HASH_KEY(tb, key) murmurhash32(key)
 #define SH_EQUAL(tb, a, b) a == b
+#define SH_ENTRY_EMPTY(entry) ((entry)->blockno == InvalidBlockNumber)
+#define SH_MAKE_EMPTY(entry) ((entry)->blockno = InvalidBlockNumber)
+#define SH_MAKE_IN_USE(entry) ((void)0)
+// Since the empty marker is non-zero, we need to reset the entries
+// after allocation using the custom SH_MAKE_EMPTY macro.
+#define SH_NONZERO_EMPTY
 #define SH_SCOPE static inline
 #define SH_DEFINE
 #define SH_DECLARE
@@ -291,15 +296,12 @@ tbm_create_pagetable(TIDBitmap *tbm)
 	{
 		PagetableEntry *page;
 		bool		found;
-		char		oldstatus;
 
 		page = pagetable_insert(tbm->pagetable,
 								tbm->entry1.blockno,
 								&found);
 		Assert(!found);
-		oldstatus = page->status;
 		memcpy(page, &tbm->entry1, sizeof(PagetableEntry));
-		page->status = oldstatus;
 	}
 
 	tbm->status = TBM_HASH;
@@ -1230,10 +1232,7 @@ tbm_get_pageentry(TIDBitmap *tbm, BlockNumber pageno)
 	/* Initialize it if not present before */
 	if (!found)
 	{
-		char		oldstatus = page->status;
-
 		MemSet(page, 0, sizeof(PagetableEntry));
-		page->status = oldstatus;
 		page->blockno = pageno;
 		/* must count it too */
 		tbm->nentries++;
@@ -1317,10 +1316,7 @@ tbm_mark_page_lossy(TIDBitmap *tbm, BlockNumber pageno)
 	/* Initialize it if not present before */
 	if (!found)
 	{
-		char		oldstatus = page->status;
-
 		MemSet(page, 0, sizeof(PagetableEntry));
-		page->status = oldstatus;
 		page->blockno = chunk_pageno;
 		page->ischunk = true;
 		/* must count it too */
@@ -1329,11 +1325,8 @@ tbm_mark_page_lossy(TIDBitmap *tbm, BlockNumber pageno)
 	}
 	else if (!page->ischunk)
 	{
-		char		oldstatus = page->status;
-
 		/* chunk header page was formerly non-lossy, make it lossy */
 		MemSet(page, 0, sizeof(PagetableEntry));
-		page->status = oldstatus;
 		page->blockno = chunk_pageno;
 		page->ischunk = true;
 		/* we assume it had some tuple bit(s) set, so mark it lossy */
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 3c03a7e9c9..c890a09b76 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -301,6 +301,12 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
 #define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE)
 #endif
 
+/*
+ * If the empty marker is non-zero (e.g., InvalidBlockNumber = 0xFFFFFFFF),
+ * define SH_NONZERO_EMPTY to explicitly initialize entries. When unset,
+ * zero-initialization via memset is sufficient (the default).
+ */
+
 /*
  * Wrap the following definitions in include guards, to avoid multiple
  * definition errors if this header is included more than once.  The rest of
@@ -481,6 +487,11 @@ SH_CREATE(MemoryContext ctx, uint32 nelements, void *private_data)
 
 	tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * size);
 
+#ifdef SH_NONZERO_EMPTY
+	for (uint64 i = 0; i < size; i++)
+		SH_MAKE_EMPTY(&tb->data[i]);
+#endif
+
 	SH_UPDATE_PARAMETERS(tb, size);
 	return tb;
 }
@@ -497,7 +508,12 @@ SH_DESTROY(SH_TYPE * tb)
 SH_SCOPE void
 SH_RESET(SH_TYPE * tb)
 {
+#ifdef SH_NONZERO_EMPTY
+	for (uint32 i = 0; i < tb->size; i++)
+		SH_MAKE_EMPTY(&tb->data[i]);
+#else
 	memset(tb->data, 0, sizeof(SH_ELEMENT_TYPE) * tb->size);
+#endif
 	tb->members = 0;
 }
 
@@ -526,6 +542,11 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 
 	tb->data = (SH_ELEMENT_TYPE *) SH_ALLOCATE(tb, sizeof(SH_ELEMENT_TYPE) * newsize);
 
+#ifdef SH_NONZERO_EMPTY
+	for (uint64 j = 0; j < newsize; j++)
+		SH_MAKE_EMPTY(&tb->data[j]);
+#endif
+
 	/*
 	 * Update parameters for new table after allocation succeeds to avoid
 	 * inconsistent state on OOM.
@@ -1222,6 +1243,7 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_ENTRY_EMPTY
 #undef SH_MAKE_EMPTY
 #undef SH_MAKE_IN_USE
+#undef SH_NONZERO_EMPTY
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
-- 
2.34.1



  [application/octet-stream] v3-0002-Use-InvalidBuffer-to-indicate-empty-slot.patch (1.5K, 7-v3-0002-Use-InvalidBuffer-to-indicate-empty-slot.patch)
  download | inline diff:
From 056b123ae41d8ee8da5fdf5d4ee9cd886a52b2d9 Mon Sep 17 00:00:00 2001
From: Bob <[email protected]>
Date: Sun, 15 Mar 2026 00:13:55 +0000
Subject: [PATCH 2/5] Use InvalidBuffer to indicate empty slot

Simple hash requires a mechanism to distinguish
empty slots. The previous implementation of refcount
simplehash was using the default `status` field, adding a
char to the PrivateRefCountEntr.

The buffer already has an reserved value InvalidBuffer that
can be used to mark an entry as empty. Making use of that
removes one field from PrivateRefCountEntry and keep it
32bit aligned, without padding required.
---
 src/backend/storage/buffer/bufmgr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529..db911327f9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -125,8 +125,6 @@ typedef struct PrivateRefCountEntry
 	 */
 	Buffer		buffer;
 
-	char		status;
-
 	PrivateRefCountData data;
 } PrivateRefCountEntry;
 
@@ -136,6 +134,9 @@ typedef struct PrivateRefCountEntry
 #define SH_KEY buffer
 #define SH_HASH_KEY(tb, key) murmurhash32((uint32) (key))
 #define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
+#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
+#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment implies in use */
 #define SH_SCOPE static inline
 #define SH_DECLARE
 #define SH_DEFINE
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-03-15 02:37 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-08 16:09 Addressing buffer private reference count scalability issue Alexandre Felipe <[email protected]>
2026-03-08 17:09 ` Andres Freund <[email protected]>
2026-03-11 18:03   ` Alexandre Felipe <[email protected]>
2026-03-11 18:55     ` Andres Freund <[email protected]>
2026-03-11 22:41       ` Alexandre Felipe <[email protected]>
2026-03-15 02:37         ` Alexandre Felipe <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox