public inbox for [email protected]  
help / color / mirror / Atom feed
From: Alexandre Felipe <[email protected]>
To: Andres Freund <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Addressing buffer private reference count scalability issue
Date: Wed, 11 Mar 2026 18:03:34 +0000
Message-ID: <CAE8JnxNqf=sYB-hfeHBEtXi+aC8jezqqukdgRuR2=t8nbetL=w@mail.gmail.com> (raw)
In-Reply-To: <krknshnvus4qhehtoqtwnroemgxqwlfmykark6umd6hf64xnku@ibxx734ds3ga>
References: <CAE8JnxNTETEUiAOF31=_yo=pvyAi9npOeJfcTvEJJbi4vomtYA@mail.gmail.com>
	<krknshnvus4qhehtoqtwnroemgxqwlfmykark6umd6hf64xnku@ibxx734ds3ga>

Hi Hackers,

Please find attached version 2 of this patch and the results of the
ReadBuffer/ReleaseBuffer benchmark

For those who prefer plain text here goes an excerpt
                        baseline   patch
num_buffers pattern
1           random        194.76  183.50
            sequential    183.75  168.02
2           random        204.74  187.02
            sequential    195.27  169.58
3           random        222.62  192.57
            sequential    212.81  169.55
7           random        233.96  219.64
            sequential    221.94  171.94
11          random        365.50  247.96
            sequential    318.24  183.96
19          random        390.58  267.96
            sequential    369.51  195.35
100         random        424.03  286.33
            sequential    420.74  280.23
988         random        445.75  301.37
            sequential    453.61  313.30
10000       random        587.13  418.91
            sequential    497.92  394.10

On Sun, Mar 8, 2026 at 5:09 PM Andres Freund <[email protected]> wrote:

> Hi,
>
> On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote:
> >    2.
> >
> >    Refactoring reference counting: Before starting to change code and
> >    potentially breaking things I considered prudent to isolate it to
> limit the
> >    damage. This code was part of a 8k+ LOC file.
>
> Not allowing, at least the fast paths, to be inlined will make the most
> common
> cases measurably slower, I've tried that.
>

I made it inline in this time. I placed in a separate file, I am including
the
.h at the top and the .c


> > Here I assume that the buffer buffer sequence is independent enough from
> the
> > array size, so I use the buffer as the hash key directly, omitting a hash
> > function call.
>
> I doubt that that's good enough. The hash table code relies on the bits
> being
> well mixed, but you won't get that with buffer ids.
>

Introduced murmurhash32, it is noticeably worse for sequential access
(by buffer index), but has some mixing.


> I'm rather sceptical that the overhead of needing to shift and mask is
> worth
> it. I suspect we'll also want to add more state for each entry (e.g. I
> think
> it may be worth getting rid of the io-in-progress resowner).
>

io-in-progress resowner name suggests that it runs only for reads...
so maybe it could be taken out of the way of buffer hits?


> >    REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
> >    lookup, it is worth trying to remove it completely and keep only the
> hash.
> >    For small values we would be trading a few branches for a buffer %
> SIZE,
> >    for the use case of prefetch where pin/unpin in a FIFO fashion, it
> will
> >    save an 8-entry array lookup, and some extra data moves.
>
> I doubt that that's ok, in the vast majority of access we will have 0-2
> buffers pinned. And even when we have pinned more buffers, it's
> *exceedingly*
> common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin),
> adding a few cycles to each of those repeated accesses will quickly show
> up.
>

I brought back the array, but I eliminated the linear search.

1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.

2a. the dynamic array case
REFCOUNT_ARRAY_MAX_SIZE!=REFCOUNT_ARRAY_INITIAL_SIZE
will grow the array when it reaches a certain level of occupation.
I have set the default occupation level to 86% so that, if enabled, for a
random input it will grow when we have about 2*size pins in total.
If we find a sequential pattern then it will grow without growing the hash
table.
For the array lookup I don't use a hash, so for small number of pins
it will be very fast.

2b. the static case
REFCOUNT_ARRAY_MAX_SIZE==REFCOUNT_ARRAY_INITIAL_SIZE
will use a static array, just as we had before and will not perform the
linear
search. It still has to read the size and do mask input.

I tested the 4 variations and the winner is with the static array without
the cache for the last entry.
I increased the array size from 8 to 32, since you suggested before that
that this could help. At that point it would have the tradeoff of a longer
linear search, so it may help even more now.

> Incrementing/decrementing the count continue the same, just using
> > 4 instead of 1, lock mode access will require one or two additional
> > bitwise operations.
> >
> > No bit-shifts are required.
>
> I don't know how that last sentence can be true, given that:
>
> > -struct PrivateRefCountEntry
> > +#define PRIVATEREFCOUNT_LOCKMODE_MASK        0x3
> > +#define ONE_PRIVATE_REFERENCE                        4       /* 1 << 2
> */
> > +
> > +#define PrivateRefCountGetLockMode(d)        ((BufferLockMode)((d) &
> PRIVATEREFCOUNT_LOCKMODE_MASK))
> > +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) &
> ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
> > +#define PrivateRefCountGetRefCount(d)        ((int32)((d) >> 2))
> > +#define PrivateRefCountIsZero(d)             ((d) <
> ONE_PRIVATE_REFERENCE)
>
> Involves shifts at least in PrivateRefCountGetRefCount()..
>

Yes, there was some unintended code there.
I am adding SharedBufferHasSingleRef to replace the (...) == 1 checks.
The actual count is used only for debugging, so the shift there shouldn't
be a problem.

Regards,
Alexandre


Attachments:

  [image/png] compare-read.png (156.2K, 3-compare-read.png)
  download | view image

  [application/x-patch] v2-0004-Compact-PrivateRefCountEntry.patch (12.8K, 4-v2-0004-Compact-PrivateRefCountEntry.patch)
  download | inline diff:
From 4b96a723bfc3a6fe78e05bc3ce454b0b160679d1 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:08:03 +0000
Subject: [PATCH 4/5] Compact PrivateRefCountEntry

The previous implementation used an 8-bytes (64-bit) entry to store
a uint32 count and an uint32 lock mode. That is fine when we store
the data separate from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12-bytes.
This is somewhat awkward as we have to either pad the entry to 16-bytes,
or the access will be an alternating aligned/misaligned addreses.

Lock can assume only 4 values, and 2^30 is a decent limit for the
number of pins on a single buffer. So this change is packing the
{count[31:2], lock[1:0]} into a single uint32.

Incrementing/decrementing the count continue the same, just using
4 instead of 1, lock mode access will require one or two additional
bitwise operations. The exact count requires one shift, and is used
only for debugging. A special function is provided to check whether
count == 1.


No bit-shifts are required.
---
 src/backend/storage/buffer/bufmgr_refcnt.c | 186 +++++++++------------
 src/backend/storage/buffer/bufmgr_refcnt.h |   2 +-
 2 files changed, 76 insertions(+), 112 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
index dd95d953f6..bf84e9f62e 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.c
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -47,47 +47,45 @@
 #include "bufmgr_refcnt.h"
 #include "storage/proc.h"
 
-/* Structure definitions - internal to this file */
-typedef struct PrivateRefCountData
+/*
+ * Implementation details - opaque to callers.
+ * Packed refcount and lockmode in a single uint32:
+ *   Bits 0-1:  lockmode (4 values: UNLOCK, SHARE, SHARE_EXCLUSIVE, EXCLUSIVE)
+ *   Bits 2-31: refcount (30 bits, max ~1 billion)
+ */
+struct PrivateRefCountEntry
 {
-	int32		refcount;
-	BufferLockMode lockmode;
-} PrivateRefCountData;
+	Buffer		buffer;
+	uint32		data;
+};
 
-struct PrivateRefCountEntry
+#define PRIVATEREFCOUNT_LOCKMODE_MASK	0x3
+#define ONE_PRIVATE_REFERENCE			4	/* 1 << 2 */
+
+#define PrivateRefCountIsZero(d)		((d) < ONE_PRIVATE_REFERENCE)
+
+struct PrivateRefCountIterator
 {
-	Buffer		buffer;		/* hash key - InvalidBuffer means empty */
-	PrivateRefCountData data;
+	int			array_index;
+	bool		in_hash;
+	void	   *hash_status;
 };
 
-/*
- * Define simplehash parameters for the overflow hash table.
- * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
- * a separate status field.
- */
+/* Define simplehash for private refcount overflow hash table */
 #define SH_PREFIX refcount
 #define SH_ELEMENT_TYPE PrivateRefCountEntry
 #define SH_KEY_TYPE Buffer
 #define SH_KEY buffer
-#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_HASH_KEY(tb, key) ((uint32) murmurhash32(key))
 #define SH_EQUAL(tb, a, b) ((a) == (b))
 #define SH_SCOPE static inline
-#define SH_DEFINE
-#define SH_DECLARE
-/* Use buffer field as empty marker - no separate status needed */
 #define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
 #define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
-#define SH_MAKE_IN_USE(entry) ((void)0)  /* key assignment handles this */
+#define SH_MAKE_IN_USE(entry) ((void) 0)
+#define SH_DECLARE
+#define SH_DEFINE
 #include "lib/simplehash.h"
 
-struct PrivateRefCountIterator
-{
-	int			array_index;
-	bool		in_hash;
-	refcount_iterator hash_iter;
-	bool		hash_iter_valid;
-};
-
 /* Private refcount array and keys */
 #define REFCOUNT_ARRAY_ENTRIES 8
 static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
@@ -112,9 +110,9 @@ static int	PrivateRefCountEntryLast = -1;
 static uint32 MaxProportionalPins = 0;
 
 /* Forward declarations */
-static void ReservePrivateRefCountEntry(void);
+static void ReservePrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
 
 /*
  * Initialize private refcount tracking for this backend.
@@ -133,7 +131,7 @@ InitPrivateRefCount(void)
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
+	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
 }
 
 /*
@@ -141,15 +139,15 @@ InitPrivateRefCount(void)
  * entry.
  */
 static void
-ReservePrivateRefCountEntry(void)
+ReservePrivateRefCountEntry(Buffer buffer)
 {
 	/* Already reserved (or freed), nothing to do */
 	if (ReservedRefCountSlot != -1)
 		return;
 
 	/*
-	 * First search for a free entry the array, that'll be sufficient in the
-	 * majority of cases.
+	 * First search for a free entry in the array, that'll be sufficient in
+	 * the majority of cases.
 	 */
 	{
 		int			i;
@@ -159,11 +157,9 @@ ReservePrivateRefCountEntry(void)
 			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
 			{
 				ReservedRefCountSlot = i;
+				return;
 			}
 		}
-
-		if (ReservedRefCountSlot != -1)
-			return;
 	}
 
 	/*
@@ -196,10 +192,7 @@ ReservePrivateRefCountEntry(void)
 		/* clear the now free array slot */
 		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
 		victim_entry->buffer = InvalidBuffer;
-
-		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
-		victim_entry->data.refcount = 0;
-		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+		victim_entry->data = 0;
 
 		PrivateRefCountOverflowed++;
 	}
@@ -222,8 +215,7 @@ NewPrivateRefCountEntry(Buffer buffer)
 	/* and fill it */
 	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
 	res->buffer = buffer;
-	res->data.refcount = 0;
-	res->data.lockmode = BUFFER_LOCK_UNLOCK;
+	res->data = 0;
 
 	/* update cache for the next lookup */
 	PrivateRefCountEntryLast = ReservedRefCountSlot;
@@ -237,10 +229,9 @@ NewPrivateRefCountEntry(Buffer buffer)
  * Slow-path for GetSharedBufferEntry().
  */
 static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+GetPrivateRefCountEntrySlow(Buffer buffer)
 {
 	PrivateRefCountEntry *res;
-	int			match = -1;
 	int			i;
 
 	/*
@@ -250,16 +241,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	{
 		if (PrivateRefCountArrayKeys[i] == buffer)
 		{
-			match = i;
+			PrivateRefCountEntryLast = i;
+			return &PrivateRefCountArray[i];
 		}
 	}
 
-	if (likely(match != -1))
-	{
-		PrivateRefCountEntryLast = match;
-		return &PrivateRefCountArray[match];
-	}
-
 	/*
 	 * Only look up the buffer in the hashtable if we've previously overflowed.
 	 */
@@ -267,41 +253,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 		return NULL;
 
 	res = refcount_lookup(PrivateRefCountHash, buffer);
-
-	if (res == NULL)
-		return NULL;
-	else if (!do_move)
-	{
-		return res;
-	}
-	else
-	{
-		/* move buffer from hashtable into the free array slot */
-		PrivateRefCountEntry *free;
-
-		ReservePrivateRefCountEntry();
-
-		Assert(ReservedRefCountSlot != -1);
-		free = &PrivateRefCountArray[ReservedRefCountSlot];
-		Assert(free->buffer == InvalidBuffer);
-
-		free->buffer = buffer;
-		free->data = res->data;
-		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-		PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-		ReservedRefCountSlot = -1;
-
-		refcount_delete_item(PrivateRefCountHash, res);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-
-		return free;
-	}
+	return res;
 }
 
 /*
- * Return the PrivateRefCountEntry for the passed buffer.
+ * Return the PrivateRefCount entry for the passed buffer.
  * Returns NULL if the buffer is not currently pinned.
  */
 static PrivateRefCountEntry *
@@ -317,7 +273,7 @@ GetSharedBufferEntry(Buffer buffer)
 		return &PrivateRefCountArray[PrivateRefCountEntryLast];
 	}
 
-	return GetPrivateRefCountEntrySlow(buffer, false);
+	return GetPrivateRefCountEntrySlow(buffer);
 }
 
 /*
@@ -333,9 +289,9 @@ SharedBufferCreateRef(Buffer buffer)
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	ReservePrivateRefCountEntry();
+	ReservePrivateRefCountEntry(buffer);
 	ref = NewPrivateRefCountEntry(buffer);
-	ref->data.refcount++;
+	ref->data = ONE_PRIVATE_REFERENCE;
 
 	return ref;
 }
@@ -348,8 +304,8 @@ static void
 SharedBufferRefExisting(PrivateRefCountEntry *ref)
 {
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
-	ref->data.refcount++;
+	Assert(!PrivateRefCountIsZero(ref->data));
+	ref->data += ONE_PRIVATE_REFERENCE;
 }
 
 /*
@@ -361,14 +317,14 @@ static bool
 SharedBufferUnref(PrivateRefCountEntry *ref)
 {
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
+	Assert(!PrivateRefCountIsZero(ref->data));
 
-	ref->data.refcount--;
+	ref->data -= ONE_PRIVATE_REFERENCE;
 
-	if (ref->data.refcount == 0)
+	if (PrivateRefCountIsZero(ref->data))
 	{
 		/* No more references - clean up the entry */
-		Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+		Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
 
 		if (ref >= &PrivateRefCountArray[0] &&
 			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
@@ -379,7 +335,8 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		}
 		else
 		{
-			refcount_delete_item(PrivateRefCountHash, ref);
+			/* could make slightly more efficient by using the pointer */
+			refcount_delete(PrivateRefCountHash, ref->buffer);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
 		}
@@ -390,25 +347,16 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 	return false;
 }
 
-/*
- * Accessors for refcount entry fields.
- */
-static int32
-SharedBufferRefCount(PrivateRefCountEntry *ref)
-{
-	return ref->data.refcount;
-}
-
 static BufferLockMode
 SharedBufferGetLockMode(PrivateRefCountEntry *ref)
 {
-	return ref->data.lockmode;
+	return (BufferLockMode)(ref->data & PRIVATEREFCOUNT_LOCKMODE_MASK);
 }
 
 static void
 SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
 {
-	ref->data.lockmode = mode;
+	ref->data = (ref->data & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | mode;
 }
 
 /*
@@ -417,7 +365,18 @@ SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
 static bool
 SharedBufferHasSingleRef(PrivateRefCountEntry *ref)
 {
-	return ref != NULL && ref->data.refcount == 1;
+	return ref != NULL &&
+		   ref->data >= ONE_PRIVATE_REFERENCE &&
+		   ref->data < (2 * ONE_PRIVATE_REFERENCE);
+}
+
+/*
+ * Get the refcount for a buffer entry (for debug output only).
+ */
+static int32
+SharedBufferRefCount(PrivateRefCountEntry *ref)
+{
+	return (int32)(ref->data >> 2);
 }
 
 /*
@@ -488,7 +447,7 @@ AssertBufferLocksPermitCatalogRead(void)
 		if (buf == InvalidBuffer)
 			continue;
 
-		AssertNotCatalogBufferLock(buf, res->data.lockmode);
+		AssertNotCatalogBufferLock(buf, SharedBufferGetLockMode(res));
 	}
 	FreePrivateRefCountIterator(iter);
 }
@@ -504,7 +463,7 @@ InitPrivateRefCountIterator(void)
 
 	iter->array_index = 0;
 	iter->in_hash = false;
-	iter->hash_iter_valid = false;
+	iter->hash_status = NULL;
 	return iter;
 }
 
@@ -530,20 +489,23 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		iter->in_hash = true;
 		if (PrivateRefCountOverflowed > 0)
 		{
-			refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
-			iter->hash_iter_valid = true;
+			refcount_iterator *hiter = palloc(sizeof(refcount_iterator));
+
+			refcount_start_iterate(PrivateRefCountHash, hiter);
+			iter->hash_status = hiter;
 		}
 	}
 
-	if (iter->hash_iter_valid)
+	if (iter->hash_status != NULL)
 	{
 		PrivateRefCountEntry *res;
 
-		res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
+		res = refcount_iterate(PrivateRefCountHash, (refcount_iterator *) iter->hash_status);
 		if (res != NULL)
 			return res;
 
-		iter->hash_iter_valid = false;
+		pfree(iter->hash_status);
+		iter->hash_status = NULL;
 	}
 
 	return NULL;
@@ -555,6 +517,8 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 void
 FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 {
+	if (iter->hash_status != NULL)
+		pfree(iter->hash_status);
 	pfree(iter);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr_refcnt.h b/src/backend/storage/buffer/bufmgr_refcnt.h
index f2a6f45d84..8d31f6ee6f 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.h
+++ b/src/backend/storage/buffer/bufmgr_refcnt.h
@@ -39,8 +39,8 @@ static void SharedBufferRefExisting(PrivateRefCountEntry *ref);
 static bool SharedBufferUnref(PrivateRefCountEntry *ref);
 static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref);
 static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode);
-static int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
 static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref);
+static int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
 
 /* Pin limiting */
 extern uint32 GetPinLimit(void);
-- 
2.34.1



  [application/x-patch] v2-0002-Refactoring-reference-counting.patch (54.0K, 5-v2-0002-Refactoring-reference-counting.patch)
  download | inline diff:
From 5de90fcd14e5d32c3165c3e7a278adaa44f4d9d1 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:05:50 +0000
Subject: [PATCH 2/5] Refactoring reference counting

This moves the private reference count logic out of
backend/storage/buffer/buffmgr.c that is still 8000+
lines long. Preparing for the changes to come.

---
 .gitignore                                 |   2 +
 src/backend/storage/buffer/Makefile        |   3 +
 src/backend/storage/buffer/bufmgr.c        | 704 +++------------------
 src/backend/storage/buffer/bufmgr_refcnt.c | 615 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr_refcnt.h |  59 ++
 5 files changed, 751 insertions(+), 632 deletions(-)
 create mode 100644 src/backend/storage/buffer/bufmgr_refcnt.c
 create mode 100644 src/backend/storage/buffer/bufmgr_refcnt.h

diff --git a/.gitignore b/.gitignore
index 4e911395fe..fddb7f861d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,5 @@ lib*.pc
 /Release/
 /tmp_install/
 /portlock/
+
+.*
\ No newline at end of file
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index fd7c40dcb0..6f45674cae 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -20,3 +20,6 @@ OBJS = \
 	localbuf.o
 
 include $(top_srcdir)/src/backend/common.mk
+
+# bufmgr.c includes bufmgr_refcnt.c for inlining
+bufmgr.o: bufmgr_refcnt.c bufmgr_refcnt.h
\ No newline at end of file
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0546ee0193..82af282f70 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -54,6 +54,7 @@
 #include "storage/aio.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "bufmgr_refcnt.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -70,7 +71,6 @@
 #include "utils/timestamp.h"
 #include "utils/wait_event.h"
 
-
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
 #define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
@@ -93,43 +93,6 @@
  */
 #define BUF_DROP_FULL_SCAN_THRESHOLD		(uint64) (NBuffers / 32)
 
-/*
- * This is separated out from PrivateRefCountEntry to allow for copying all
- * the data members via struct assignment.
- */
-typedef struct PrivateRefCountData
-{
-	/*
-	 * How many times has the buffer been pinned by this backend.
-	 */
-	int32		refcount;
-
-	/*
-	 * Is the buffer locked by this backend? BUFFER_LOCK_UNLOCK indicates that
-	 * the buffer is not locked.
-	 */
-	BufferLockMode lockmode;
-} PrivateRefCountData;
-
-typedef struct PrivateRefCountEntry
-{
-	/*
-	 * Note that this needs to be same as the entry's corresponding
-	 * PrivateRefCountArrayKeys[i], if the entry is stored in the array. We
-	 * store it in both places as this is used for the hashtable key and
-	 * because it is more convenient (passing around a PrivateRefCountEntry
-	 * suffices to identify the buffer) and faster (checking the keys array is
-	 * faster when checking many entries, checking the entry is faster if just
-	 * checking a single entry).
-	 */
-	Buffer		buffer;
-
-	PrivateRefCountData data;
-} PrivateRefCountEntry;
-
-/* 64 bytes, about the size of a cache line on common systems */
-#define REFCOUNT_ARRAY_ENTRIES 8
-
 /*
  * Status of buffers to checkpoint for a particular tablespace, used
  * internally in BufferSync.
@@ -213,55 +176,6 @@ int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
-/*
- * Backend-Private refcount management:
- *
- * Each buffer also has a private refcount that keeps track of the number of
- * times the buffer is pinned in the current process.  This is so that the
- * shared refcount needs to be modified only once if a buffer is pinned more
- * than once by an individual backend.  It's also used to check that no
- * buffers are still pinned at the end of transactions and when exiting. We
- * also use this mechanism to track whether this backend has a buffer locked,
- * and, if so, in what mode.
- *
- *
- * To avoid - as we used to - requiring an array with NBuffers entries to keep
- * track of local buffers, we use a small sequentially searched array
- * (PrivateRefCountArrayKeys, with the corresponding data stored in
- * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
- * keep track of backend local pins.
- *
- * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
- * refcounts are kept track of in the array; after that, new array entries
- * displace old ones into the hash table. That way a frequently used entry
- * can't get "stuck" in the hashtable while infrequent ones clog the array.
- *
- * Note that in most scenarios the number of pinned buffers will not exceed
- * REFCOUNT_ARRAY_ENTRIES.
- *
- *
- * To enter a buffer into the refcount tracking mechanism first reserve a free
- * entry using ReservePrivateRefCountEntry() and then later, if necessary,
- * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
- * memory allocations in NewPrivateRefCountEntry() which can be important
- * because in some scenarios it's called with a spinlock held...
- */
-static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
-static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
-static HTAB *PrivateRefCountHash = NULL;
-static int32 PrivateRefCountOverflowed = 0;
-static uint32 PrivateRefCountClock = 0;
-static int	ReservedRefCountSlot = -1;
-static int	PrivateRefCountEntryLast = -1;
-
-static uint32 MaxProportionalPins;
-
-static void ReservePrivateRefCountEntry(void);
-static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
-static inline int32 GetPrivateRefCount(Buffer buffer);
-static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
-
 /* ResourceOwner callbacks to hold in-progress I/Os and buffer pins */
 static void ResOwnerReleaseBufferIO(Datum res);
 static char *ResOwnerPrintBufferIO(Datum res);
@@ -286,301 +200,6 @@ const ResourceOwnerDesc buffer_resowner_desc =
 	.DebugPrint = ResOwnerPrintBuffer
 };
 
-/*
- * Ensure that the PrivateRefCountArray has sufficient space to store one more
- * entry. This has to be called before using NewPrivateRefCountEntry() to fill
- * a new entry - but it's perfectly fine to not use a reserved entry.
- */
-static void
-ReservePrivateRefCountEntry(void)
-{
-	/* Already reserved (or freed), nothing to do */
-	if (ReservedRefCountSlot != -1)
-		return;
-
-	/*
-	 * First search for a free entry the array, that'll be sufficient in the
-	 * majority of cases.
-	 */
-	{
-		int			i;
-
-		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-		{
-			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
-			{
-				ReservedRefCountSlot = i;
-
-				/*
-				 * We could return immediately, but iterating till the end of
-				 * the array allows compiler-autovectorization.
-				 */
-			}
-		}
-
-		if (ReservedRefCountSlot != -1)
-			return;
-	}
-
-	/*
-	 * No luck. All array entries are full. Move one array entry into the hash
-	 * table.
-	 */
-	{
-		/*
-		 * Move entry from the current clock position in the array into the
-		 * hashtable. Use that slot.
-		 */
-		int			victim_slot;
-		PrivateRefCountEntry *victim_entry;
-		PrivateRefCountEntry *hashent;
-		bool		found;
-
-		/* select victim slot */
-		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
-		victim_entry = &PrivateRefCountArray[victim_slot];
-		ReservedRefCountSlot = victim_slot;
-
-		/* Better be used, otherwise we shouldn't get here. */
-		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
-		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
-		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
-
-		/* enter victim array entry into hashtable */
-		hashent = hash_search(PrivateRefCountHash,
-							  &PrivateRefCountArrayKeys[victim_slot],
-							  HASH_ENTER,
-							  &found);
-		Assert(!found);
-		/* move data from the entry in the array to the hash entry */
-		hashent->data = victim_entry->data;
-
-		/* clear the now free array slot */
-		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
-		victim_entry->buffer = InvalidBuffer;
-
-		/* clear the whole data member, just for future proofing */
-		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
-		victim_entry->data.refcount = 0;
-		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
-
-		PrivateRefCountOverflowed++;
-	}
-}
-
-/*
- * Fill a previously reserved refcount entry.
- */
-static PrivateRefCountEntry *
-NewPrivateRefCountEntry(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-
-	/* only allowed to be called when a reservation has been made */
-	Assert(ReservedRefCountSlot != -1);
-
-	/* use up the reserved entry */
-	res = &PrivateRefCountArray[ReservedRefCountSlot];
-
-	/* and fill it */
-	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-	res->buffer = buffer;
-	res->data.refcount = 0;
-	res->data.lockmode = BUFFER_LOCK_UNLOCK;
-
-	/* update cache for the next lookup */
-	PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-	ReservedRefCountSlot = -1;
-
-	return res;
-}
-
-/*
- * Slow-path for GetPrivateRefCountEntry(). This is big enough to not be worth
- * inlining. This particularly seems to be true if the compiler is capable of
- * auto-vectorizing the code, as that imposes additional stack-alignment
- * requirements etc.
- */
-static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
-{
-	PrivateRefCountEntry *res;
-	int			match = -1;
-	int			i;
-
-	/*
-	 * First search for references in the array, that'll be sufficient in the
-	 * majority of cases.
-	 */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] == buffer)
-		{
-			match = i;
-			/* see ReservePrivateRefCountEntry() for why we don't return */
-		}
-	}
-
-	if (likely(match != -1))
-	{
-		/* update cache for the next lookup */
-		PrivateRefCountEntryLast = match;
-
-		return &PrivateRefCountArray[match];
-	}
-
-	/*
-	 * By here we know that the buffer, if already pinned, isn't residing in
-	 * the array.
-	 *
-	 * Only look up the buffer in the hashtable if we've previously overflowed
-	 * into it.
-	 */
-	if (PrivateRefCountOverflowed == 0)
-		return NULL;
-
-	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
-
-	if (res == NULL)
-		return NULL;
-	else if (!do_move)
-	{
-		/* caller doesn't want us to move the hash entry into the array */
-		return res;
-	}
-	else
-	{
-		/* move buffer from hashtable into the free array slot */
-		bool		found;
-		PrivateRefCountEntry *free;
-
-		/* Ensure there's a free array slot */
-		ReservePrivateRefCountEntry();
-
-		/* Use up the reserved slot */
-		Assert(ReservedRefCountSlot != -1);
-		free = &PrivateRefCountArray[ReservedRefCountSlot];
-		Assert(PrivateRefCountArrayKeys[ReservedRefCountSlot] == free->buffer);
-		Assert(free->buffer == InvalidBuffer);
-
-		/* and fill it */
-		free->buffer = buffer;
-		free->data = res->data;
-		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-		/* update cache for the next lookup */
-		PrivateRefCountEntryLast = match;
-
-		ReservedRefCountSlot = -1;
-
-
-		/* delete from hashtable */
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-
-		return free;
-	}
-}
-
-/*
- * Return the PrivateRefCount entry for the passed buffer.
- *
- * Returns NULL if a buffer doesn't have a refcount entry. Otherwise, if
- * do_move is true, and the entry resides in the hashtable the entry is
- * optimized for frequent access by moving it to the array.
- */
-static inline PrivateRefCountEntry *
-GetPrivateRefCountEntry(Buffer buffer, bool do_move)
-{
-	Assert(BufferIsValid(buffer));
-	Assert(!BufferIsLocal(buffer));
-
-	/*
-	 * It's very common to look up the same buffer repeatedly. To make that
-	 * fast, we have a one-entry cache.
-	 *
-	 * In contrast to the loop in GetPrivateRefCountEntrySlow(), here it
-	 * faster to check PrivateRefCountArray[].buffer, as in the case of a hit
-	 * fewer addresses are computed and fewer cachelines are accessed. Whereas
-	 * in GetPrivateRefCountEntrySlow()'s case, checking
-	 * PrivateRefCountArrayKeys saves a lot of memory accesses.
-	 */
-	if (likely(PrivateRefCountEntryLast != -1) &&
-		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
-	{
-		return &PrivateRefCountArray[PrivateRefCountEntryLast];
-	}
-
-	/*
-	 * The code for the cached lookup is small enough to be worth inlining
-	 * into the caller. In the miss case however, that empirically doesn't
-	 * seem worth it.
-	 */
-	return GetPrivateRefCountEntrySlow(buffer, do_move);
-}
-
-/*
- * Returns how many times the passed buffer is pinned by this backend.
- *
- * Only works for shared memory buffers!
- */
-static inline int32
-GetPrivateRefCount(Buffer buffer)
-{
-	PrivateRefCountEntry *ref;
-
-	Assert(BufferIsValid(buffer));
-	Assert(!BufferIsLocal(buffer));
-
-	/*
-	 * Not moving the entry - that's ok for the current users, but we might
-	 * want to change this one day.
-	 */
-	ref = GetPrivateRefCountEntry(buffer, false);
-
-	if (ref == NULL)
-		return 0;
-	return ref->data.refcount;
-}
-
-/*
- * Release resources used to track the reference count of a buffer which we no
- * longer have pinned and don't want to pin again immediately.
- */
-static void
-ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
-{
-	Assert(ref->data.refcount == 0);
-	Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
-
-	if (ref >= &PrivateRefCountArray[0] &&
-		ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
-	{
-		ref->buffer = InvalidBuffer;
-		PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
-
-
-		/*
-		 * Mark the just used entry as reserved - in many scenarios that
-		 * allows us to avoid ever having to search the array/hash for free
-		 * entries.
-		 */
-		ReservedRefCountSlot = ref - PrivateRefCountArray;
-	}
-	else
-	{
-		bool		found;
-		Buffer		buffer = ref->buffer;
-
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
-		Assert(PrivateRefCountOverflowed > 0);
-		PrivateRefCountOverflowed--;
-	}
-}
-
 /*
  * BufferIsPinned
  *		True iff the buffer is pinned (also checks for valid buffer number).
@@ -596,7 +215,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 		BufferIsLocal(bufnum) ? \
 			(LocalRefCount[-(bufnum) - 1] > 0) \
 		: \
-	(GetPrivateRefCount(bufnum) > 0) \
+	(GetSharedBufferEntry(bufnum) != NULL) \
 )
 
 
@@ -653,7 +272,6 @@ static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 										   RelFileLocator dstlocator,
 										   ForkNumber forkNum, bool permanent);
 static void AtProcExit_Buffers(int code, Datum arg);
-static void CheckForBufferLeaks(void);
 #ifdef USE_ASSERT_CHECKING
 static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode);
 #endif
@@ -812,7 +430,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	Assert(BufferIsValid(recent_buffer));
 
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 	InitBufferTag(&tag, &rlocator, forkNum, blockNum);
 
 	if (BufferIsLocal(recent_buffer))
@@ -2115,7 +1732,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Make sure we will have room to remember the buffer pin */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	/* create a tag so we can lookup the buffer */
 	InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -2327,7 +1943,7 @@ retry:
 		UnlockBufHdr(buf);
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
-		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
+		if (GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) != NULL)
 			elog(ERROR, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
@@ -2380,7 +1996,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
 	LWLock	   *partition_lock;
 	BufferTag	tag;
 
-	Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+	Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)) != NULL);
 
 	/* have buffer pinned, so it's safe to read tag without lock */
 	tag = buf_hdr->tag;
@@ -2461,7 +2077,6 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	 * Ensure, before we pin a victim buffer, that there's a free refcount
 	 * entry and resource owner slot for the pin.
 	 */
-	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/* we return here if a prospective victim buffer gets used concurrently */
@@ -2591,64 +2206,6 @@ again:
 	return buf;
 }
 
-/*
- * Return the maximum number of buffers that a backend should try to pin once,
- * to avoid exceeding its fair share.  This is the highest value that
- * GetAdditionalPinLimit() could ever return.  Note that it may be zero on a
- * system with a very small buffer pool relative to max_connections.
- */
-uint32
-GetPinLimit(void)
-{
-	return MaxProportionalPins;
-}
-
-/*
- * Return the maximum number of additional buffers that this backend should
- * pin if it wants to stay under the per-backend limit, considering the number
- * of buffers it has already pinned.  Unlike LimitAdditionalPins(), the limit
- * return by this function can be zero.
- */
-uint32
-GetAdditionalPinLimit(void)
-{
-	uint32		estimated_pins_held;
-
-	/*
-	 * We get the number of "overflowed" pins for free, but don't know the
-	 * number of pins in PrivateRefCountArray.  The cost of calculating that
-	 * exactly doesn't seem worth it, so just assume the max.
-	 */
-	estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
-
-	/* Is this backend already holding more than its fair share? */
-	if (estimated_pins_held > MaxProportionalPins)
-		return 0;
-
-	return MaxProportionalPins - estimated_pins_held;
-}
-
-/*
- * Limit the number of pins a batch operation may additionally acquire, to
- * avoid running out of pinnable buffers.
- *
- * One additional pin is always allowed, on the assumption that the operation
- * requires at least one to make progress.
- */
-void
-LimitAdditionalPins(uint32 *additional_pins)
-{
-	uint32		limit;
-
-	if (*additional_pins <= 1)
-		return;
-
-	limit = GetAdditionalPinLimit();
-	limit = Max(limit, 1);
-	if (limit < *additional_pins)
-		*additional_pins = limit;
-}
-
 /*
  * Logic shared between ExtendBufferedRelBy(), ExtendBufferedRelTo(). Just to
  * avoid duplicating the tracing and relpersistence related logic.
@@ -2812,7 +2369,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 		/* in case we need to pin an existing buffer below */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		InitBufferTag(&tag, &BMR_GET_SMGR(bmr)->smgr_rlocator.locator, fork,
 					  first_block + i);
@@ -3185,9 +2741,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 	PrivateRefCountEntry *ref;
 
 	Assert(!BufferIsLocal(b));
-	Assert(ReservedRefCountSlot != -1);
 
-	ref = GetPrivateRefCountEntry(b, true);
+	ref = GetSharedBufferEntry(b);
 
 	if (ref == NULL)
 	{
@@ -3257,8 +2812,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 		 */
 		result = (pg_atomic_read_u64(&buf->state) & BM_VALID) != 0;
 
-		Assert(ref->data.refcount > 0);
-		ref->data.refcount++;
+		SharedBufferRefExisting(ref);
 		ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
 	}
 
@@ -3296,7 +2850,7 @@ PinBuffer_Locked(BufferDesc *buf)
 	 * As explained, We don't expect any preexisting pins. That allows us to
 	 * manipulate the PrivateRefCount after releasing the spinlock
 	 */
-	Assert(GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf), false) == NULL);
+	Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) == NULL);
 
 	/*
 	 * Since we hold the buffer spinlock, we can update the buffer state and
@@ -3373,11 +2927,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
 	Assert(!BufferIsLocal(b));
 
 	/* not moving as we're likely deleting it soon anyway */
-	ref = GetPrivateRefCountEntry(b, false);
+	ref = GetSharedBufferEntry(b);
 	Assert(ref != NULL);
-	Assert(ref->data.refcount > 0);
-	ref->data.refcount--;
-	if (ref->data.refcount == 0)
+
+	if (SharedBufferUnref(ref))
 	{
 		uint64		old_buf_state;
 
@@ -3402,8 +2955,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 		/* Support LockBufferForCleanup() */
 		if (old_buf_state & BM_PIN_COUNT_WAITER)
 			WakePinCountWaiter(buf);
-
-		ForgetPrivateRefCountEntry(ref);
 	}
 }
 
@@ -3414,10 +2965,7 @@ UnpinBufferNoOwner(BufferDesc *buf)
 inline void
 TrackNewBufferPin(Buffer buf)
 {
-	PrivateRefCountEntry *ref;
-
-	ref = NewPrivateRefCountEntry(buf);
-	ref->data.refcount++;
+	SharedBufferCreateRef(buf);
 
 	ResourceOwnerRememberBuffer(CurrentResourceOwner, buf);
 
@@ -4037,7 +3585,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	BufferTag	tag;
 
 	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
 	/*
@@ -4101,11 +3648,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 void
 AtEOXact_Buffers(bool isCommit)
 {
-	CheckForBufferLeaks();
+	CheckPrivateRefCountLeaks();
 
 	AtEOXact_LocalBuffers(isCommit);
-
-	Assert(PrivateRefCountOverflowed == 0);
 }
 
 /*
@@ -4118,25 +3663,8 @@ AtEOXact_Buffers(bool isCommit)
 void
 InitBufferManagerAccess(void)
 {
-	HASHCTL		hash_ctl;
-
-	/*
-	 * An advisory limit on the number of pins each backend should hold, based
-	 * on shared_buffers and the maximum number of connections possible.
-	 * That's very pessimistic, but outside toy-sized shared_buffers it should
-	 * allow plenty of pins.  LimitAdditionalPins() and
-	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	 */
-	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
-	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
-	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
-
-	hash_ctl.keysize = sizeof(Buffer);
-	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
 
-	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
-									  HASH_ELEM | HASH_BLOBS);
+	InitPrivateRefCount();
 
 	/*
 	 * AtProcExit_Buffers needs LWLock access, and thereby has to be called at
@@ -4155,112 +3683,13 @@ AtProcExit_Buffers(int code, Datum arg)
 {
 	UnlockBuffers();
 
-	CheckForBufferLeaks();
+	CheckPrivateRefCountLeaks();
 
 	/* localbuf.c needs a chance too */
 	AtProcExit_LocalBuffers();
 }
 
-/*
- *		CheckForBufferLeaks - ensure this backend holds no buffer pins
- *
- *		As of PostgreSQL 8.0, buffer pins should get released by the
- *		ResourceOwner mechanism.  This routine is just a debugging
- *		cross-check that no pins remain.
- */
-static void
-CheckForBufferLeaks(void)
-{
 #ifdef USE_ASSERT_CHECKING
-	int			RefCountErrors = 0;
-	PrivateRefCountEntry *res;
-	int			i;
-	char	   *s;
-
-	/* check the array */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
-		{
-			res = &PrivateRefCountArray[i];
-
-			s = DebugPrintBufferRefcount(res->buffer);
-			elog(WARNING, "buffer refcount leak: %s", s);
-			pfree(s);
-
-			RefCountErrors++;
-		}
-	}
-
-	/* if necessary search the hash */
-	if (PrivateRefCountOverflowed)
-	{
-		HASH_SEQ_STATUS hstat;
-
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
-		{
-			s = DebugPrintBufferRefcount(res->buffer);
-			elog(WARNING, "buffer refcount leak: %s", s);
-			pfree(s);
-			RefCountErrors++;
-		}
-	}
-
-	Assert(RefCountErrors == 0);
-#endif
-}
-
-#ifdef USE_ASSERT_CHECKING
-/*
- * Check for exclusive-locked catalog buffers.  This is the core of
- * AssertCouldGetRelation().
- *
- * A backend would self-deadlock on the content lock if the catalog scan read
- * the exclusive-locked buffer.  The main threat is exclusive-locked buffers
- * of catalogs used in relcache, because a catcache search on any catalog may
- * build that catalog's relcache entry.  We don't have an inventory of
- * catalogs relcache uses, so just check buffers of most catalogs.
- *
- * It's better to minimize waits while holding an exclusive buffer lock, so it
- * would be nice to broaden this check not to be catalog-specific.  However,
- * bttextcmp() accesses pg_collation, and non-core opclasses might similarly
- * read tables.  That is deadlock-free as long as there's no loop in the
- * dependency graph: modifying table A may cause an opclass to read table B,
- * but it must not cause a read of table A.
- */
-void
-AssertBufferLocksPermitCatalogRead(void)
-{
-	PrivateRefCountEntry *res;
-
-	/* check the array */
-	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
-		{
-			res = &PrivateRefCountArray[i];
-
-			if (res->buffer == InvalidBuffer)
-				continue;
-
-			AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
-		}
-	}
-
-	/* if necessary search the hash */
-	if (PrivateRefCountOverflowed)
-	{
-		HASH_SEQ_STATUS hstat;
-
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
-		{
-			AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
-		}
-	}
-}
-
 static void
 AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode)
 {
@@ -4312,8 +3741,10 @@ DebugPrintBufferRefcount(Buffer buffer)
 	}
 	else
 	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
 		buf = GetBufferDescriptor(buffer - 1);
-		loccount = GetPrivateRefCount(buffer);
+		loccount = ref ? SharedBufferRefCount(ref) : 0;
 		backend = INVALID_PROC_NUMBER;
 	}
 
@@ -5100,7 +4531,6 @@ FlushRelationBuffers(Relation rel)
 				error_context_stack = &errcallback;
 
 				/* Make sure we can handle the pin */
-				ReservePrivateRefCountEntry();
 				ResourceOwnerEnlarge(CurrentResourceOwner);
 
 				/*
@@ -5136,7 +4566,6 @@ FlushRelationBuffers(Relation rel)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5231,7 +4660,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5457,7 +4885,6 @@ FlushDatabaseBuffers(Oid dbid)
 			continue;
 
 		/* Make sure we can handle the pin */
-		ReservePrivateRefCountEntry();
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
 		buf_state = LockBufHdr(bufHdr);
@@ -5532,17 +4959,18 @@ UnlockReleaseBuffer(Buffer buffer)
 void
 IncrBufferRefCount(Buffer buffer)
 {
-	Assert(BufferIsPinned(buffer));
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 	if (BufferIsLocal(buffer))
+	{
+		Assert(LocalRefCount[-buffer - 1] > 0);
 		LocalRefCount[-buffer - 1]++;
+	}
 	else
 	{
-		PrivateRefCountEntry *ref;
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
 
-		ref = GetPrivateRefCountEntry(buffer, true);
 		Assert(ref != NULL);
-		ref->data.refcount++;
+		SharedBufferRefExisting(ref);
 	}
 	ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
 }
@@ -5561,11 +4989,9 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
 {
 	Page		page = BufferGetPage(buffer);
 
-	Assert(GetPrivateRefCount(buffer) > 0);
-
+	Assert(GetSharedBufferEntry(buffer) != NULL);
 	/* here, either share-exclusive or exclusive lock is OK */
-	Assert(BufferLockHeldByMeInMode(bufHdr, BUFFER_LOCK_EXCLUSIVE) ||
-		   BufferLockHeldByMeInMode(bufHdr, BUFFER_LOCK_SHARE_EXCLUSIVE));
+	Assert(BufferIsLockedByMe(buffer));
 
 	/*
 	 * This routine might get called many times on the same page, if we are
@@ -5777,12 +5203,12 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	 * Get reference to the refcount entry before we hold the lock, it seems
 	 * better to do before holding the lock.
 	 */
-	entry = GetPrivateRefCountEntry(buffer, true);
+	entry = GetSharedBufferEntry(buffer);
 
 	/*
 	 * We better not already hold a lock on the buffer.
 	 */
-	Assert(entry->data.lockmode == BUFFER_LOCK_UNLOCK);
+	Assert(SharedBufferGetLockMode(entry) == BUFFER_LOCK_UNLOCK);
 
 	/*
 	 * Lock out cancel/die interrupts until we exit the code section protected
@@ -5871,7 +5297,7 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	}
 
 	/* Remember that we now hold this lock */
-	entry->data.lockmode = mode;
+	SharedBufferSetLockMode(entry, mode);
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -5922,7 +5348,7 @@ BufferLockUnlock(Buffer buffer, BufferDesc *buf_hdr)
 static bool
 BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 {
-	PrivateRefCountEntry *entry = GetPrivateRefCountEntry(buffer, true);
+	PrivateRefCountEntry *entry = GetSharedBufferEntry(buffer);
 	bool		mustwait;
 
 	/*
@@ -5930,7 +5356,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	 * already has locked, return false, independent of the existing and
 	 * desired lock level.
 	 */
-	if (entry->data.lockmode != BUFFER_LOCK_UNLOCK)
+	if (SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK)
 		return false;
 
 	/*
@@ -5950,7 +5376,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
 	}
 	else
 	{
-		entry->data.lockmode = mode;
+		SharedBufferSetLockMode(entry, mode);
 	}
 
 	return !mustwait;
@@ -6160,11 +5586,11 @@ BufferLockDisownInternal(Buffer buffer, BufferDesc *buf_hdr)
 	BufferLockMode mode;
 	PrivateRefCountEntry *ref;
 
-	ref = GetPrivateRefCountEntry(buffer, false);
+	ref = GetSharedBufferEntry(buffer);
 	if (ref == NULL)
 		elog(ERROR, "lock %d is not held", buffer);
-	mode = ref->data.lockmode;
-	ref->data.lockmode = BUFFER_LOCK_UNLOCK;
+	mode = SharedBufferGetLockMode(ref);
+	SharedBufferSetLockMode(ref, BUFFER_LOCK_UNLOCK);
 
 	return mode;
 }
@@ -6398,12 +5824,12 @@ static bool
 BufferLockHeldByMeInMode(BufferDesc *buf_hdr, BufferLockMode mode)
 {
 	PrivateRefCountEntry *entry =
-		GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+		GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
 
 	if (!entry)
 		return false;
 	else
-		return entry->data.lockmode == mode;
+		return SharedBufferGetLockMode(entry) == mode;
 }
 
 /*
@@ -6416,12 +5842,12 @@ static bool
 BufferLockHeldByMe(BufferDesc *buf_hdr)
 {
 	PrivateRefCountEntry *entry =
-		GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+		GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
 
 	if (!entry)
 		return false;
 	else
-		return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
+		return SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK;
 }
 
 /*
@@ -6517,9 +5943,13 @@ CheckBufferIsPinnedOnce(Buffer buffer)
 	}
 	else
 	{
-		if (GetPrivateRefCount(buffer) != 1)
-			elog(ERROR, "incorrect local pin count: %d",
-				 GetPrivateRefCount(buffer));
+		{
+			PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+			if (!ref || !SharedBufferHasSingleRef(ref))
+				elog(ERROR, "incorrect pin count: %d",
+					 ref ? SharedBufferRefCount(ref) : 0);
+		}
 	}
 }
 
@@ -6700,7 +6130,7 @@ HoldingBufferPinThatDelaysRecovery(void)
 	if (bufid < 0)
 		return false;
 
-	if (GetPrivateRefCount(bufid + 1) > 0)
+	if (GetSharedBufferEntry(bufid + 1) != NULL)
 		return true;
 
 	return false;
@@ -6735,10 +6165,13 @@ ConditionalLockBufferForCleanup(Buffer buffer)
 	}
 
 	/* There should be exactly one local pin */
-	refcount = GetPrivateRefCount(buffer);
-	Assert(refcount);
-	if (refcount != 1)
-		return false;
+	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+		Assert(ref != NULL);
+		if (!SharedBufferHasSingleRef(ref))
+			return false;
+	}
 
 	/* Try to acquire lock */
 	if (!ConditionalLockBuffer(buffer))
@@ -6790,8 +6223,12 @@ IsBufferCleanupOK(Buffer buffer)
 	}
 
 	/* There should be exactly one local pin */
-	if (GetPrivateRefCount(buffer) != 1)
-		return false;
+	{
+		PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+		if (!SharedBufferHasSingleRef(ref))
+			return false;
+	}
 
 	bufHdr = GetBufferDescriptor(buffer - 1);
 
@@ -6827,12 +6264,12 @@ SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta
 	PrivateRefCountEntry *ref;
 	BufferLockMode mode;
 
-	ref = GetPrivateRefCountEntry(buffer, true);
+	ref = GetSharedBufferEntry(buffer);
 
 	if (ref == NULL)
 		elog(ERROR, "buffer is not pinned");
 
-	mode = ref->data.lockmode;
+	mode = SharedBufferGetLockMode(ref);
 	if (mode == BUFFER_LOCK_UNLOCK)
 		elog(ERROR, "buffer is not locked");
 
@@ -6874,7 +6311,7 @@ SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta
 		if (likely(pg_atomic_compare_exchange_u64(&buf_hdr->state,
 												  &old_state, desired_state)))
 		{
-			ref->data.lockmode = BUFFER_LOCK_SHARE_EXCLUSIVE;
+			SharedBufferSetLockMode(ref, BUFFER_LOCK_SHARE_EXCLUSIVE);
 			*lockstate = desired_state;
 
 			return true;
@@ -7647,7 +7084,7 @@ ResOwnerReleaseBuffer(Datum res)
 	{
 		PrivateRefCountEntry *ref;
 
-		ref = GetPrivateRefCountEntry(buffer, false);
+		ref = GetSharedBufferEntry(buffer);
 
 		/* not having a private refcount would imply resowner corruption */
 		Assert(ref != NULL);
@@ -7656,7 +7093,7 @@ ResOwnerReleaseBuffer(Datum res)
 		 * If the buffer was locked at the time of the resowner release,
 		 * release the lock now. This should only happen after errors.
 		 */
-		if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
+		if (SharedBufferGetLockMode(ref) != BUFFER_LOCK_UNLOCK)
 		{
 			BufferDesc *buf = GetBufferDescriptor(buffer - 1);
 
@@ -7749,7 +7186,6 @@ EvictUnpinnedBuffer(Buffer buf, bool *buffer_flushed)
 
 	/* Make sure we can pin the buffer. */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	desc = GetBufferDescriptor(buf - 1);
 	LockBufHdr(desc);
@@ -7790,7 +7226,6 @@ EvictAllUnpinnedBuffers(int32 *buffers_evicted, int32 *buffers_flushed,
 			continue;
 
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		LockBufHdr(desc);
 
@@ -7844,7 +7279,6 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted,
 
 		/* Make sure we can pin the buffer. */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(desc);
 
@@ -7936,7 +7370,6 @@ MarkDirtyUnpinnedBuffer(Buffer buf, bool *buffer_already_dirty)
 
 	/* Make sure we can pin the buffer. */
 	ResourceOwnerEnlarge(CurrentResourceOwner);
-	ReservePrivateRefCountEntry();
 
 	desc = GetBufferDescriptor(buf - 1);
 	LockBufHdr(desc);
@@ -7989,7 +7422,6 @@ MarkDirtyRelUnpinnedBuffers(Relation rel,
 
 		/* Make sure we can pin the buffer. */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		buf_state = LockBufHdr(desc);
 
@@ -8041,7 +7473,6 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied,
 			continue;
 
 		ResourceOwnerEnlarge(CurrentResourceOwner);
-		ReservePrivateRefCountEntry();
 
 		LockBufHdr(desc);
 
@@ -8740,3 +8171,12 @@ const PgAioHandleCallbacks aio_local_buffer_readv_cb = {
 	.complete_local = local_buffer_readv_complete,
 	.report = buffer_readv_report,
 };
+
+
+/*
+ * Intentionally included at the bottom so that the compiler can inline 
+ * functions but developers are forced to use accessors.
+ * bufmgr.c should not be concerned with bufmgr_refcnt.c implementation
+ * details.
+ */
+ #include "bufmgr_refcnt.c"
diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
new file mode 100644
index 0000000000..fef8f98037
--- /dev/null
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -0,0 +1,615 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmgr_refcnt.c
+ *	  Backend-private buffer refcount tracking
+ *
+ * This file is included at the end of bufmgr.c to allow the compiler
+ * to inline functions. Do not compile this file separately.
+ *
+ * Each buffer has a private refcount that keeps track of the number of
+ * times the buffer is pinned in the current process.  This is so that the
+ * shared refcount needs to be modified only once if a buffer is pinned more
+ * than once by an individual backend. This mechanism is also used to track
+ * whether this backend has a buffer locked, and, if so, in what mode.
+ *
+ * To avoid - as we used to - requiring an array with NBuffers entries to keep
+ * track of local buffers, we use a small sequentially searched array
+ * (PrivateRefCountArrayKeys, with the corresponding data stored in
+ * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
+ * keep track of backend local pins.
+ *
+ * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
+ * refcounts are kept track of in the array; after that, new array entries
+ * displace old ones into the hash table. That way a frequently used entry
+ * can't get "stuck" in the hashtable while infrequent ones clog the array.
+ *
+ * This was initially designed trying to optimize for the case where the
+ * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES.
+ * However this might not be the case with the introduction of prefetching.
+ *
+ * To enter a buffer into the refcount tracking mechanism first reserve a free
+ * entry using ReservePrivateRefCountEntry() and then later, if necessary,
+ * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
+ * memory allocations in NewPrivateRefCountEntry() which can be important
+ * because in some scenarios it's called with a spinlock held...
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/buffer/bufmgr_refcnt.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "utils/hsearch.h"
+
+/* Structure definitions - internal to this file */
+typedef struct PrivateRefCountData
+{
+	int32		refcount;
+	BufferLockMode lockmode;
+} PrivateRefCountData;
+
+struct PrivateRefCountEntry
+{
+	Buffer		buffer;
+	PrivateRefCountData data;
+};
+
+struct PrivateRefCountIterator
+{
+	int			array_index;
+	bool		in_hash;
+	HASH_SEQ_STATUS *hash_status;
+};
+
+/* Private refcount array and keys */
+#define REFCOUNT_ARRAY_ENTRIES 8
+static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
+static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
+
+/* Overflow hash table for when array is full */
+static HTAB *PrivateRefCountHash = NULL;
+
+/* Count of entries that have overflowed into the hash table */
+static int32 PrivateRefCountOverflowed = 0;
+
+/* Clock hand for selecting victim when array is full */
+static uint32 PrivateRefCountClock = 0;
+
+/* Reserved slot index, or -1 if none reserved */
+static int	ReservedRefCountSlot = -1;
+
+/* Cache for last accessed entry */
+static int	PrivateRefCountEntryLast = -1;
+
+/* Advisory limit on the number of pins each backend should hold */
+static uint32 MaxProportionalPins = 0;
+
+/* Forward declarations */
+static void ReservePrivateRefCountEntry(void);
+static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+
+/*
+ * Initialize private refcount tracking for this backend.
+ */
+void
+InitPrivateRefCount(void)
+{
+	HASHCTL		hash_ctl;
+
+
+	/*
+	 * An advisory limit on the number of pins each backend should hold, based
+	 * on shared_buffers and the maximum number of connections possible.
+	 * That's very pessimistic, but outside toy-sized shared_buffers it should
+	 * allow plenty of pins.  LimitAdditionalPins() and
+	 * GetAdditionalPinLimit() can be used to check the remaining balance.
+	 */
+	 MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+
+	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
+	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
+
+	hash_ctl.keysize = sizeof(Buffer);
+	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
+
+	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Ensure that the PrivateRefCountArray has sufficient space to store one more
+ * entry.
+ */
+static void
+ReservePrivateRefCountEntry(void)
+{
+	/* Already reserved (or freed), nothing to do */
+	if (ReservedRefCountSlot != -1)
+		return;
+
+	/*
+	 * First search for a free entry the array, that'll be sufficient in the
+	 * majority of cases.
+	 */
+	{
+		int			i;
+
+		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+		{
+			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
+			{
+				ReservedRefCountSlot = i;
+			}
+		}
+
+		if (ReservedRefCountSlot != -1)
+			return;
+	}
+
+	/*
+	 * No luck. All array entries are full. Move one array entry into the hash
+	 * table.
+	 */
+	{
+		int			victim_slot;
+		PrivateRefCountEntry *victim_entry;
+		PrivateRefCountEntry *hashent;
+		bool		found;
+
+		/* select victim slot */
+		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
+		victim_entry = &PrivateRefCountArray[victim_slot];
+		ReservedRefCountSlot = victim_slot;
+
+		/* Better be used, otherwise we shouldn't get here. */
+		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
+		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
+		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
+
+		/* enter victim array entry into hashtable */
+		hashent = hash_search(PrivateRefCountHash,
+							  &PrivateRefCountArrayKeys[victim_slot],
+							  HASH_ENTER,
+							  &found);
+		Assert(!found);
+		hashent->data = victim_entry->data;
+
+		/* clear the now free array slot */
+		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
+		victim_entry->buffer = InvalidBuffer;
+
+		memset(&victim_entry->data, 0, sizeof(victim_entry->data));
+		victim_entry->data.refcount = 0;
+		victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+		PrivateRefCountOverflowed++;
+	}
+}
+
+/*
+ * Create a new refcount entry for the given buffer.
+ */
+static PrivateRefCountEntry *
+NewPrivateRefCountEntry(Buffer buffer)
+{
+	PrivateRefCountEntry *res;
+
+	/* only allowed to be called when a reservation has been made */
+	Assert(ReservedRefCountSlot != -1);
+
+	/* use up the reserved entry */
+	res = &PrivateRefCountArray[ReservedRefCountSlot];
+
+	/* and fill it */
+	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+	res->buffer = buffer;
+	res->data.refcount = 0;
+	res->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+	/* update cache for the next lookup */
+	PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+	ReservedRefCountSlot = -1;
+
+	return res;
+}
+
+/*
+ * Slow-path for GetSharedBufferEntry().
+ */
+static pg_noinline PrivateRefCountEntry *
+GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+{
+	PrivateRefCountEntry *res;
+	int			match = -1;
+	int			i;
+
+	/*
+	 * First search for references in the array.
+	 */
+	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	{
+		if (PrivateRefCountArrayKeys[i] == buffer)
+		{
+			match = i;
+		}
+	}
+
+	if (likely(match != -1))
+	{
+		PrivateRefCountEntryLast = match;
+		return &PrivateRefCountArray[match];
+	}
+
+	/*
+	 * Only look up the buffer in the hashtable if we've previously overflowed.
+	 */
+	if (PrivateRefCountOverflowed == 0)
+		return NULL;
+
+	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+
+	if (res == NULL)
+		return NULL;
+	else if (!do_move)
+	{
+		return res;
+	}
+	else
+	{
+		/* move buffer from hashtable into the free array slot */
+		bool		found;
+		PrivateRefCountEntry *free;
+
+		ReservePrivateRefCountEntry();
+
+		Assert(ReservedRefCountSlot != -1);
+		free = &PrivateRefCountArray[ReservedRefCountSlot];
+		Assert(free->buffer == InvalidBuffer);
+
+		free->buffer = buffer;
+		free->data = res->data;
+		PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+		PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+		ReservedRefCountSlot = -1;
+
+		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+		Assert(found);
+		Assert(PrivateRefCountOverflowed > 0);
+		PrivateRefCountOverflowed--;
+
+		return free;
+	}
+}
+
+/*
+ * Return the PrivateRefCountEntry for the passed buffer.
+ * Returns NULL if the buffer is not currently pinned.
+ */
+static PrivateRefCountEntry *
+GetSharedBufferEntry(Buffer buffer)
+{
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	/* Fast path: check one-entry cache */
+	if (likely(PrivateRefCountEntryLast != -1) &&
+		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
+	{
+		return &PrivateRefCountArray[PrivateRefCountEntryLast];
+	}
+
+	return GetPrivateRefCountEntrySlow(buffer, false);
+}
+
+/*
+ * Create a new refcount entry for a buffer that is known to not be pinned.
+ * This is a fast path that skips the cache/hash lookup.
+ * Returns the new entry pointer with refcount already incremented.
+ */
+static PrivateRefCountEntry *
+SharedBufferCreateRef(Buffer buffer)
+{
+	PrivateRefCountEntry *ref;
+
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	ReservePrivateRefCountEntry();
+	ref = NewPrivateRefCountEntry(buffer);
+	ref->data.refcount++;
+
+	return ref;
+}
+
+/*
+ * Increment the private refcount for an existing entry.
+ * Use when you already have the entry from a previous lookup.
+ */
+static void
+SharedBufferRefExisting(PrivateRefCountEntry *ref)
+{
+	Assert(ref != NULL);
+	Assert(ref->data.refcount > 0);
+	ref->data.refcount++;
+}
+
+/*
+ * Decrement the private refcount for a buffer.
+ * If the refcount reaches zero, removes the entry and returns true.
+ * Returns false if the buffer still has references.
+ */
+static bool
+SharedBufferUnref(PrivateRefCountEntry *ref)
+{
+	Assert(ref != NULL);
+	Assert(ref->data.refcount > 0);
+
+	ref->data.refcount--;
+
+	if (ref->data.refcount == 0)
+	{
+		/* No more references - clean up the entry */
+		Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+
+		if (ref >= &PrivateRefCountArray[0] &&
+			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
+		{
+			ref->buffer = InvalidBuffer;
+			PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
+			ReservedRefCountSlot = ref - PrivateRefCountArray;
+		}
+		else
+		{
+			bool		found;
+			Buffer		buffer = ref->buffer;
+
+			hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+			Assert(found);
+			Assert(PrivateRefCountOverflowed > 0);
+			PrivateRefCountOverflowed--;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Accessors for refcount entry fields.
+ */
+static int32
+SharedBufferRefCount(PrivateRefCountEntry *ref)
+{
+	return ref->data.refcount;
+}
+
+static BufferLockMode
+SharedBufferGetLockMode(PrivateRefCountEntry *ref)
+{
+	return ref->data.lockmode;
+}
+
+static void
+SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
+{
+	ref->data.lockmode = mode;
+}
+
+/*
+ * Check if the entry has exactly one reference.
+ */
+static bool
+SharedBufferHasSingleRef(PrivateRefCountEntry *ref)
+{
+	return ref != NULL && ref->data.refcount == 1;
+}
+
+/*
+ * Check for buffer refcount leaks.
+ */
+void
+CheckPrivateRefCountLeaks(void)
+{
+#ifdef USE_ASSERT_CHECKING
+	int			RefCountErrors = 0;
+	PrivateRefCountEntry *res;
+	int			i;
+	char	   *s;
+
+	/* check the array */
+	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	{
+		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
+		{
+			res = &PrivateRefCountArray[i];
+
+			s = DebugPrintBufferRefcount(res->buffer);
+			elog(WARNING, "buffer refcount leak: %s", s);
+			pfree(s);
+
+			RefCountErrors++;
+		}
+	}
+
+	/* if necessary search the hash */
+	if (PrivateRefCountOverflowed)
+	{
+		HASH_SEQ_STATUS hstat;
+
+		hash_seq_init(&hstat, PrivateRefCountHash);
+		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+		{
+			s = DebugPrintBufferRefcount(res->buffer);
+			elog(WARNING, "buffer refcount leak: %s", s);
+			pfree(s);
+			RefCountErrors++;
+		}
+	}
+
+	Assert(RefCountErrors == 0);
+#endif
+}
+
+#ifdef USE_ASSERT_CHECKING
+/* Forward declaration - defined in bufmgr.c */
+static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode);
+
+/*
+ * Check for exclusive-locked catalog buffers.  This is the core of
+ * AssertCouldGetRelation().
+ */
+void
+AssertBufferLocksPermitCatalogRead(void)
+{
+	PrivateRefCountIterator *iter;
+	PrivateRefCountEntry *res;
+
+	iter = InitPrivateRefCountIterator();
+	while ((res = GetNextPrivateRefCountEntry(iter)) != NULL)
+	{
+		Buffer		buf = res->buffer;
+
+		if (buf == InvalidBuffer)
+			continue;
+
+		AssertNotCatalogBufferLock(buf, res->data.lockmode);
+	}
+	FreePrivateRefCountIterator(iter);
+}
+#endif
+
+/*
+ * Initialize an iterator for walking all private refcount entries.
+ */
+PrivateRefCountIterator *
+InitPrivateRefCountIterator(void)
+{
+	PrivateRefCountIterator *iter = palloc(sizeof(PrivateRefCountIterator));
+
+	iter->array_index = 0;
+	iter->in_hash = false;
+	iter->hash_status = NULL;
+	return iter;
+}
+
+/*
+ * Get the next private refcount entry.
+ * Returns NULL when iteration is complete.
+ */
+PrivateRefCountEntry *
+GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
+{
+	/* First iterate through the array */
+	while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
+	{
+		int			idx = iter->array_index++;
+
+		if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
+			return &PrivateRefCountArray[idx];
+	}
+
+	/* Then iterate through the hash if there are overflowed entries */
+	if (!iter->in_hash)
+	{
+		iter->in_hash = true;
+		if (PrivateRefCountOverflowed > 0)
+		{
+			iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
+			hash_seq_init(iter->hash_status, PrivateRefCountHash);
+		}
+	}
+
+	if (iter->hash_status != NULL)
+	{
+		PrivateRefCountEntry *res;
+
+		res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+		if (res != NULL)
+			return res;
+
+		pfree(iter->hash_status);
+		iter->hash_status = NULL;
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an iterator from InitPrivateRefCountIterator.
+ */
+void
+FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
+{
+	if (iter->hash_status != NULL)
+	{
+		hash_seq_term(iter->hash_status);
+		pfree(iter->hash_status);
+	}
+	pfree(iter);
+}
+
+
+/*
+ * Return the maximum number of buffers that a backend should try to pin once,
+ * to avoid exceeding its fair share.  This is the highest value that
+ * GetAdditionalPinLimit() could ever return.  Note that it may be zero on a
+ * system with a very small buffer pool relative to max_connections.
+ */
+ uint32
+ GetPinLimit(void)
+ {
+	 return MaxProportionalPins;
+ }
+
+ /*
+  * Return the maximum number of additional buffers that this backend should
+  * pin if it wants to stay under the per-backend limit, considering the number
+  * of buffers it has already pinned.  Unlike LimitAdditionalPins(), the limit
+  * return by this function can be zero.
+  */
+ uint32
+ GetAdditionalPinLimit(void)
+ {
+	 uint32		estimated_pins_held;
+
+	 /*
+	  * We get the number of "overflowed" pins for free, but don't know the
+	  * number of pins in PrivateRefCountArray.  The cost of calculating that
+	  * exactly doesn't seem worth it, so just assume the max.
+	  */
+	 estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+	 /* Is this backend already holding more than its fair share? */
+	 if (estimated_pins_held > MaxProportionalPins)
+		 return 0;
+
+	 return MaxProportionalPins - estimated_pins_held;
+ }
+
+ /*
+  * Limit the number of pins a batch operation may additionally acquire, to
+  * avoid running out of pinnable buffers.
+  *
+  * One additional pin is always allowed, on the assumption that the operation
+  * requires at least one to make progress.
+  */
+ void
+ LimitAdditionalPins(uint32 *additional_pins)
+ {
+	 uint32		limit;
+
+	 if (*additional_pins <= 1)
+		 return;
+
+	 limit = GetAdditionalPinLimit();
+	 limit = Max(limit, 1);
+	 if (limit < *additional_pins)
+		 *additional_pins = limit;
+ }
diff --git a/src/backend/storage/buffer/bufmgr_refcnt.h b/src/backend/storage/buffer/bufmgr_refcnt.h
new file mode 100644
index 0000000000..f2a6f45d84
--- /dev/null
+++ b/src/backend/storage/buffer/bufmgr_refcnt.h
@@ -0,0 +1,59 @@
+/*-------------------------------------------------------------------------
+ *
+ * bufmgr_refcnt.h
+ *	  Backend-private buffer refcount tracking
+ *
+ * This header provides opaque declarations for the private refcount
+ * tracking system. The implementation is in bufmgr_refcnt.c which is
+ * included at the end of bufmgr.c for inlining.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/storage/buffer/bufmgr_refcnt.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUFMGR_REFCNT_H
+#define BUFMGR_REFCNT_H
+
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+
+/* Opaque handle to a private refcount entry */
+typedef struct PrivateRefCountEntry PrivateRefCountEntry;
+
+/* Opaque handle to an iterator */
+typedef struct PrivateRefCountIterator PrivateRefCountIterator;
+
+/* Initialization */
+extern void InitPrivateRefCount(void);
+
+/*
+ * Hot-path functions - forward declarations.
+ * Defined as static in bufmgr_refcnt.c which is included in bufmgr.c.
+ */
+static PrivateRefCountEntry *GetSharedBufferEntry(Buffer buffer);
+static PrivateRefCountEntry *SharedBufferCreateRef(Buffer buffer);
+static void SharedBufferRefExisting(PrivateRefCountEntry *ref);
+static bool SharedBufferUnref(PrivateRefCountEntry *ref);
+static BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref);
+static void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode);
+static int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
+static bool SharedBufferHasSingleRef(PrivateRefCountEntry *ref);
+
+/* Pin limiting */
+extern uint32 GetPinLimit(void);
+extern uint32 GetAdditionalPinLimit(void);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+
+/* Leak checking */
+extern void CheckPrivateRefCountLeaks(void);
+
+/* Iterator functions */
+extern PrivateRefCountIterator *InitPrivateRefCountIterator(void);
+extern PrivateRefCountEntry *GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter);
+extern void FreePrivateRefCountIterator(PrivateRefCountIterator *iter);
+
+
+#endif							/* BUFMGR_REFCNT_H */
-- 
2.34.1



  [application/x-patch] v2-0003-Using-simplehash.patch (13.2K, 6-v2-0003-Using-simplehash.patch)
  download | inline diff:
From bf146598cd94110da255c6fb45b4a5c58002a91d Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:07:54 +0000
Subject: [PATCH 3/5] Using simplehash

This patch replaces the HTAB implementation with a simplehash
as suggested by Andres Freund [1]. The simplehash implements a templated
open addressing hash, which have entry size information at compile time.

The access strategy of the simplehash is very close to the plain array
that was seen to be very efficient compared to the HTAB implementation,
with the additional advantage of using the key value to initialize the
search closer to where the key actually is, instead of always scanning
the entire array.

Adds one macro to the simplehash implementation enable overriding the
empty position detection, to enable using buffer == InvalidBuffer
instead of requiring a separate status field.

[1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5mw4

---
 src/backend/storage/buffer/bufmgr_refcnt.c | 89 +++++++++++-----------
 src/include/lib/simplehash.h               | 59 +++++++++-----
 2 files changed, 84 insertions(+), 64 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
index fef8f98037..dd95d953f6 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.c
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -42,7 +42,10 @@
  *-------------------------------------------------------------------------
  */
 
-#include "utils/hsearch.h"
+#include "common/hashfn.h"
+#include "storage/buf_internals.h"
+#include "bufmgr_refcnt.h"
+#include "storage/proc.h"
 
 /* Structure definitions - internal to this file */
 typedef struct PrivateRefCountData
@@ -53,15 +56,36 @@ typedef struct PrivateRefCountData
 
 struct PrivateRefCountEntry
 {
-	Buffer		buffer;
+	Buffer		buffer;		/* hash key - InvalidBuffer means empty */
 	PrivateRefCountData data;
 };
 
+/*
+ * Define simplehash parameters for the overflow hash table.
+ * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
+ * a separate status field.
+ */
+#define SH_PREFIX refcount
+#define SH_ELEMENT_TYPE PrivateRefCountEntry
+#define SH_KEY_TYPE Buffer
+#define SH_KEY buffer
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+/* Use buffer field as empty marker - no separate status needed */
+#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
+#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
+#define SH_MAKE_IN_USE(entry) ((void)0)  /* key assignment handles this */
+#include "lib/simplehash.h"
+
 struct PrivateRefCountIterator
 {
 	int			array_index;
 	bool		in_hash;
-	HASH_SEQ_STATUS *hash_status;
+	refcount_iterator hash_iter;
+	bool		hash_iter_valid;
 };
 
 /* Private refcount array and keys */
@@ -70,7 +94,7 @@ static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
 static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
 
 /* Overflow hash table for when array is full */
-static HTAB *PrivateRefCountHash = NULL;
+static refcount_hash *PrivateRefCountHash = NULL;
 
 /* Count of entries that have overflowed into the hash table */
 static int32 PrivateRefCountOverflowed = 0;
@@ -98,26 +122,18 @@ static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buff
 void
 InitPrivateRefCount(void)
 {
-	HASHCTL		hash_ctl;
-
-
 	/*
 	 * An advisory limit on the number of pins each backend should hold, based
 	 * on shared_buffers and the maximum number of connections possible.
 	 * That's very pessimistic, but outside toy-sized shared_buffers it should
 	 * allow plenty of pins.  LimitAdditionalPins() and
 	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	 */
-	 MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
+	*/
+	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	hash_ctl.keysize = sizeof(Buffer);
-	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
-
-	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
-									  HASH_ELEM | HASH_BLOBS);
+	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
 }
 
 /*
@@ -171,10 +187,9 @@ ReservePrivateRefCountEntry(void)
 		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
 
 		/* enter victim array entry into hashtable */
-		hashent = hash_search(PrivateRefCountHash,
-							  &PrivateRefCountArrayKeys[victim_slot],
-							  HASH_ENTER,
-							  &found);
+		hashent = refcount_insert(PrivateRefCountHash,
+								  PrivateRefCountArrayKeys[victim_slot],
+								  &found);
 		Assert(!found);
 		hashent->data = victim_entry->data;
 
@@ -251,7 +266,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	if (PrivateRefCountOverflowed == 0)
 		return NULL;
 
-	res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+	res = refcount_lookup(PrivateRefCountHash, buffer);
 
 	if (res == NULL)
 		return NULL;
@@ -262,7 +277,6 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 	else
 	{
 		/* move buffer from hashtable into the free array slot */
-		bool		found;
 		PrivateRefCountEntry *free;
 
 		ReservePrivateRefCountEntry();
@@ -278,8 +292,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
 
 		ReservedRefCountSlot = -1;
 
-		hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-		Assert(found);
+		refcount_delete_item(PrivateRefCountHash, res);
 		Assert(PrivateRefCountOverflowed > 0);
 		PrivateRefCountOverflowed--;
 
@@ -366,11 +379,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		}
 		else
 		{
-			bool		found;
-			Buffer		buffer = ref->buffer;
-
-			hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
-			Assert(found);
+			refcount_delete_item(PrivateRefCountHash, ref);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
 		}
@@ -441,10 +450,10 @@ CheckPrivateRefCountLeaks(void)
 	/* if necessary search the hash */
 	if (PrivateRefCountOverflowed)
 	{
-		HASH_SEQ_STATUS hstat;
+		refcount_iterator iter;
 
-		hash_seq_init(&hstat, PrivateRefCountHash);
-		while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+		refcount_start_iterate(PrivateRefCountHash, &iter);
+		while ((res = refcount_iterate(PrivateRefCountHash, &iter)) != NULL)
 		{
 			s = DebugPrintBufferRefcount(res->buffer);
 			elog(WARNING, "buffer refcount leak: %s", s);
@@ -495,7 +504,7 @@ InitPrivateRefCountIterator(void)
 
 	iter->array_index = 0;
 	iter->in_hash = false;
-	iter->hash_status = NULL;
+	iter->hash_iter_valid = false;
 	return iter;
 }
 
@@ -521,21 +530,20 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 		iter->in_hash = true;
 		if (PrivateRefCountOverflowed > 0)
 		{
-			iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
-			hash_seq_init(iter->hash_status, PrivateRefCountHash);
+			refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
+			iter->hash_iter_valid = true;
 		}
 	}
 
-	if (iter->hash_status != NULL)
+	if (iter->hash_iter_valid)
 	{
 		PrivateRefCountEntry *res;
 
-		res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+		res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
 		if (res != NULL)
 			return res;
 
-		pfree(iter->hash_status);
-		iter->hash_status = NULL;
+		iter->hash_iter_valid = false;
 	}
 
 	return NULL;
@@ -547,11 +555,6 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 void
 FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 {
-	if (iter->hash_status != NULL)
-	{
-		hash_seq_term(iter->hash_status);
-		pfree(iter->hash_status);
-	}
 	pfree(iter);
 }
 
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 848719232a..3c03a7e9c9 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
 #define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey))
 #endif
 
+/*
+ * Macros to check/set entry status. Users can override these to avoid
+ * needing a separate status field if their key type has an "invalid" value.
+ */
+#ifndef SH_ENTRY_EMPTY
+#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_EMPTY
+#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_IN_USE
+#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE)
+#endif
+
 /*
  * Wrap the following definitions in include guards, to avoid multiple
  * definition errors if this header is included more than once.  The rest of
@@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 		uint32		hash;
 		uint32		optimal;
 
-		if (oldentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(oldentry))
 		{
 			startelem = i;
 			break;
@@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 	{
 		SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
 
-		if (oldentry->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(oldentry))
 		{
 			uint32		hash;
 			uint32		startelem2;
@@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
 			{
 				newentry = &newdata[curelem];
 
-				if (newentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(newentry))
 				{
 					break;
 				}
@@ -653,14 +667,14 @@ restart:
 		SH_ELEMENT_TYPE *entry = &data[curelem];
 
 		/* any empty bucket can directly be used */
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			tb->members++;
 			entry->SH_KEY = key;
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -675,7 +689,7 @@ restart:
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
-			Assert(entry->status == SH_STATUS_IN_USE);
+			Assert(!SH_ENTRY_EMPTY(entry));
 			*found = true;
 			return entry;
 		}
@@ -699,7 +713,7 @@ restart:
 				emptyelem = SH_NEXT(tb, emptyelem, startelem);
 				emptyentry = &data[emptyelem];
 
-				if (emptyentry->status == SH_STATUS_EMPTY)
+				if (SH_ENTRY_EMPTY(emptyentry))
 				{
 					lastentry = emptyentry;
 					break;
@@ -748,7 +762,7 @@ restart:
 #ifdef SH_STORE_HASH
 			SH_GET_HASH(tb, entry) = hash;
 #endif
-			entry->status = SH_STATUS_IN_USE;
+			SH_MAKE_IN_USE(entry);
 			*found = false;
 			return entry;
 		}
@@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			return NULL;
 		}
 
-		Assert(entry->status == SH_STATUS_IN_USE);
+		Assert(!SH_ENTRY_EMPTY(entry));
 
 		if (SH_COMPARE_KEYS(tb, hash, key, entry))
 			return entry;
@@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[curelem];
 
-		if (entry->status == SH_STATUS_EMPTY)
+		if (SH_ENTRY_EMPTY(entry))
 			return false;
 
-		if (entry->status == SH_STATUS_IN_USE &&
+		if (!SH_ENTRY_EMPTY(entry) &&
 			SH_COMPARE_KEYS(tb, hash, key, entry))
 		{
 			SH_ELEMENT_TYPE *lastentry = entry;
@@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				curelem = SH_NEXT(tb, curelem, startelem);
 				curentry = &tb->data[curelem];
 
-				if (curentry->status != SH_STATUS_IN_USE)
+				if (SH_ENTRY_EMPTY(curentry))
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
 				/* current is at optimal position, done */
 				if (curoptimal == curelem)
 				{
-					lastentry->status = SH_STATUS_EMPTY;
+					SH_MAKE_EMPTY(lastentry);
 					break;
 				}
 
@@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		curelem = SH_NEXT(tb, curelem, startelem);
 		curentry = &tb->data[curelem];
 
-		if (curentry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(curentry))
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
 		/* current is at optimal position, done */
 		if (curoptimal == curelem)
 		{
-			lastentry->status = SH_STATUS_EMPTY;
+			SH_MAKE_EMPTY(lastentry);
 			break;
 		}
 
@@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 	{
 		SH_ELEMENT_TYPE *entry = &tb->data[i];
 
-		if (entry->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(entry))
 		{
 			startelem = i;
 			break;
@@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
 
 		if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
 			iter->done = true;
-		if (elem->status == SH_STATUS_IN_USE)
+		if (!SH_ENTRY_EMPTY(elem))
 		{
 			return elem;
 		}
@@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb)
 
 		elem = &tb->data[i];
 
-		if (elem->status != SH_STATUS_IN_USE)
+		if (SH_ENTRY_EMPTY(elem))
 			continue;
 
 		hash = SH_ENTRY_HASH(tb, elem);
@@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb)
 #undef SH_STORE_HASH
 #undef SH_USE_NONDEFAULT_ALLOCATOR
 #undef SH_EQUAL
+#undef SH_ENTRY_EMPTY
+#undef SH_MAKE_EMPTY
+#undef SH_MAKE_IN_USE
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
-- 
2.34.1



  [application/x-patch] v2-0001-Benchmark-buffer-pinning.patch (22.3K, 7-v2-0001-Benchmark-buffer-pinning.patch)
  download | inline diff:
From 36b69e2028c8febb35ca2c45327384251c44b775 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 11 Mar 2026 12:05:50 +0000
Subject: [PATCH 1/5] Benchmark buffer pinning

---
 src/test/modules/test_buffer_pin/Makefile     |  18 +
 src/test/modules/test_buffer_pin/benchmark.py | 185 ++++++++++
 .../modules/test_buffer_pin/requirements.txt  |   9 +
 .../test_buffer_pin/test_buffer_pin--1.0.sql  |  66 ++++
 .../modules/test_buffer_pin/test_buffer_pin.c | 323 ++++++++++++++++++
 .../test_buffer_pin/test_buffer_pin.control   |   4 +
 6 files changed, 605 insertions(+)
 create mode 100644 src/test/modules/test_buffer_pin/Makefile
 create mode 100755 src/test/modules/test_buffer_pin/benchmark.py
 create mode 100644 src/test/modules/test_buffer_pin/requirements.txt
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.c
 create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.control

diff --git a/src/test/modules/test_buffer_pin/Makefile b/src/test/modules/test_buffer_pin/Makefile
new file mode 100644
index 0000000000..2707f84c5b
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/Makefile
@@ -0,0 +1,18 @@
+MODULE_big = test_buffer_pin
+OBJS = test_buffer_pin.o
+
+EXTENSION = test_buffer_pin
+DATA = test_buffer_pin--1.0.sql
+
+PGFILEDESC = "test_buffer_pin - buffer pinning benchmarks"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_buffer_pin
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_buffer_pin/benchmark.py b/src/test/modules/test_buffer_pin/benchmark.py
new file mode 100755
index 0000000000..004044bc5c
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/benchmark.py
@@ -0,0 +1,185 @@
+#!/usr/bin/env python3
+"""
+Buffer pinning benchmark - runs SQL benchmarks and saves results.
+
+Usage:
+    python3 benchmark.py [options]
+
+Examples:
+    python3 benchmark.py --port 5434 --name pg18-baseline --title "PostgreSQL 18 Baseline"
+    python3 benchmark.py --port 5434 --name patch-0005 --title "With Patch 0005"
+"""
+
+import argparse
+import numpy as np
+import pandas as pd
+from sqlalchemy import create_engine, text
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+
+def geomspace_int(start, stop, num):
+    """Generate geometrically spaced integers, pushing duplicates forward."""
+    raw = np.geomspace(start, stop, num)
+    result = []
+    for v in raw:
+        candidate = max(int(round(v)), result[-1] + 1 if result else 1)
+        result.append(candidate)
+    return result
+
+
+def ensure_test_table(conn, table_name, min_blocks):
+    """Create a table with enough blocks for the benchmark."""
+    rows_needed = min_blocks * 226 + 10000
+    print(f"Creating table {table_name} with ~{min_blocks} blocks...")
+    conn.execute(text(f"DROP TABLE IF EXISTS {table_name}"))
+    conn.execute(text(f"CREATE UNLOGGED TABLE {table_name} AS SELECT generate_series(1, {rows_needed}) AS id"))
+    result = conn.execute(text(f"SELECT pg_relation_size('{table_name}') / 8192 as blocks")).fetchone()
+    assert result[0] >= min_blocks, f"Table {table_name} has {result[0]} blocks, expected at least {min_blocks}"
+
+def run_benchmark(conn, func_template, buffer_counts):
+    """Run a benchmark function across all buffer counts and patterns in one query."""
+    array_str = str(buffer_counts)
+    sql = text(f"""
+        SELECT
+            CASE WHEN r.random THEN 'random' ELSE 'sequential' END as pattern,
+            b.num_buffers,
+            percentile_cont(0.5) WITHIN GROUP (ORDER BY bench.per_op_ns) as median_ns
+        FROM
+            unnest(ARRAY[false, true]) AS r(random),
+            unnest(ARRAY[{array_str}]) AS b(num_buffers),
+            LATERAL {func_template} AS bench
+        WHERE bench.iteration > 0
+        GROUP BY r.random, b.num_buffers
+        ORDER BY r.random, b.num_buffers
+    """)
+    return [dict(row._mapping) for row in conn.execute(sql)]
+
+
+def save_results(results_dir, name, results_dict, title):
+    """Save benchmark results to CSV and generate SVG plot."""
+    results_dir = Path(results_dir)
+    results_dir.mkdir(parents=True, exist_ok=True)
+
+    all_data = []
+    for bench_name, data in results_dict.items():
+        df = pd.DataFrame(data)
+        df['benchmark'] = bench_name
+        all_data.append(df)
+
+    combined_df = pd.concat(all_data, ignore_index=True)
+
+    csv_path = results_dir / f"{name}.csv"
+    combined_df.to_csv(csv_path, index=False)
+    print(f"Saved CSV: {csv_path}")
+    return combined_df
+
+
+def plot_results(results_dir, name, results_dict, title):
+    """Generate SVG plot from benchmark results."""
+    results_dir = Path(results_dir)
+
+    fig, ax = plt.subplots(figsize=(10, 6))
+
+    benchmarks = [
+        ('read', 'blue', 'o', 'ReadBuffer/ReleaseBuffer'),
+        ('pinning', 'green', 's', 'IncrBufferRefCount/ReleaseBuffer'),
+        ('locking', 'purple', 'd', 'LockBuffer/UnlockBuffer'),
+        ('resowner', 'red', '^', 'ResourceOwner Remember/Forget'),
+    ]
+
+    for bench_name, color, marker, label in benchmarks:
+        if bench_name not in results_dict:
+            continue
+        df = pd.DataFrame(results_dict[bench_name])
+
+        seq_data = df[df['pattern'] == 'sequential']
+        ax.plot(seq_data['num_buffers'], seq_data['median_ns'],
+                color=color, linestyle='-', marker=marker,
+                linewidth=2, markersize=6, label=f'{label} (seq)')
+
+        rand_data = df[df['pattern'] == 'random']
+        ax.plot(rand_data['num_buffers'], rand_data['median_ns'],
+                color=color, linestyle='--', marker=marker,
+                linewidth=2, markersize=6, label=f'{label} (rand)')
+
+    ax.set_xlabel('Number of Buffers (sliding window)')
+    ax.set_ylabel('Time per operation (ns)')
+    ax.set_title(title)
+    ax.set_ylim(0, None)
+    ax.grid(True, alpha=0.3)
+    ax.set_xscale('log', base=2)
+    ax.legend(fontsize=8, loc='upper left')
+
+    plt.tight_layout()
+    svg_path = results_dir / f"{name}.svg"
+    plt.savefig(svg_path, format='svg')
+    plt.close()
+    print(f"Saved SVG: {svg_path}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Buffer pinning benchmark')
+    parser.add_argument('--iterations', '-n', type=int, default=10000,
+                        help='Number of operations per sample (default: 10000)')
+    parser.add_argument('--samples', type=int, default=50,
+                        help='Number of samples per data point (default: 50)')
+    parser.add_argument('--port', type=int, default=5432,
+                        help='PostgreSQL port (default: 5432)')
+    parser.add_argument('--points', type=int, default=50,
+                        help='Number of data points (default: 200)')
+    parser.add_argument('--max-dist', type=int, default=500,
+                        help='Maximum buffer count to test (default: 512)')
+    parser.add_argument('--name', default='benchmark',
+                        help='Name for output files (default: benchmark)')
+    parser.add_argument('--title', default='PostgreSQL Buffer Pinning Performance',
+                        help='Title for the plot')
+    parser.add_argument('--results-dir', default=None,
+                        help='Directory for results (default: ./results)')
+    parser.add_argument('--table', default='bench_large',
+                        help='Table name to use for benchmarks (default: bench_large)')
+    args = parser.parse_args()
+
+    if args.results_dir is None:
+        args.results_dir = Path(__file__).parent / 'results'
+
+    engine = create_engine(f'postgresql://localhost:{args.port}/postgres')
+    buffer_counts = geomspace_int(1, args.max_dist, args.points)
+
+    min_blocks_needed = args.max_dist + 100
+
+    with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
+        conn.execute(text("CREATE EXTENSION IF NOT EXISTS test_buffer_pin"))
+        ensure_test_table(conn, args.table, min_blocks_needed)
+
+
+        n, s = args.iterations, args.samples
+        print(f"\nRunning benchmarks: {n} iterations, {s} samples, max {args.max_dist} buffers")
+
+        results = {}
+
+        print("  Running: bench_pinning...")
+        results['pinning'] = run_benchmark(conn,
+            f"bench_pinning('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_prefetch_pipeline...")
+        results['read'] = run_benchmark(conn,
+            f"bench_prefetch_pipeline('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+        print("  Running: bench_resowner...")
+        results['resowner'] = run_benchmark(conn,
+            f"bench_resowner(b.num_buffers, {n}, {s}, r.random)",
+            buffer_counts)
+
+    save_results(args.results_dir, args.name, results, args.title)
+    plot_results(args.results_dir, args.name, results, args.title)
+
+    print(f"\nDone! Results saved to {args.results_dir}/{args.name}.*")
+
+
+if __name__ == '__main__':
+    main()
diff --git a/src/test/modules/test_buffer_pin/requirements.txt b/src/test/modules/test_buffer_pin/requirements.txt
new file mode 100644
index 0000000000..e73ab56927
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/requirements.txt
@@ -0,0 +1,9 @@
+# Requirements for benchmark.py
+# pip install -r requirements.txt
+# not enforcing versions, as it might simply
+# work with your installed versions
+numpy            # tested with 1.24+
+pandas           # tested with 2.0+
+sqlalchemy       # tested with 2.0+
+matplotlib       # tested with 3.7+
+psycopg2-binary  # tested with 2.9+
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
new file mode 100644
index 0000000000..4c43c41496
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
@@ -0,0 +1,66 @@
+/* test_buffer_pin--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_buffer_pin" to load this file. \quit
+
+CREATE FUNCTION bench_prefetch_pipeline(
+    relname text,
+    prefetch_dist int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_prefetch_pipeline'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_prefetch_pipeline IS
+'Benchmark buffer pinning with sliding window prefetch simulation.
+Arguments:
+  relname        - name of the relation to use
+  prefetch_dist  - number of buffers to keep pinned (sliding window size)
+  num_ops        - number of pin/unpin operations to perform
+  iterations     - number of times to repeat the benchmark
+  random_access  - if true, access blocks randomly; if false, sequentially
+';
+
+CREATE FUNCTION bench_pinning(
+    relname text,
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_pinning'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_pinning IS
+'Benchmark pure pin/unpin operations without buffer lookup.
+Uses IncrBufferRefCount/ReleaseBuffer on pre-pinned buffers to isolate
+refcount tracking overhead from buffer table lookups and I/O.
+Arguments:
+  relname       - name of the relation to use for pinning buffers
+  num_buffers   - number of buffers to keep as base pins
+  num_ops       - number of pin/unpin pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access buffers randomly; if false, sequentially';
+
+CREATE FUNCTION bench_resowner(
+    num_buffers int,
+    num_ops int,
+    iterations int,
+    random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_resowner'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_resowner IS
+'Benchmark ResourceOwner remember/forget operations only.
+Uses fake resource values - no actual resources are tracked.
+Arguments:
+  num_buffers   - number of fake resources to track
+  num_ops       - number of remember/forget pairs to perform
+  iterations    - number of times to repeat the benchmark
+  random_access - if true, access resources randomly; if false, sequentially';
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.c b/src/test/modules/test_buffer_pin/test_buffer_pin.c
new file mode 100644
index 0000000000..7b6c91678f
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.c
@@ -0,0 +1,323 @@
+/*
+ * test_buffer_pin.c - Buffer pinning benchmark
+ */
+
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/namespace.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+#include "utils/resowner.h"
+#include "utils/varlena.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(bench_prefetch_pipeline);
+PG_FUNCTION_INFO_V1(bench_pinning);
+PG_FUNCTION_INFO_V1(bench_resowner);
+
+/* Custom ResourceOwnerDesc for benchmark - does nothing on release */
+static void bench_release_resource(Datum res) { /* no-op */ }
+
+static const ResourceOwnerDesc bench_resowner_desc = {
+	.name = "BenchmarkResource",
+	.release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+	.release_priority = RELEASE_PRIO_FIRST,
+	.ReleaseResource = bench_release_resource,
+	.DebugPrint = NULL,
+};
+
+/*
+ * Generate an access sequence for benchmarking.
+ * If random_access is true, uses Fisher-Yates shuffle.
+ * Otherwise, generates sequential pattern modulo num_items.
+ */
+static void
+generate_access_sequence(int *sequence, int num_operations, int num_items, bool random_access)
+{
+	for (int i = 0; i < num_operations; i++)
+		sequence[i] = i % num_items;
+
+	if (random_access)
+	{
+		for (int i = num_operations - 1; i > 0; i--)
+		{
+			int j = pg_prng_uint64_range(&pg_global_prng_state, 0, i);
+			int tmp = sequence[i];
+			sequence[i] = sequence[j];
+			sequence[j] = tmp;
+		}
+	}
+}
+
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Warms up the cache by reading all blocks in the relation.
+ * Precompute a block sequence in which the buffers will be read.
+ * Then scan times repeatedly a scan following the block sequence.
+ * with a fixed distance of `num_buffers` (plus ramp up and ramp down)
+*/
+Datum
+bench_prefetch_pipeline(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *pipeline;
+	int		   *block_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (nblocks == 0)
+		ereport(ERROR, (errmsg("relation has no blocks")));
+
+	pipeline = palloc0(num_buffers * sizeof(Buffer));
+	block_sequence = palloc(num_operations * sizeof(int));
+
+	/* Warm up */
+	for (int i = 0; i < num_operations && i < (int) nblocks; i++)
+	{
+		Buffer buf = ReadBuffer(rel, i);
+		ReleaseBuffer(buf);
+	}
+	generate_access_sequence(block_sequence, num_operations, nblocks, random_access);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+			if (i < num_operations)
+				pipeline[i % num_buffers] = ReadBuffer(rel, block_sequence[i]);
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(pipeline);
+	pfree(block_sequence);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Pre-reads buffers into cache, then times IncrBufferRefCount/ReleaseBuffer
+ * cycles in a pipelined fashion. This is intended to separate the
+ * pin count dependent part of the ReadBuffer/ReleaseBuffer operations
+ */
+Datum
+bench_pinning(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	text	   *relname_text = PG_GETARG_TEXT_PP(0);
+	int			num_buffers = PG_GETARG_INT32(1);
+	int			num_operations = PG_GETARG_INT32(2);
+	int			iterations = PG_GETARG_INT32(3);
+	bool		random_access = PG_GETARG_BOOL(4);
+
+	Oid			relid;
+	Relation	rel;
+	BlockNumber nblocks;
+	Buffer	   *base_buffers;
+	Buffer	   *pipeline;
+	int		   *access_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+	rel = relation_open(relid, AccessShareLock);
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if ((BlockNumber) num_buffers > nblocks)
+		ereport(ERROR, (errmsg("not enough blocks in relation")));
+
+	base_buffers = palloc(num_buffers * sizeof(Buffer));
+	pipeline = palloc(num_buffers * sizeof(Buffer));
+	access_sequence = palloc(num_operations * sizeof(int));
+
+	generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+	/* Pin the buffers as base pins (keeps them in cache) */
+	for (int i = 0; i < num_buffers; i++)
+		base_buffers[i] = ReadBuffer(rel, i);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+			if (i < num_operations)
+			{
+				Buffer buf = base_buffers[access_sequence[i]];
+				IncrBufferRefCount(buf);
+				pipeline[i % num_buffers] = buf;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/* Release base pins */
+	for (int i = 0; i < num_buffers; i++)
+		ReleaseBuffer(base_buffers[i]);
+
+	pfree(access_sequence);
+	pfree(pipeline);
+	pfree(base_buffers);
+	relation_close(rel, AccessShareLock);
+
+	return (Datum) 0;
+}
+
+/*
+ * bench_resowner - benchmark ResourceOwner remember/forget operations
+ *
+ * Uses fake resource values to test pure ResourceOwner tracking overhead
+ * without any actual resource operations. Uses same pipelined structure.
+ */
+Datum
+bench_resowner(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	int			num_buffers = PG_GETARG_INT32(0);
+	int			num_operations = PG_GETARG_INT32(1);
+	int			iterations = PG_GETARG_INT32(2);
+	bool		random_access = PG_GETARG_BOOL(3);
+
+	Datum	   *resources;
+	Datum	   *pipeline;
+	int		   *access_sequence;
+	int			iter;
+
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	Datum		values[3];
+	bool		nulls[3] = {false, false, false};
+
+	if (num_buffers < 1 || num_operations < num_buffers)
+		ereport(ERROR, (errmsg("invalid parameters")));
+
+	InitMaterializedSRF(fcinfo, 0);
+	tupstore = rsinfo->setResult;
+	tupdesc = rsinfo->setDesc;
+
+	resources = palloc(num_buffers * sizeof(Datum));
+	pipeline = palloc(num_buffers * sizeof(Datum));
+	access_sequence = palloc(num_operations * sizeof(int));
+
+	/* Use fake resource values (pointers to array elements for uniqueness) */
+	for (int i = 0; i < num_buffers; i++)
+		resources[i] = PointerGetDatum(&resources[i]);
+
+	generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+	for (iter = 0; iter < iterations; iter++)
+	{
+		instr_time	start_time, end_time;
+		int64		elapsed_ns;
+
+		INSTR_TIME_SET_CURRENT(start_time);
+
+		for (int i = 0; i < num_operations + num_buffers; i++)
+		{
+			if (i >= num_buffers)
+				ResourceOwnerForget(CurrentResourceOwner,
+									pipeline[(i - num_buffers) % num_buffers],
+									&bench_resowner_desc);
+			if (i < num_operations)
+			{
+				Datum		res = resources[access_sequence[i]];
+				ResourceOwnerEnlarge(CurrentResourceOwner);
+				ResourceOwnerRemember(CurrentResourceOwner, res, &bench_resowner_desc);
+				pipeline[i % num_buffers] = res;
+			}
+		}
+
+		INSTR_TIME_SET_CURRENT(end_time);
+		INSTR_TIME_SUBTRACT(end_time, start_time);
+		elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+		values[0] = Int32GetDatum(iter);
+		values[1] = Int64GetDatum(elapsed_ns);
+		values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(access_sequence);
+	pfree(pipeline);
+	pfree(resources);
+
+	return (Datum) 0;
+}
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.control b/src/test/modules/test_buffer_pin/test_buffer_pin.control
new file mode 100644
index 0000000000..f065941712
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.control
@@ -0,0 +1,4 @@
+comment = 'test_buffer_pin - buffer pinning benchmarks'
+default_version = '1.0'
+module_pathname = '$libdir/test_buffer_pin'
+relocatable = true
-- 
2.34.1



  [application/x-patch] v2-0005-Array-direct-mapping.patch (15.3K, 8-v2-0005-Array-direct-mapping.patch)
  download | inline diff:
From 33138ffb0d834bc452e9d2d1819f21871c64210f Mon Sep 17 00:00:00 2001
From: bob <[email protected]>
Date: Wed, 11 Mar 2026 12:08:04 +0000
Subject: [PATCH 5/5] Array direct mapping

Instead of searching all array entries, each buffer is mapped to exactly
one slot at index (buffer % REFCOUNT_ARRAY_ENTRIES). Collisions are
resolved by evicting to the hash table.

This simplifies lookup to O(1) for the array check and should improve
performance for num_buffers > 8 while maintaining performance for
small numbers of buffers.
---
 src/backend/storage/buffer/bufmgr_refcnt.c | 307 +++++++++------------
 1 file changed, 132 insertions(+), 175 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr_refcnt.c b/src/backend/storage/buffer/bufmgr_refcnt.c
index bf84e9f62e..726697ebf7 100644
--- a/src/backend/storage/buffer/bufmgr_refcnt.c
+++ b/src/backend/storage/buffer/bufmgr_refcnt.c
@@ -12,26 +12,15 @@
  * than once by an individual backend. This mechanism is also used to track
  * whether this backend has a buffer locked, and, if so, in what mode.
  *
- * To avoid - as we used to - requiring an array with NBuffers entries to keep
- * track of local buffers, we use a small sequentially searched array
- * (PrivateRefCountArrayKeys, with the corresponding data stored in
- * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
- * keep track of backend local pins.
+ * To avoid requiring an array with NBuffers entries, we use a dynamically
+ * sized direct-mapped array (PrivateRefCountArray) and an overflow hash table
+ * (PrivateRefCountHash). Each buffer maps to exactly one array slot at
+ * index (buffer & mask). When a collision occurs (two different buffers
+ * map to the same slot), the existing entry is evicted to the hash table.
+ * The array grows automatically when occupation exceeds a specified threshold.
  *
- * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
- * refcounts are kept track of in the array; after that, new array entries
- * displace old ones into the hash table. That way a frequently used entry
- * can't get "stuck" in the hashtable while infrequent ones clog the array.
- *
- * This was initially designed trying to optimize for the case where the
- * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES.
- * However this might not be the case with the introduction of prefetching.
- *
- * To enter a buffer into the refcount tracking mechanism first reserve a free
- * entry using ReservePrivateRefCountEntry() and then later, if necessary,
- * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
- * memory allocations in NewPrivateRefCountEntry() which can be important
- * because in some scenarios it's called with a spinlock held...
+ * This design provides O(1) lookup for the common case where there are no
+ * collisions, while gracefully handling overflow via the hash table.
  *
  * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -86,33 +77,40 @@ struct PrivateRefCountIterator
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
-/* Private refcount array and keys */
-#define REFCOUNT_ARRAY_ENTRIES 8
-static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
-static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
+/*
+ * Private refcount array - direct-mapped by (buffer & mask).
+ * Each buffer maps to exactly one slot. Collisions evict to hash table.
+ * 
+ * Consider an array with N slots, after storing n random buffers
+ *  
+ *  - The probability of a single buffer hitting a given slot is 1/N
+ *  - The probability of a single buffer not hitting a given slot is 1-1/N
+ *  - The probability of n buffers not hitting a given slot is (1-1/N)^n
+ *  - the expected number of vacant slots is N * (1-1/N)^n ~ N * e^(-n/N)
+ * 
+ * The last approximation should be not too bad for large number,
+ * and that gives e.g. 63% occupation for n = N, 86% ocupation for n = 2*N
+ */
+#define REFCOUNT_ARRAY_INITIAL_SIZE		8
+#define REFCOUNT_ARRAY_MAX_OCCUPATION	0.86
+#define REFCOUNT_ARRAY_MAX_SIZE			(1 << 20)	/* 1M entries max */
+
+static struct PrivateRefCountEntry *PrivateRefCountArray = NULL;
+static int32 PrivateRefCountArrayMask = 0;		/* (size - 1) for fast modulo */
+static int32 PrivateRefCountArrayUsed = 0;		/* entries in the array */
+static int32 PrivateRefCountArrayTolerated = 0;	/* grow when Used exceeds this */
 
-/* Overflow hash table for when array is full */
+/* Overflow hash table for collisions */
 static refcount_hash *PrivateRefCountHash = NULL;
 
-/* Count of entries that have overflowed into the hash table */
+/* Count of entries in the hash table */
 static int32 PrivateRefCountOverflowed = 0;
 
-/* Clock hand for selecting victim when array is full */
-static uint32 PrivateRefCountClock = 0;
-
-/* Reserved slot index, or -1 if none reserved */
-static int	ReservedRefCountSlot = -1;
-
-/* Cache for last accessed entry */
-static int	PrivateRefCountEntryLast = -1;
-
 /* Advisory limit on the number of pins each backend should hold */
 static uint32 MaxProportionalPins = 0;
 
-/* Forward declarations */
-static void ReservePrivateRefCountEntry(Buffer buffer);
-static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
+/* Forward declaration */
+static void GrowPrivateRefCountArray(void);
 
 /*
  * Initialize private refcount tracking for this backend.
@@ -126,174 +124,135 @@ InitPrivateRefCount(void)
 	 * That's very pessimistic, but outside toy-sized shared_buffers it should
 	 * allow plenty of pins.  LimitAdditionalPins() and
 	 * GetAdditionalPinLimit() can be used to check the remaining balance.
-	*/
+	 */
 	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
-	memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
 
-	PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
+	/* Initialize the direct-mapped array */
+	PrivateRefCountArrayMask = REFCOUNT_ARRAY_INITIAL_SIZE - 1;
+	PrivateRefCountArrayUsed = 0;
+	PrivateRefCountArrayTolerated = (int32)(REFCOUNT_ARRAY_INITIAL_SIZE * REFCOUNT_ARRAY_MAX_OCCUPATION);
+	PrivateRefCountArray = MemoryContextAllocZero(TopMemoryContext,
+						REFCOUNT_ARRAY_INITIAL_SIZE * sizeof(PrivateRefCountEntry));
+
+	PrivateRefCountHash = refcount_create(TopMemoryContext, 64, NULL);
 }
 
 /*
- * Ensure that the PrivateRefCountArray has sufficient space to store one more
- * entry.
+ * Grow the private refcount array when usage exceeds tolerated occupation.
+ * Doubles the size and rehashes all existing entries.
  */
 static void
-ReservePrivateRefCountEntry(Buffer buffer)
+GrowPrivateRefCountArray(void)
 {
-	/* Already reserved (or freed), nothing to do */
-	if (ReservedRefCountSlot != -1)
+	PrivateRefCountEntry *old_array = PrivateRefCountArray;
+	int32		old_size = PrivateRefCountArrayMask + 1;
+	int32		new_size = old_size * 2;
+	int32		new_mask = new_size - 1;
+	int32		i;
+
+	/* Don't grow beyond maximum */
+	if (new_size > REFCOUNT_ARRAY_MAX_SIZE)
 		return;
 
-	/*
-	 * First search for a free entry in the array, that'll be sufficient in
-	 * the majority of cases.
-	 */
-	{
-		int			i;
-
-		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
-		{
-			if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
-			{
-				ReservedRefCountSlot = i;
-				return;
-			}
-		}
-	}
-
-	/*
-	 * No luck. All array entries are full. Move one array entry into the hash
-	 * table.
-	 */
-	{
-		int			victim_slot;
-		PrivateRefCountEntry *victim_entry;
-		PrivateRefCountEntry *hashent;
-		bool		found;
-
-		/* select victim slot */
-		victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
-		victim_entry = &PrivateRefCountArray[victim_slot];
-		ReservedRefCountSlot = victim_slot;
-
-		/* Better be used, otherwise we shouldn't get here. */
-		Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
-		Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
-		Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
-
-		/* enter victim array entry into hashtable */
-		hashent = refcount_insert(PrivateRefCountHash,
-								  PrivateRefCountArrayKeys[victim_slot],
-								  &found);
-		Assert(!found);
-		hashent->data = victim_entry->data;
-
-		/* clear the now free array slot */
-		PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
-		victim_entry->buffer = InvalidBuffer;
-		victim_entry->data = 0;
-
-		PrivateRefCountOverflowed++;
-	}
-}
-
-/*
- * Create a new refcount entry for the given buffer.
- */
-static PrivateRefCountEntry *
-NewPrivateRefCountEntry(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-
-	/* only allowed to be called when a reservation has been made */
-	Assert(ReservedRefCountSlot != -1);
-
-	/* use up the reserved entry */
-	res = &PrivateRefCountArray[ReservedRefCountSlot];
-
-	/* and fill it */
-	PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
-	res->buffer = buffer;
-	res->data = 0;
-
-	/* update cache for the next lookup */
-	PrivateRefCountEntryLast = ReservedRefCountSlot;
-
-	ReservedRefCountSlot = -1;
-
-	return res;
-}
-
-/*
- * Slow-path for GetSharedBufferEntry().
- */
-static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer)
-{
-	PrivateRefCountEntry *res;
-	int			i;
+	/* Allocate new array and update metadata */
+	PrivateRefCountArray = MemoryContextAllocZero(TopMemoryContext,
+						new_size * sizeof(PrivateRefCountEntry));
+	PrivateRefCountArrayMask = new_mask;
+	PrivateRefCountArrayTolerated = (int32)(new_size * REFCOUNT_ARRAY_MAX_OCCUPATION);
 
 	/*
-	 * First search for references in the array.
+	 * Rehash entries from old array. When doubling, each old slot maps to
+	 * two possible new slots (slot or slot + old_size), so entries from
+	 * different old slots cannot collide. PrivateRefCountArrayUsed stays same.
 	 */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	for (i = 0; i < old_size; i++)
 	{
-		if (PrivateRefCountArrayKeys[i] == buffer)
+		if (old_array[i].buffer != InvalidBuffer)
 		{
-			PrivateRefCountEntryLast = i;
-			return &PrivateRefCountArray[i];
+			int			new_slot = old_array[i].buffer & new_mask;
+
+			PrivateRefCountArray[new_slot] = old_array[i];
 		}
 	}
 
-	/*
-	 * Only look up the buffer in the hashtable if we've previously overflowed.
-	 */
-	if (PrivateRefCountOverflowed == 0)
-		return NULL;
-
-	res = refcount_lookup(PrivateRefCountHash, buffer);
-	return res;
+	pfree(old_array);
 }
 
 /*
  * Return the PrivateRefCount entry for the passed buffer.
  * Returns NULL if the buffer is not currently pinned.
+ *
+ * With direct-mapped array, lookup is O(1): check the slot at
+ * (buffer & mask), then check hash table if needed.
  */
 static PrivateRefCountEntry *
 GetSharedBufferEntry(Buffer buffer)
 {
+	int			slot = buffer & PrivateRefCountArrayMask;
+	PrivateRefCountEntry *entry = &PrivateRefCountArray[slot];
+
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	/* Fast path: check one-entry cache */
-	if (likely(PrivateRefCountEntryLast != -1) &&
-		likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
-	{
-		return &PrivateRefCountArray[PrivateRefCountEntryLast];
-	}
+	/* Check the direct-mapped slot */
+	if (entry->buffer == buffer)
+		return entry;
 
-	return GetPrivateRefCountEntrySlow(buffer);
+	/* Check hash table if there are overflowed entries */
+	if (PrivateRefCountOverflowed > 0)
+		return refcount_lookup(PrivateRefCountHash, buffer);
+
+	return NULL;
 }
 
 /*
  * Create a new refcount entry for a buffer that is known to not be pinned.
- * This is a fast path that skips the cache/hash lookup.
- * Returns the new entry pointer with refcount already incremented.
+ * Returns the new entry pointer with refcount set to 1.
+ *
+ * If the direct-mapped slot contains a different buffer, it is evicted
+ * to the hash table first. The array grows when overflow exceeds 20% of size.
  */
 static PrivateRefCountEntry *
 SharedBufferCreateRef(Buffer buffer)
 {
-	PrivateRefCountEntry *ref;
+	int			slot;
+	PrivateRefCountEntry *entry;
+	bool		was_empty;
 
 	Assert(BufferIsValid(buffer));
 	Assert(!BufferIsLocal(buffer));
 
-	ReservePrivateRefCountEntry(buffer);
-	ref = NewPrivateRefCountEntry(buffer);
-	ref->data = ONE_PRIVATE_REFERENCE;
+	/* Grow when array usage exceeds tolerated occupation */
+	if (PrivateRefCountArrayUsed > PrivateRefCountArrayTolerated)
+		GrowPrivateRefCountArray();
+
+	slot = buffer & PrivateRefCountArrayMask;
+	entry = &PrivateRefCountArray[slot];
+	was_empty = (entry->buffer == InvalidBuffer);
+
+	/* If slot contains a different buffer, evict it to hash table */
+	if (!was_empty && entry->buffer != buffer)
+	{
+		PrivateRefCountEntry *hashent;
+		bool		found;
+
+		hashent = refcount_insert(PrivateRefCountHash, entry->buffer, &found);
+		Assert(!found);
+		hashent->data = entry->data;
+
+		PrivateRefCountOverflowed++;
+		was_empty = true;
+	}
+
+	/* Track array usage */
+	if (was_empty)
+		PrivateRefCountArrayUsed++;
+
+	/* Use the slot for our buffer */
+	entry->buffer = buffer;
+	entry->data = ONE_PRIVATE_REFERENCE;
 
-	return ref;
+	return entry;
 }
 
 /*
@@ -327,15 +286,15 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
 		Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
 
 		if (ref >= &PrivateRefCountArray[0] &&
-			ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
+			ref < &PrivateRefCountArray[PrivateRefCountArrayMask + 1])
 		{
+			/* Array entry - mark as empty */
 			ref->buffer = InvalidBuffer;
-			PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
-			ReservedRefCountSlot = ref - PrivateRefCountArray;
+			PrivateRefCountArrayUsed--;
 		}
 		else
 		{
-			/* could make slightly more efficient by using the pointer */
+			/* Hash table entry */
 			refcount_delete(PrivateRefCountHash, ref->buffer);
 			Assert(PrivateRefCountOverflowed > 0);
 			PrivateRefCountOverflowed--;
@@ -392,9 +351,9 @@ CheckPrivateRefCountLeaks(void)
 	char	   *s;
 
 	/* check the array */
-	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+	for (i = 0; i <= PrivateRefCountArrayMask; i++)
 	{
-		if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
+		if (PrivateRefCountArray[i].buffer != InvalidBuffer)
 		{
 			res = &PrivateRefCountArray[i];
 
@@ -475,11 +434,11 @@ PrivateRefCountEntry *
 GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
 {
 	/* First iterate through the array */
-	while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
+	while (!iter->in_hash && iter->array_index <= PrivateRefCountArrayMask)
 	{
 		int			idx = iter->array_index++;
 
-		if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
+		if (PrivateRefCountArray[idx].buffer != InvalidBuffer)
 			return &PrivateRefCountArray[idx];
 	}
 
@@ -547,11 +506,9 @@ FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
 	 uint32		estimated_pins_held;
 
 	 /*
-	  * We get the number of "overflowed" pins for free, but don't know the
-	  * number of pins in PrivateRefCountArray.  The cost of calculating that
-	  * exactly doesn't seem worth it, so just assume the max.
+	  * We track array usage, so we can get an accurate count.
 	  */
-	 estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+	 estimated_pins_held = PrivateRefCountOverflowed + PrivateRefCountArrayUsed;
 
 	 /* Is this backend already holding more than its fair share? */
 	 if (estimated_pins_held > MaxProportionalPins)
-- 
2.34.1



view thread (6+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Addressing buffer private reference count scalability issue
  In-Reply-To: <CAE8JnxNqf=sYB-hfeHBEtXi+aC8jezqqukdgRuR2=t8nbetL=w@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox