From: Alexandre Felipe <[email protected]>
To: PostgreSQL Hackers <[email protected]>
To: Andres Freund <[email protected]>
Subject: Addressing buffer private reference count scalability issue
Date: Sun, 8 Mar 2026 16:09:07 +0000
Message-ID: <CAE8JnxNTETEUiAOF31=_yo=pvyAi9npOeJfcTvEJJbi4vomtYA@mail.gmail.com>
Hi Hackers,
This patch series addresses a performance issue pointed out by Andres
Freund [1]. It consists of five patches:
1. Benchmark buffer pinning: the usual benchmark code. I implemented a few
   functions that can be used in Postgres queries, and a Python script that
   runs them and produces CSV files and SVG plots for the current build.
2. Refactoring reference counting: before starting to change code and
   potentially breaking things, I considered it prudent to isolate the
   reference counting code to limit the damage. This code was part of an
   8k+ LOC file.
3. Using simplehash: simply replacing the HTAB with a simplehash, and adding
   a new set of macros (SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE) that
   allow using the InvalidBuffer special value instead of allocating extra
   space for a validity flag. Here I assume that the buffer sequence is
   independent enough from the array size, so I use the buffer directly as
   the hash, omitting a hash function call.
4. Compact PrivateRefCountEntry: the original implementation used a 4-byte
   key and an 8-byte value. The reference count uses 32 bits, and it is
   unreasonable to expect one backend to pin the same buffer 1 billion
   times. The lock mode uses 32 bits but can only take 4 values. So I packed
   them into a single uint32, giving 30 bits to the count and 2 bits to the
   lock mode. This makes the entries 8 bytes long; compared with the 12-byte
   {key, value} entry this is a one-third reduction in memory. It also
   aligns the array with 64-bit words, so every entry is aligned and copying
   one entry can be completed in one instruction on 64-bit CPUs.
5. REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically an array
   lookup, it is worth trying to remove the array completely and keep only
   the hash. For small values we would be trading a few branches for a
   buffer % SIZE. For the prefetch use case, where buffers are pinned and
   unpinned in FIFO fashion, it will save an 8-entry array lookup and some
   extra data moves.
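To make items 3 and 5 concrete, here is a minimal sketch of an open-addressing
table that uses a sentinel key as the "empty" marker and the key itself as the
hash. It is not the simplehash code: simplehash is a macro template with its
own collision handling, growth and deletion, all omitted here. The names Slot,
Table, EMPTY_KEY, TABLE_SIZE, table_lookup and table_insert are hypothetical,
and EMPTY_KEY only stands in for InvalidBuffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not the simplehash implementation. */
#define TABLE_SIZE 16   /* power of two, so "key % SIZE" is a single mask */
#define EMPTY_KEY  0    /* stands in for InvalidBuffer */

typedef struct Slot
{
    int32_t  key;       /* EMPTY_KEY marks a free slot; no status field */
    uint32_t value;
} Slot;

typedef struct Table
{
    Slot slots[TABLE_SIZE];     /* all-zero memory means "all empty" */
} Table;

/* Identity hash: the key itself selects the starting slot. */
static inline uint32_t
start_slot(int32_t key)
{
    return (uint32_t) key & (TABLE_SIZE - 1);
}

/* Probe linearly from the key's home slot; stop at the first empty slot. */
static Slot *
table_lookup(Table *tb, int32_t key)
{
    uint32_t i = start_slot(key);

    for (uint32_t n = 0; n < TABLE_SIZE; n++)
    {
        Slot *s = &tb->slots[i];

        if (s->key == EMPTY_KEY)
            return NULL;
        if (s->key == key)
            return s;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}

/* Return the slot for key, claiming the first empty slot if it is new. */
static Slot *
table_insert(Table *tb, int32_t key)
{
    uint32_t i = start_slot(key);

    assert(key != EMPTY_KEY);   /* the sentinel can never be stored */
    for (uint32_t n = 0; n < TABLE_SIZE; n++)
    {
        Slot *s = &tb->slots[i];

        if (s->key == key)
            return s;
        if (s->key == EMPTY_KEY)
        {
            s->key = key;       /* writing the key is what marks it in use */
            s->value = 0;
            return s;
        }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;                /* table full; a real table would grow */
}
```

Because writing the key is what marks a slot as used, a "make in use" step has
nothing left to do, which matches SH_MAKE_IN_USE being a no-op in the patch.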
In addition to the patches, I am including:
- A bash script to apply and benchmark the patches sequentially. You might
  have to adjust REPO_ROOT; in my case the script derives it from its own
  path, which is under $REPO_ROOT/.patches/pins/.
- A compare-patches.py script that can be copied to
  src/test/modules/test_buffer_pin to turn the benchmark CSV into figures
  showing one metric for different patches, instead of different metrics for
  one patch as benchmark.py produces.
- A nicely formatted post about this [2]
Regards,
Alexandre
[1] https://www.postgresql.org/message-id/s5p7iou7pdhxhvmv4rohmskwqmr36dc4rybvwlep5yvwrjs4pa%406oxsemms5...
[2] https://afelipe.hashnode.dev/postgres-backend-buffer-pinning-algorithm
Attachments:
[application/octet-stream] v1-0003-Using-simplehash.patch (12.8K, 3-v1-0003-Using-simplehash.patch)
From 077520420223d3bc14c9f7b073c15021aae20388 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 16:55:43 +0000
Subject: [PATCH 3/5] Using simplehash
This patch replaces the HTAB implementation with a simplehash,
as suggested by Andres Freund [1]. The simplehash implements a templated
open-addressing hash table, which has entry size information at compile
time. The access strategy of the simplehash is very close to the plain
array, which was seen to be very efficient compared to the HTAB
implementation, with the additional advantage of using the key value to
start the search closer to where the key actually is, instead of always
scanning the entire array.
---
src/backend/storage/buffer/buf_refcount.c | 86 +++++++++++------------
src/include/lib/simplehash.h | 59 ++++++++++------
2 files changed, 81 insertions(+), 64 deletions(-)
diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
index 1c0bec29c93..ff37355a61e 100644
--- a/src/backend/storage/buffer/buf_refcount.c
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -40,10 +40,10 @@
*/
#include "postgres.h"
+#include "common/hashfn.h"
#include "storage/buf_internals.h"
#include "storage/buf_refcount.h"
#include "storage/proc.h"
-#include "utils/hsearch.h"
@@ -55,15 +55,36 @@ typedef struct PrivateRefCountData
struct PrivateRefCountEntry
{
- Buffer buffer;
+ Buffer buffer; /* hash key - InvalidBuffer means empty */
PrivateRefCountData data;
};
+/*
+ * Define simplehash parameters for the overflow hash table.
+ * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
+ * a separate status field.
+ */
+#define SH_PREFIX refcount
+#define SH_ELEMENT_TYPE PrivateRefCountEntry
+#define SH_KEY_TYPE Buffer
+#define SH_KEY buffer
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+/* Use buffer field as empty marker - no separate status needed */
+#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
+#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
+#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment handles this */
+#include "lib/simplehash.h"
+
struct PrivateRefCountIterator
{
int array_index;
bool in_hash;
- HASH_SEQ_STATUS *hash_status;
+ refcount_iterator hash_iter;
+ bool hash_iter_valid;
};
/* Private refcount array and keys */
@@ -72,7 +93,7 @@ static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
/* Overflow hash table for when array is full */
-static HTAB *PrivateRefCountHash = NULL;
+static refcount_hash *PrivateRefCountHash = NULL;
/* Count of entries that have overflowed into the hash table */
static int32 PrivateRefCountOverflowed = 0;
@@ -100,26 +121,18 @@ static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buff
void
InitPrivateRefCount(void)
{
- HASHCTL hash_ctl;
-
-
/*
* An advisory limit on the number of pins each backend should hold, based
* on shared_buffers and the maximum number of connections possible.
* That's very pessimistic, but outside toy-sized shared_buffers it should
* allow plenty of pins. LimitAdditionalPins() and
* GetAdditionalPinLimit() can be used to check the remaining balance.
- */
- MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
+ */
+ MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
- hash_ctl.keysize = sizeof(Buffer);
- hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
-
- PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
- HASH_ELEM | HASH_BLOBS);
+ PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
}
/*
@@ -173,10 +186,9 @@ ReservePrivateRefCountEntry(void)
Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
/* enter victim array entry into hashtable */
- hashent = hash_search(PrivateRefCountHash,
- &PrivateRefCountArrayKeys[victim_slot],
- HASH_ENTER,
- &found);
+ hashent = refcount_insert(PrivateRefCountHash,
+ PrivateRefCountArrayKeys[victim_slot],
+ &found);
Assert(!found);
hashent->data = victim_entry->data;
@@ -253,7 +265,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
if (PrivateRefCountOverflowed == 0)
return NULL;
- res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+ res = refcount_lookup(PrivateRefCountHash, buffer);
if (res == NULL)
return NULL;
@@ -264,7 +276,6 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
else
{
/* move buffer from hashtable into the free array slot */
- bool found;
PrivateRefCountEntry *free;
ReservePrivateRefCountEntry();
@@ -280,8 +291,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
ReservedRefCountSlot = -1;
- hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
- Assert(found);
+ refcount_delete_item(PrivateRefCountHash, res);
Assert(PrivateRefCountOverflowed > 0);
PrivateRefCountOverflowed--;
@@ -384,11 +394,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
}
else
{
- bool found;
- Buffer buffer = ref->buffer;
-
- hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
- Assert(found);
+ refcount_delete_item(PrivateRefCountHash, ref);
Assert(PrivateRefCountOverflowed > 0);
PrivateRefCountOverflowed--;
}
@@ -456,10 +462,10 @@ CheckPrivateRefCountLeaks(void)
/* if necessary search the hash */
if (PrivateRefCountOverflowed)
{
- HASH_SEQ_STATUS hstat;
+ refcount_iterator iter;
- hash_seq_init(&hstat, PrivateRefCountHash);
- while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+ refcount_start_iterate(PrivateRefCountHash, &iter);
+ while ((res = refcount_iterate(PrivateRefCountHash, &iter)) != NULL)
{
s = DebugPrintBufferRefcount(res->buffer);
elog(WARNING, "buffer refcount leak: %s", s);
@@ -482,7 +488,7 @@ InitPrivateRefCountIterator(void)
iter->array_index = 0;
iter->in_hash = false;
- iter->hash_status = NULL;
+ iter->hash_iter_valid = false;
return iter;
}
@@ -508,21 +514,20 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
iter->in_hash = true;
if (PrivateRefCountOverflowed > 0)
{
- iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
- hash_seq_init(iter->hash_status, PrivateRefCountHash);
+ refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
+ iter->hash_iter_valid = true;
}
}
- if (iter->hash_status != NULL)
+ if (iter->hash_iter_valid)
{
PrivateRefCountEntry *res;
- res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+ res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
if (res != NULL)
return res;
- pfree(iter->hash_status);
- iter->hash_status = NULL;
+ iter->hash_iter_valid = false;
}
return NULL;
@@ -534,11 +539,6 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
void
FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
{
- if (iter->hash_status != NULL)
- {
- hash_seq_term(iter->hash_status);
- pfree(iter->hash_status);
- }
pfree(iter);
}
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 848719232a4..3c03a7e9c9b 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -287,6 +287,20 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
#define SH_COMPARE_KEYS(tb, ahash, akey, b) (SH_EQUAL(tb, b->SH_KEY, akey))
#endif
+/*
+ * Macros to check/set entry status. Users can override these to avoid
+ * needing a separate status field if their key type has an "invalid" value.
+ */
+#ifndef SH_ENTRY_EMPTY
+#define SH_ENTRY_EMPTY(entry) ((entry)->status == SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_EMPTY
+#define SH_MAKE_EMPTY(entry) ((entry)->status = SH_STATUS_EMPTY)
+#endif
+#ifndef SH_MAKE_IN_USE
+#define SH_MAKE_IN_USE(entry) ((entry)->status = SH_STATUS_IN_USE)
+#endif
+
/*
* Wrap the following definitions in include guards, to avoid multiple
* definition errors if this header is included more than once. The rest of
@@ -544,7 +558,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
uint32 hash;
uint32 optimal;
- if (oldentry->status != SH_STATUS_IN_USE)
+ if (SH_ENTRY_EMPTY(oldentry))
{
startelem = i;
break;
@@ -566,7 +580,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
{
SH_ELEMENT_TYPE *oldentry = &olddata[copyelem];
- if (oldentry->status == SH_STATUS_IN_USE)
+ if (!SH_ENTRY_EMPTY(oldentry))
{
uint32 hash;
uint32 startelem2;
@@ -582,7 +596,7 @@ SH_GROW(SH_TYPE * tb, uint64 newsize)
{
newentry = &newdata[curelem];
- if (newentry->status == SH_STATUS_EMPTY)
+ if (SH_ENTRY_EMPTY(newentry))
{
break;
}
@@ -653,14 +667,14 @@ restart:
SH_ELEMENT_TYPE *entry = &data[curelem];
/* any empty bucket can directly be used */
- if (entry->status == SH_STATUS_EMPTY)
+ if (SH_ENTRY_EMPTY(entry))
{
tb->members++;
entry->SH_KEY = key;
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
#endif
- entry->status = SH_STATUS_IN_USE;
+ SH_MAKE_IN_USE(entry);
*found = false;
return entry;
}
@@ -675,7 +689,7 @@ restart:
if (SH_COMPARE_KEYS(tb, hash, key, entry))
{
- Assert(entry->status == SH_STATUS_IN_USE);
+ Assert(!SH_ENTRY_EMPTY(entry));
*found = true;
return entry;
}
@@ -699,7 +713,7 @@ restart:
emptyelem = SH_NEXT(tb, emptyelem, startelem);
emptyentry = &data[emptyelem];
- if (emptyentry->status == SH_STATUS_EMPTY)
+ if (SH_ENTRY_EMPTY(emptyentry))
{
lastentry = emptyentry;
break;
@@ -748,7 +762,7 @@ restart:
#ifdef SH_STORE_HASH
SH_GET_HASH(tb, entry) = hash;
#endif
- entry->status = SH_STATUS_IN_USE;
+ SH_MAKE_IN_USE(entry);
*found = false;
return entry;
}
@@ -810,12 +824,12 @@ SH_LOOKUP_HASH_INTERNAL(SH_TYPE * tb, SH_KEY_TYPE key, uint32 hash)
{
SH_ELEMENT_TYPE *entry = &tb->data[curelem];
- if (entry->status == SH_STATUS_EMPTY)
+ if (SH_ENTRY_EMPTY(entry))
{
return NULL;
}
- Assert(entry->status == SH_STATUS_IN_USE);
+ Assert(!SH_ENTRY_EMPTY(entry));
if (SH_COMPARE_KEYS(tb, hash, key, entry))
return entry;
@@ -868,10 +882,10 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
{
SH_ELEMENT_TYPE *entry = &tb->data[curelem];
- if (entry->status == SH_STATUS_EMPTY)
+ if (SH_ENTRY_EMPTY(entry))
return false;
- if (entry->status == SH_STATUS_IN_USE &&
+ if (!SH_ENTRY_EMPTY(entry) &&
SH_COMPARE_KEYS(tb, hash, key, entry))
{
SH_ELEMENT_TYPE *lastentry = entry;
@@ -894,9 +908,9 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
curelem = SH_NEXT(tb, curelem, startelem);
curentry = &tb->data[curelem];
- if (curentry->status != SH_STATUS_IN_USE)
+ if (SH_ENTRY_EMPTY(curentry))
{
- lastentry->status = SH_STATUS_EMPTY;
+ SH_MAKE_EMPTY(lastentry);
break;
}
@@ -906,7 +920,7 @@ SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key)
/* current is at optimal position, done */
if (curoptimal == curelem)
{
- lastentry->status = SH_STATUS_EMPTY;
+ SH_MAKE_EMPTY(lastentry);
break;
}
@@ -957,9 +971,9 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
curelem = SH_NEXT(tb, curelem, startelem);
curentry = &tb->data[curelem];
- if (curentry->status != SH_STATUS_IN_USE)
+ if (SH_ENTRY_EMPTY(curentry))
{
- lastentry->status = SH_STATUS_EMPTY;
+ SH_MAKE_EMPTY(lastentry);
break;
}
@@ -969,7 +983,7 @@ SH_DELETE_ITEM(SH_TYPE * tb, SH_ELEMENT_TYPE * entry)
/* current is at optimal position, done */
if (curoptimal == curelem)
{
- lastentry->status = SH_STATUS_EMPTY;
+ SH_MAKE_EMPTY(lastentry);
break;
}
@@ -997,7 +1011,7 @@ SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
{
SH_ELEMENT_TYPE *entry = &tb->data[i];
- if (entry->status != SH_STATUS_IN_USE)
+ if (SH_ENTRY_EMPTY(entry))
{
startelem = i;
break;
@@ -1063,7 +1077,7 @@ SH_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter)
if ((iter->cur & tb->sizemask) == (iter->end & tb->sizemask))
iter->done = true;
- if (elem->status == SH_STATUS_IN_USE)
+ if (!SH_ENTRY_EMPTY(elem))
{
return elem;
}
@@ -1140,7 +1154,7 @@ SH_STAT(SH_TYPE * tb)
elem = &tb->data[i];
- if (elem->status != SH_STATUS_IN_USE)
+ if (SH_ENTRY_EMPTY(elem))
continue;
hash = SH_ENTRY_HASH(tb, elem);
@@ -1205,6 +1219,9 @@ SH_STAT(SH_TYPE * tb)
#undef SH_STORE_HASH
#undef SH_USE_NONDEFAULT_ALLOCATOR
#undef SH_EQUAL
+#undef SH_ENTRY_EMPTY
+#undef SH_MAKE_EMPTY
+#undef SH_MAKE_IN_USE
/* undefine locally declared macros */
#undef SH_MAKE_PREFIX
--
2.53.0
[application/octet-stream] v1-0004-Compact-PrivateRefCountEntry.patch (10.8K, 4-v1-0004-Compact-PrivateRefCountEntry.patch)
From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 17:15:44 +0000
Subject: [PATCH 4/5] Compact PrivateRefCountEntry
The previous implementation used an 8-byte (64-bit) entry to store
a uint32 count and a uint32 lock mode. That is fine when we store
the data separately from the key (buffer). But in the simplehash
{key, value} are stored together, so each entry is 12 bytes.
This is somewhat awkward, as we have to either pad the entry to 16 bytes
or accept accesses that alternate between aligned and misaligned addresses.
However, we are probably not going to need even 16 bits for the count,
and 2 bits are enough for the lock mode. So we can easily pack these
into a single uint32.
Incrementing/decrementing the count works as before, just using
4 instead of 1; lock mode access requires one or two additional
bitwise operations.
No bit-shifts are required.
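As a rough illustration of the packing described above, the following
standalone sketch keeps the lock mode in bits 0-1 and the count in bits 2-31,
so that pinning and unpinning add or subtract 4 without any shift. The names
PackedEntry, LOCKMODE_MASK, ONE_REFERENCE and the packed_* helpers are
hypothetical and not the identifiers used in the patch; the int32_t key
stands in for Buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only. */
#define LOCKMODE_MASK 0x3u  /* bits 0-1: lock mode (4 possible values) */
#define ONE_REFERENCE 4u    /* 1 << 2: one pin, counted in bits 2-31 */

typedef struct PackedEntry
{
    int32_t  buffer;        /* key; stands in for Buffer */
    uint32_t data;          /* (refcount << 2) | lockmode */
} PackedEntry;

/* Key plus packed value fit in one 64-bit word. */
_Static_assert(sizeof(PackedEntry) == 8, "entry should be 8 bytes");

static inline uint32_t
packed_lockmode(uint32_t d)
{
    return d & LOCKMODE_MASK;
}

/* Reading the count is the only place that needs a shift. */
static inline uint32_t
packed_refcount(uint32_t d)
{
    return d >> 2;
}

/* The count is zero exactly when no bit above the lock mode bits is set. */
static inline bool
packed_is_zero(uint32_t d)
{
    return d < ONE_REFERENCE;
}

static inline uint32_t
packed_set_lockmode(uint32_t d, uint32_t mode)
{
    return (d & ~LOCKMODE_MASK) | (mode & LOCKMODE_MASK);
}

/* Pin: add one reference; the lock mode bits are untouched. */
static inline uint32_t
packed_pin(uint32_t d)
{
    return d + ONE_REFERENCE;
}

/* Unpin: drop one reference; the caller must hold at least one. */
static inline uint32_t
packed_unpin(uint32_t d)
{
    assert(!packed_is_zero(d));
    return d - ONE_REFERENCE;
}
```

The zero test compares against ONE_REFERENCE rather than 0, so it stays
correct even while lock mode bits are set.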
---
src/backend/storage/buffer/buf_refcount.c | 167 +++++++++-------------
1 file changed, 70 insertions(+), 97 deletions(-)
diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
index ff37355a61e..29dfb720997 100644
--- a/src/backend/storage/buffer/buf_refcount.c
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -40,53 +40,54 @@
*/
#include "postgres.h"
-#include "common/hashfn.h"
#include "storage/buf_internals.h"
#include "storage/buf_refcount.h"
#include "storage/proc.h"
-typedef struct PrivateRefCountData
+/*
+ * Implementation details - opaque to callers.
+ * Packed refcount and lockmode in a single uint32:
+ * Bits 0-1: lockmode (4 values: UNLOCK, SHARE, SHARE_EXCLUSIVE, EXCLUSIVE)
+ * Bits 2-31: refcount (30 bits, max ~1 billion)
+ */
+struct PrivateRefCountEntry
{
- int32 refcount;
- BufferLockMode lockmode;
-} PrivateRefCountData;
+ Buffer buffer;
+ uint32 data;
+};
-struct PrivateRefCountEntry
+#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3
+#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */
+
+#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
+#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
+#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2))
+#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE)
+
+struct PrivateRefCountIterator
{
- Buffer buffer; /* hash key - InvalidBuffer means empty */
- PrivateRefCountData data;
+ int array_index;
+ bool in_hash;
+ void *hash_status;
};
-/*
- * Define simplehash parameters for the overflow hash table.
- * We use buffer == InvalidBuffer as the "empty" marker to avoid needing
- * a separate status field.
- */
+/* Define simplehash for private refcount overflow hash table */
#define SH_PREFIX refcount
#define SH_ELEMENT_TYPE PrivateRefCountEntry
#define SH_KEY_TYPE Buffer
#define SH_KEY buffer
-#define SH_HASH_KEY(tb, key) murmurhash32(key)
+#define SH_HASH_KEY(tb, key) ((uint32) (key))
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
-#define SH_DEFINE
-#define SH_DECLARE
-/* Use buffer field as empty marker - no separate status needed */
#define SH_ENTRY_EMPTY(entry) ((entry)->buffer == InvalidBuffer)
#define SH_MAKE_EMPTY(entry) ((entry)->buffer = InvalidBuffer)
-#define SH_MAKE_IN_USE(entry) ((void)0) /* key assignment handles this */
+#define SH_MAKE_IN_USE(entry) ((void) 0)
+#define SH_DECLARE
+#define SH_DEFINE
#include "lib/simplehash.h"
-struct PrivateRefCountIterator
-{
- int array_index;
- bool in_hash;
- refcount_iterator hash_iter;
- bool hash_iter_valid;
-};
-
/* Private refcount array and keys */
#define REFCOUNT_ARRAY_ENTRIES 8
static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
@@ -113,7 +114,7 @@ static uint32 MaxProportionalPins = 0;
/* Forward declarations */
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
/*
* Initialize private refcount tracking for this backend.
@@ -132,7 +133,7 @@ InitPrivateRefCount(void)
memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
- PrivateRefCountHash = refcount_create(CurrentMemoryContext, 16, NULL);
+ PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
}
/*
@@ -158,6 +159,11 @@ ReservePrivateRefCountEntry(void)
if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
{
ReservedRefCountSlot = i;
+
+ /*
+ * We could return immediately, but iterating till the end of
+ * the array allows compiler-autovectorization.
+ */
}
}
@@ -195,10 +201,7 @@ ReservePrivateRefCountEntry(void)
/* clear the now free array slot */
PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
victim_entry->buffer = InvalidBuffer;
-
- memset(&victim_entry->data, 0, sizeof(victim_entry->data));
- victim_entry->data.refcount = 0;
- victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+ victim_entry->data = 0;
PrivateRefCountOverflowed++;
}
@@ -221,8 +224,7 @@ NewPrivateRefCountEntry(Buffer buffer)
/* and fill it */
PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
res->buffer = buffer;
- res->data.refcount = 0;
- res->data.lockmode = BUFFER_LOCK_UNLOCK;
+ res->data = 0;
/* update cache for the next lookup */
PrivateRefCountEntryLast = ReservedRefCountSlot;
@@ -236,7 +238,7 @@ NewPrivateRefCountEntry(Buffer buffer)
* Slow-path for GetSharedBufferEntry().
*/
static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+GetPrivateRefCountEntrySlow(Buffer buffer)
{
PrivateRefCountEntry *res;
int match = -1;
@@ -266,41 +268,11 @@ GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
return NULL;
res = refcount_lookup(PrivateRefCountHash, buffer);
-
- if (res == NULL)
- return NULL;
- else if (!do_move)
- {
- return res;
- }
- else
- {
- /* move buffer from hashtable into the free array slot */
- PrivateRefCountEntry *free;
-
- ReservePrivateRefCountEntry();
-
- Assert(ReservedRefCountSlot != -1);
- free = &PrivateRefCountArray[ReservedRefCountSlot];
- Assert(free->buffer == InvalidBuffer);
-
- free->buffer = buffer;
- free->data = res->data;
- PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
- PrivateRefCountEntryLast = ReservedRefCountSlot;
-
- ReservedRefCountSlot = -1;
-
- refcount_delete_item(PrivateRefCountHash, res);
- Assert(PrivateRefCountOverflowed > 0);
- PrivateRefCountOverflowed--;
-
- return free;
- }
+ return res;
}
/*
- * Return the PrivateRefCountEntry for the passed buffer.
+ * Return the PrivateRefCount entry for the passed buffer.
* Returns NULL if the buffer is not currently pinned.
*/
PrivateRefCountEntry *
@@ -316,7 +288,7 @@ GetSharedBufferEntry(Buffer buffer)
return &PrivateRefCountArray[PrivateRefCountEntryLast];
}
- return GetPrivateRefCountEntrySlow(buffer, false);
+ return GetPrivateRefCountEntrySlow(buffer);
}
/*
@@ -332,25 +304,20 @@ SharedBufferRef(Buffer buffer)
Assert(BufferIsValid(buffer));
Assert(!BufferIsLocal(buffer));
- /* Check cache first, then slow path */
- if (likely(PrivateRefCountEntryLast != -1) &&
- likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
- {
- ref = &PrivateRefCountArray[PrivateRefCountEntryLast];
- }
- else
- {
- ref = GetPrivateRefCountEntrySlow(buffer, true);
- }
+ ref = GetSharedBufferEntry(buffer);
if (ref == NULL)
{
/* New pin - create entry */
ReservePrivateRefCountEntry();
ref = NewPrivateRefCountEntry(buffer);
+ ref->data = ONE_PRIVATE_REFERENCE;
+ }
+ else
+ {
+ /* Already pinned - increment */
+ ref->data += ONE_PRIVATE_REFERENCE;
}
-
- ref->data.refcount++;
return ref;
}
@@ -363,8 +330,8 @@ void
SharedBufferRefExisting(PrivateRefCountEntry *ref)
{
Assert(ref != NULL);
- Assert(ref->data.refcount > 0);
- ref->data.refcount++;
+ Assert(!PrivateRefCountIsZero(ref->data));
+ ref->data += ONE_PRIVATE_REFERENCE;
}
/*
@@ -376,14 +343,14 @@ bool
SharedBufferUnref(PrivateRefCountEntry *ref)
{
Assert(ref != NULL);
- Assert(ref->data.refcount > 0);
+ Assert(!PrivateRefCountIsZero(ref->data));
- ref->data.refcount--;
+ ref->data -= ONE_PRIVATE_REFERENCE;
- if (ref->data.refcount == 0)
+ if (PrivateRefCountIsZero(ref->data))
{
/* No more references - clean up the entry */
- Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+ Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
if (ref >= &PrivateRefCountArray[0] &&
ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
@@ -394,7 +361,8 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
}
else
{
- refcount_delete_item(PrivateRefCountHash, ref);
+ /* could make slightly more efficient by using the pointer */
+ refcount_delete(PrivateRefCountHash, ref->buffer);
Assert(PrivateRefCountOverflowed > 0);
PrivateRefCountOverflowed--;
}
@@ -406,24 +374,24 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
}
/*
- * Accessors for refcount entry fields.
+ * Accessors for refcount entry fields - opaque to callers.
*/
int32
SharedBufferRefCount(PrivateRefCountEntry *ref)
{
- return ref->data.refcount;
+ return PrivateRefCountGetRefCount(ref->data);
}
BufferLockMode
SharedBufferGetLockMode(PrivateRefCountEntry *ref)
{
- return ref->data.lockmode;
+ return PrivateRefCountGetLockMode(ref->data);
}
void
SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
{
- ref->data.lockmode = mode;
+ PrivateRefCountSetLockMode(ref->data, mode);
}
Buffer
@@ -488,7 +456,7 @@ InitPrivateRefCountIterator(void)
iter->array_index = 0;
iter->in_hash = false;
- iter->hash_iter_valid = false;
+ iter->hash_status = NULL;
return iter;
}
@@ -514,20 +482,23 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
iter->in_hash = true;
if (PrivateRefCountOverflowed > 0)
{
- refcount_start_iterate(PrivateRefCountHash, &iter->hash_iter);
- iter->hash_iter_valid = true;
+ refcount_iterator *hiter = palloc(sizeof(refcount_iterator));
+
+ refcount_start_iterate(PrivateRefCountHash, hiter);
+ iter->hash_status = hiter;
}
}
- if (iter->hash_iter_valid)
+ if (iter->hash_status != NULL)
{
PrivateRefCountEntry *res;
- res = refcount_iterate(PrivateRefCountHash, &iter->hash_iter);
+ res = refcount_iterate(PrivateRefCountHash, (refcount_iterator *) iter->hash_status);
if (res != NULL)
return res;
- iter->hash_iter_valid = false;
+ pfree(iter->hash_status);
+ iter->hash_status = NULL;
}
return NULL;
@@ -539,6 +510,8 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
void
FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
{
+ if (iter->hash_status != NULL)
+ pfree(iter->hash_status);
pfree(iter);
}
--
2.53.0
[application/octet-stream] v1-0002-Refactoring-reference-counting.patch (50.6K, 5-v1-0002-Refactoring-reference-counting.patch)
From c8b90725fd033465c68688f4663892ce1196a48e Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 16:31:00 +0000
Subject: [PATCH 2/5] Refactoring reference counting
This patch refactors the reference counting mechanism, moving the
implementation details out of bufmgr. Unfortunately, this
comes with additional call overhead, but I think that the ease
of maintenance will pay off. And with the next optimisations,
we will end up better than before.
---
src/backend/storage/buffer/Makefile | 1 +
src/backend/storage/buffer/buf_refcount.c | 602 ++++++++++++++++++++
src/backend/storage/buffer/bufmgr.c | 661 +++-------------------
src/include/storage/buf_refcount.h | 58 ++
4 files changed, 727 insertions(+), 595 deletions(-)
create mode 100644 src/backend/storage/buffer/buf_refcount.c
create mode 100644 src/include/storage/buf_refcount.h
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index fd7c40dcb08..c81271aabf6 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
buf_init.o \
+ buf_refcount.o \
buf_table.o \
bufmgr.o \
freelist.o \
diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
new file mode 100644
index 00000000000..1c0bec29c93
--- /dev/null
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -0,0 +1,602 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_refcount.c
+ * Backend-private buffer refcount tracking
+ *
+ * Each buffer has a private refcount that keeps track of the number of
+ * times the buffer is pinned in the current process. This is so that the
+ * shared refcount needs to be modified only once if a buffer is pinned more
+ * than once by an individual backend. This mechanism is also used to track
+ * whether this backend has a buffer locked, and, if so, in what mode.
+ *
+ * To avoid - as we used to - requiring an array with NBuffers entries to keep
+ * track of local buffers, we use a small sequentially searched array
+ * (PrivateRefCountArrayKeys, with the corresponding data stored in
+ * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
+ * keep track of backend local pins.
+ *
+ * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
+ * refcounts are kept track of in the array; after that, new array entries
+ * displace old ones into the hash table. That way a frequently used entry
+ * can't get "stuck" in the hashtable while infrequent ones clog the array.
+ *
+ * This was initially designed trying to optimize for the case where the
+ * number of pinned buffers is expected to not exceed REFCOUNT_ARRAY_ENTRIES.
+ * However this might not be the case with the introduction of prefetching.
+ *
+ * To enter a buffer into the refcount tracking mechanism first reserve a free
+ * entry using ReservePrivateRefCountEntry() and then later, if necessary,
+ * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
+ * memory allocations in NewPrivateRefCountEntry() which can be important
+ * because in some scenarios it's called with a spinlock held...
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/storage/buffer/buf_refcount.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "storage/buf_internals.h"
+#include "storage/buf_refcount.h"
+#include "storage/proc.h"
+#include "utils/hsearch.h"
+
+
+
+typedef struct PrivateRefCountData
+{
+ int32 refcount;
+ BufferLockMode lockmode;
+} PrivateRefCountData;
+
+struct PrivateRefCountEntry
+{
+ Buffer buffer;
+ PrivateRefCountData data;
+};
+
+struct PrivateRefCountIterator
+{
+ int array_index;
+ bool in_hash;
+ HASH_SEQ_STATUS *hash_status;
+};
+
+/* Private refcount array and keys */
+#define REFCOUNT_ARRAY_ENTRIES 8
+static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
+static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
+
+/* Overflow hash table for when array is full */
+static HTAB *PrivateRefCountHash = NULL;
+
+/* Count of entries that have overflowed into the hash table */
+static int32 PrivateRefCountOverflowed = 0;
+
+/* Clock hand for selecting victim when array is full */
+static uint32 PrivateRefCountClock = 0;
+
+/* Reserved slot index, or -1 if none reserved */
+static int ReservedRefCountSlot = -1;
+
+/* Cache for last accessed entry */
+static int PrivateRefCountEntryLast = -1;
+
+/* Advisory limit on the number of pins each backend should hold */
+static uint32 MaxProportionalPins = 0;
+
+/* Forward declarations */
+static void ReservePrivateRefCountEntry(void);
+static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
+static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move);
+
+/*
+ * Initialize private refcount tracking for this backend.
+ */
+void
+InitPrivateRefCount(void)
+{
+ HASHCTL hash_ctl;
+
+
+ /*
+ * An advisory limit on the number of pins each backend should hold, based
+ * on shared_buffers and the maximum number of connections possible.
+ * That's very pessimistic, but outside toy-sized shared_buffers it should
+ * allow plenty of pins. LimitAdditionalPins() and
+ * GetAdditionalPinLimit() can be used to check the remaining balance.
+ */
+ MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+
+ memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
+ memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
+
+ hash_ctl.keysize = sizeof(Buffer);
+ hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
+
+ PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
+ HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * Ensure that the PrivateRefCountArray has sufficient space to store one more
+ * entry.
+ */
+static void
+ReservePrivateRefCountEntry(void)
+{
+ /* Already reserved (or freed), nothing to do */
+ if (ReservedRefCountSlot != -1)
+ return;
+
+ /*
+ * First search for a free entry in the array, that'll be sufficient in the
+ * majority of cases.
+ */
+ {
+ int i;
+
+ for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+ {
+ if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
+ {
+ ReservedRefCountSlot = i;
+ }
+ }
+
+ if (ReservedRefCountSlot != -1)
+ return;
+ }
+
+ /*
+ * No luck. All array entries are full. Move one array entry into the hash
+ * table.
+ */
+ {
+ int victim_slot;
+ PrivateRefCountEntry *victim_entry;
+ PrivateRefCountEntry *hashent;
+ bool found;
+
+ /* select victim slot */
+ victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
+ victim_entry = &PrivateRefCountArray[victim_slot];
+ ReservedRefCountSlot = victim_slot;
+
+ /* Better be used, otherwise we shouldn't get here. */
+ Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
+ Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
+ Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
+
+ /* enter victim array entry into hashtable */
+ hashent = hash_search(PrivateRefCountHash,
+ &PrivateRefCountArrayKeys[victim_slot],
+ HASH_ENTER,
+ &found);
+ Assert(!found);
+ hashent->data = victim_entry->data;
+
+ /* clear the now free array slot */
+ PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
+ victim_entry->buffer = InvalidBuffer;
+
+ memset(&victim_entry->data, 0, sizeof(victim_entry->data));
+ victim_entry->data.refcount = 0;
+ victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+ PrivateRefCountOverflowed++;
+ }
+}
+
+/*
+ * Create a new refcount entry for the given buffer.
+ */
+static PrivateRefCountEntry *
+NewPrivateRefCountEntry(Buffer buffer)
+{
+ PrivateRefCountEntry *res;
+
+ /* only allowed to be called when a reservation has been made */
+ Assert(ReservedRefCountSlot != -1);
+
+ /* use up the reserved entry */
+ res = &PrivateRefCountArray[ReservedRefCountSlot];
+
+ /* and fill it */
+ PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+ res->buffer = buffer;
+ res->data.refcount = 0;
+ res->data.lockmode = BUFFER_LOCK_UNLOCK;
+
+ /* update cache for the next lookup */
+ PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+ ReservedRefCountSlot = -1;
+
+ return res;
+}
+
+/*
+ * Slow path shared by GetSharedBufferEntry() and SharedBufferRef().
+ */
+static pg_noinline PrivateRefCountEntry *
+GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
+{
+ PrivateRefCountEntry *res;
+ int match = -1;
+ int i;
+
+ /*
+ * First search for references in the array.
+ */
+ for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+ {
+ if (PrivateRefCountArrayKeys[i] == buffer)
+ {
+ match = i;
+ }
+ }
+
+ if (likely(match != -1))
+ {
+ PrivateRefCountEntryLast = match;
+ return &PrivateRefCountArray[match];
+ }
+
+ /*
+ * Only look up the buffer in the hashtable if we've previously overflowed.
+ */
+ if (PrivateRefCountOverflowed == 0)
+ return NULL;
+
+ res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
+
+ if (res == NULL)
+ return NULL;
+ else if (!do_move)
+ {
+ return res;
+ }
+ else
+ {
+ /* move buffer from hashtable into the free array slot */
+ bool found;
+ PrivateRefCountEntry *free;
+
+ ReservePrivateRefCountEntry();
+
+ Assert(ReservedRefCountSlot != -1);
+ free = &PrivateRefCountArray[ReservedRefCountSlot];
+ Assert(free->buffer == InvalidBuffer);
+
+ free->buffer = buffer;
+ free->data = res->data;
+ PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
+ PrivateRefCountEntryLast = ReservedRefCountSlot;
+
+ ReservedRefCountSlot = -1;
+
+ hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+ Assert(found);
+ Assert(PrivateRefCountOverflowed > 0);
+ PrivateRefCountOverflowed--;
+
+ return free;
+ }
+}
+
+/*
+ * Return the PrivateRefCountEntry for the passed buffer.
+ * Returns NULL if the buffer is not currently pinned.
+ */
+PrivateRefCountEntry *
+GetSharedBufferEntry(Buffer buffer)
+{
+ Assert(BufferIsValid(buffer));
+ Assert(!BufferIsLocal(buffer));
+
+ /* Fast path: check one-entry cache */
+ if (likely(PrivateRefCountEntryLast != -1) &&
+ likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
+ {
+ return &PrivateRefCountArray[PrivateRefCountEntryLast];
+ }
+
+ return GetPrivateRefCountEntrySlow(buffer, false);
+}
+
+/*
+ * Increment the private refcount for a shared buffer.
+ * Creates a new entry if one doesn't exist.
+ * Returns the entry pointer.
+ */
+PrivateRefCountEntry *
+SharedBufferRef(Buffer buffer)
+{
+ PrivateRefCountEntry *ref;
+
+ Assert(BufferIsValid(buffer));
+ Assert(!BufferIsLocal(buffer));
+
+ /* Check cache first, then slow path */
+ if (likely(PrivateRefCountEntryLast != -1) &&
+ likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
+ {
+ ref = &PrivateRefCountArray[PrivateRefCountEntryLast];
+ }
+ else
+ {
+ ref = GetPrivateRefCountEntrySlow(buffer, true);
+ }
+
+ if (ref == NULL)
+ {
+ /* New pin - create entry */
+ ReservePrivateRefCountEntry();
+ ref = NewPrivateRefCountEntry(buffer);
+ }
+
+ ref->data.refcount++;
+
+ return ref;
+}
+
+/*
+ * Increment the private refcount for an existing entry.
+ * Use when you already have the entry from a previous lookup.
+ */
+void
+SharedBufferRefExisting(PrivateRefCountEntry *ref)
+{
+ Assert(ref != NULL);
+ Assert(ref->data.refcount > 0);
+ ref->data.refcount++;
+}
+
+/*
+ * Decrement the private refcount for a buffer.
+ * If the refcount reaches zero, removes the entry and returns true.
+ * Returns false if the buffer still has references.
+ */
+bool
+SharedBufferUnref(PrivateRefCountEntry *ref)
+{
+ Assert(ref != NULL);
+ Assert(ref->data.refcount > 0);
+
+ ref->data.refcount--;
+
+ if (ref->data.refcount == 0)
+ {
+ /* No more references - clean up the entry */
+ Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
+
+ if (ref >= &PrivateRefCountArray[0] &&
+ ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
+ {
+ ref->buffer = InvalidBuffer;
+ PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
+ ReservedRefCountSlot = ref - PrivateRefCountArray;
+ }
+ else
+ {
+ bool found;
+ Buffer buffer = ref->buffer;
+
+ hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
+ Assert(found);
+ Assert(PrivateRefCountOverflowed > 0);
+ PrivateRefCountOverflowed--;
+ }
+
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Accessors for refcount entry fields.
+ */
+int32
+SharedBufferRefCount(PrivateRefCountEntry *ref)
+{
+ return ref->data.refcount;
+}
+
+BufferLockMode
+SharedBufferGetLockMode(PrivateRefCountEntry *ref)
+{
+ return ref->data.lockmode;
+}
+
+void
+SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode)
+{
+ ref->data.lockmode = mode;
+}
+
+Buffer
+SharedBufferGetBuffer(PrivateRefCountEntry *ref)
+{
+ return ref->buffer;
+}
+
+/*
+ * Check for buffer refcount leaks.
+ */
+void
+CheckPrivateRefCountLeaks(void)
+{
+#ifdef USE_ASSERT_CHECKING
+ int RefCountErrors = 0;
+ PrivateRefCountEntry *res;
+ int i;
+ char *s;
+
+ /* check the array */
+ for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+ {
+ if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
+ {
+ res = &PrivateRefCountArray[i];
+
+ s = DebugPrintBufferRefcount(res->buffer);
+ elog(WARNING, "buffer refcount leak: %s", s);
+ pfree(s);
+
+ RefCountErrors++;
+ }
+ }
+
+ /* if necessary search the hash */
+ if (PrivateRefCountOverflowed)
+ {
+ HASH_SEQ_STATUS hstat;
+
+ hash_seq_init(&hstat, PrivateRefCountHash);
+ while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ s = DebugPrintBufferRefcount(res->buffer);
+ elog(WARNING, "buffer refcount leak: %s", s);
+ pfree(s);
+ RefCountErrors++;
+ }
+ }
+
+ Assert(RefCountErrors == 0);
+#endif
+}
+
+/*
+ * Initialize an iterator for walking all private refcount entries.
+ */
+PrivateRefCountIterator *
+InitPrivateRefCountIterator(void)
+{
+ PrivateRefCountIterator *iter = palloc(sizeof(PrivateRefCountIterator));
+
+ iter->array_index = 0;
+ iter->in_hash = false;
+ iter->hash_status = NULL;
+ return iter;
+}
+
+/*
+ * Get the next private refcount entry.
+ * Returns NULL when iteration is complete.
+ */
+PrivateRefCountEntry *
+GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
+{
+ /* First iterate through the array */
+ while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
+ {
+ int idx = iter->array_index++;
+
+ if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
+ return &PrivateRefCountArray[idx];
+ }
+
+ /* Then iterate through the hash if there are overflowed entries */
+ if (!iter->in_hash)
+ {
+ iter->in_hash = true;
+ if (PrivateRefCountOverflowed > 0)
+ {
+ iter->hash_status = palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(iter->hash_status, PrivateRefCountHash);
+ }
+ }
+
+ if (iter->hash_status != NULL)
+ {
+ PrivateRefCountEntry *res;
+
+ res = (PrivateRefCountEntry *) hash_seq_search(iter->hash_status);
+ if (res != NULL)
+ return res;
+
+ pfree(iter->hash_status);
+ iter->hash_status = NULL;
+ }
+
+ return NULL;
+}
+
+/*
+ * Free an iterator from InitPrivateRefCountIterator.
+ */
+void
+FreePrivateRefCountIterator(PrivateRefCountIterator *iter)
+{
+ if (iter->hash_status != NULL)
+ {
+ hash_seq_term(iter->hash_status);
+ pfree(iter->hash_status);
+ }
+ pfree(iter);
+}
+
+
+/*
+ * Return the maximum number of buffers that a backend should try to pin at once,
+ * to avoid exceeding its fair share. This is the highest value that
+ * GetAdditionalPinLimit() could ever return. Note that it may be zero on a
+ * system with a very small buffer pool relative to max_connections.
+ */
+ uint32
+ GetPinLimit(void)
+ {
+ return MaxProportionalPins;
+ }
+
+ /*
+ * Return the maximum number of additional buffers that this backend should
+ * pin if it wants to stay under the per-backend limit, considering the number
+ * of buffers it has already pinned. Unlike LimitAdditionalPins(), the limit
+ * returned by this function can be zero.
+ */
+ uint32
+ GetAdditionalPinLimit(void)
+ {
+ uint32 estimated_pins_held;
+
+ /*
+ * We get the number of "overflowed" pins for free, but don't know the
+ * number of pins in PrivateRefCountArray. The cost of calculating that
+ * exactly doesn't seem worth it, so just assume the max.
+ */
+ estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ /* Is this backend already holding more than its fair share? */
+ if (estimated_pins_held > MaxProportionalPins)
+ return 0;
+
+ return MaxProportionalPins - estimated_pins_held;
+ }
+
+ /*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, on the assumption that the operation
+ * requires at least one to make progress.
+ */
+ void
+ LimitAdditionalPins(uint32 *additional_pins)
+ {
+ uint32 limit;
+
+ if (*additional_pins <= 1)
+ return;
+
+ limit = GetAdditionalPinLimit();
+ limit = Max(limit, 1);
+ if (limit < *additional_pins)
+ *additional_pins = limit;
+ }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5f3d083e938..aa99e97e286 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -54,6 +54,7 @@
#include "storage/aio.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
+#include "storage/buf_refcount.h"
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -93,43 +94,6 @@
*/
#define BUF_DROP_FULL_SCAN_THRESHOLD (uint64) (NBuffers / 32)
-/*
- * This is separated out from PrivateRefCountEntry to allow for copying all
- * the data members via struct assignment.
- */
-typedef struct PrivateRefCountData
-{
- /*
- * How many times has the buffer been pinned by this backend.
- */
- int32 refcount;
-
- /*
- * Is the buffer locked by this backend? BUFFER_LOCK_UNLOCK indicates that
- * the buffer is not locked.
- */
- BufferLockMode lockmode;
-} PrivateRefCountData;
-
-typedef struct PrivateRefCountEntry
-{
- /*
- * Note that this needs to be same as the entry's corresponding
- * PrivateRefCountArrayKeys[i], if the entry is stored in the array. We
- * store it in both places as this is used for the hashtable key and
- * because it is more convenient (passing around a PrivateRefCountEntry
- * suffices to identify the buffer) and faster (checking the keys array is
- * faster when checking many entries, checking the entry is faster if just
- * checking a single entry).
- */
- Buffer buffer;
-
- PrivateRefCountData data;
-} PrivateRefCountEntry;
-
-/* 64 bytes, about the size of a cache line on common systems */
-#define REFCOUNT_ARRAY_ENTRIES 8
-
/*
* Status of buffers to checkpoint for a particular tablespace, used
* internally in BufferSync.
@@ -213,55 +177,6 @@ int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
-/*
- * Backend-Private refcount management:
- *
- * Each buffer also has a private refcount that keeps track of the number of
- * times the buffer is pinned in the current process. This is so that the
- * shared refcount needs to be modified only once if a buffer is pinned more
- * than once by an individual backend. It's also used to check that no
- * buffers are still pinned at the end of transactions and when exiting. We
- * also use this mechanism to track whether this backend has a buffer locked,
- * and, if so, in what mode.
- *
- *
- * To avoid - as we used to - requiring an array with NBuffers entries to keep
- * track of local buffers, we use a small sequentially searched array
- * (PrivateRefCountArrayKeys, with the corresponding data stored in
- * PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
- * keep track of backend local pins.
- *
- * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
- * refcounts are kept track of in the array; after that, new array entries
- * displace old ones into the hash table. That way a frequently used entry
- * can't get "stuck" in the hashtable while infrequent ones clog the array.
- *
- * Note that in most scenarios the number of pinned buffers will not exceed
- * REFCOUNT_ARRAY_ENTRIES.
- *
- *
- * To enter a buffer into the refcount tracking mechanism first reserve a free
- * entry using ReservePrivateRefCountEntry() and then later, if necessary,
- * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
- * memory allocations in NewPrivateRefCountEntry() which can be important
- * because in some scenarios it's called with a spinlock held...
- */
-static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
-static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
-static HTAB *PrivateRefCountHash = NULL;
-static int32 PrivateRefCountOverflowed = 0;
-static uint32 PrivateRefCountClock = 0;
-static int ReservedRefCountSlot = -1;
-static int PrivateRefCountEntryLast = -1;
-
-static uint32 MaxProportionalPins;
-
-static void ReservePrivateRefCountEntry(void);
-static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
-static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
-static inline int32 GetPrivateRefCount(Buffer buffer);
-static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
-
/* ResourceOwner callbacks to hold in-progress I/Os and buffer pins */
static void ResOwnerReleaseBufferIO(Datum res);
static char *ResOwnerPrintBufferIO(Datum res);
@@ -286,301 +201,6 @@ const ResourceOwnerDesc buffer_resowner_desc =
.DebugPrint = ResOwnerPrintBuffer
};
-/*
- * Ensure that the PrivateRefCountArray has sufficient space to store one more
- * entry. This has to be called before using NewPrivateRefCountEntry() to fill
- * a new entry - but it's perfectly fine to not use a reserved entry.
- */
-static void
-ReservePrivateRefCountEntry(void)
-{
- /* Already reserved (or freed), nothing to do */
- if (ReservedRefCountSlot != -1)
- return;
-
- /*
- * First search for a free entry the array, that'll be sufficient in the
- * majority of cases.
- */
- {
- int i;
-
- for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
- {
- if (PrivateRefCountArrayKeys[i] == InvalidBuffer)
- {
- ReservedRefCountSlot = i;
-
- /*
- * We could return immediately, but iterating till the end of
- * the array allows compiler-autovectorization.
- */
- }
- }
-
- if (ReservedRefCountSlot != -1)
- return;
- }
-
- /*
- * No luck. All array entries are full. Move one array entry into the hash
- * table.
- */
- {
- /*
- * Move entry from the current clock position in the array into the
- * hashtable. Use that slot.
- */
- int victim_slot;
- PrivateRefCountEntry *victim_entry;
- PrivateRefCountEntry *hashent;
- bool found;
-
- /* select victim slot */
- victim_slot = PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES;
- victim_entry = &PrivateRefCountArray[victim_slot];
- ReservedRefCountSlot = victim_slot;
-
- /* Better be used, otherwise we shouldn't get here. */
- Assert(PrivateRefCountArrayKeys[victim_slot] != InvalidBuffer);
- Assert(PrivateRefCountArray[victim_slot].buffer != InvalidBuffer);
- Assert(PrivateRefCountArrayKeys[victim_slot] == PrivateRefCountArray[victim_slot].buffer);
-
- /* enter victim array entry into hashtable */
- hashent = hash_search(PrivateRefCountHash,
- &PrivateRefCountArrayKeys[victim_slot],
- HASH_ENTER,
- &found);
- Assert(!found);
- /* move data from the entry in the array to the hash entry */
- hashent->data = victim_entry->data;
-
- /* clear the now free array slot */
- PrivateRefCountArrayKeys[victim_slot] = InvalidBuffer;
- victim_entry->buffer = InvalidBuffer;
-
- /* clear the whole data member, just for future proofing */
- memset(&victim_entry->data, 0, sizeof(victim_entry->data));
- victim_entry->data.refcount = 0;
- victim_entry->data.lockmode = BUFFER_LOCK_UNLOCK;
-
- PrivateRefCountOverflowed++;
- }
-}
-
-/*
- * Fill a previously reserved refcount entry.
- */
-static PrivateRefCountEntry *
-NewPrivateRefCountEntry(Buffer buffer)
-{
- PrivateRefCountEntry *res;
-
- /* only allowed to be called when a reservation has been made */
- Assert(ReservedRefCountSlot != -1);
-
- /* use up the reserved entry */
- res = &PrivateRefCountArray[ReservedRefCountSlot];
-
- /* and fill it */
- PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
- res->buffer = buffer;
- res->data.refcount = 0;
- res->data.lockmode = BUFFER_LOCK_UNLOCK;
-
- /* update cache for the next lookup */
- PrivateRefCountEntryLast = ReservedRefCountSlot;
-
- ReservedRefCountSlot = -1;
-
- return res;
-}
-
-/*
- * Slow-path for GetPrivateRefCountEntry(). This is big enough to not be worth
- * inlining. This particularly seems to be true if the compiler is capable of
- * auto-vectorizing the code, as that imposes additional stack-alignment
- * requirements etc.
- */
-static pg_noinline PrivateRefCountEntry *
-GetPrivateRefCountEntrySlow(Buffer buffer, bool do_move)
-{
- PrivateRefCountEntry *res;
- int match = -1;
- int i;
-
- /*
- * First search for references in the array, that'll be sufficient in the
- * majority of cases.
- */
- for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
- {
- if (PrivateRefCountArrayKeys[i] == buffer)
- {
- match = i;
- /* see ReservePrivateRefCountEntry() for why we don't return */
- }
- }
-
- if (likely(match != -1))
- {
- /* update cache for the next lookup */
- PrivateRefCountEntryLast = match;
-
- return &PrivateRefCountArray[match];
- }
-
- /*
- * By here we know that the buffer, if already pinned, isn't residing in
- * the array.
- *
- * Only look up the buffer in the hashtable if we've previously overflowed
- * into it.
- */
- if (PrivateRefCountOverflowed == 0)
- return NULL;
-
- res = hash_search(PrivateRefCountHash, &buffer, HASH_FIND, NULL);
-
- if (res == NULL)
- return NULL;
- else if (!do_move)
- {
- /* caller doesn't want us to move the hash entry into the array */
- return res;
- }
- else
- {
- /* move buffer from hashtable into the free array slot */
- bool found;
- PrivateRefCountEntry *free;
-
- /* Ensure there's a free array slot */
- ReservePrivateRefCountEntry();
-
- /* Use up the reserved slot */
- Assert(ReservedRefCountSlot != -1);
- free = &PrivateRefCountArray[ReservedRefCountSlot];
- Assert(PrivateRefCountArrayKeys[ReservedRefCountSlot] == free->buffer);
- Assert(free->buffer == InvalidBuffer);
-
- /* and fill it */
- free->buffer = buffer;
- free->data = res->data;
- PrivateRefCountArrayKeys[ReservedRefCountSlot] = buffer;
- /* update cache for the next lookup */
- PrivateRefCountEntryLast = match;
-
- ReservedRefCountSlot = -1;
-
-
- /* delete from hashtable */
- hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
- Assert(found);
- Assert(PrivateRefCountOverflowed > 0);
- PrivateRefCountOverflowed--;
-
- return free;
- }
-}
-
-/*
- * Return the PrivateRefCount entry for the passed buffer.
- *
- * Returns NULL if a buffer doesn't have a refcount entry. Otherwise, if
- * do_move is true, and the entry resides in the hashtable the entry is
- * optimized for frequent access by moving it to the array.
- */
-static inline PrivateRefCountEntry *
-GetPrivateRefCountEntry(Buffer buffer, bool do_move)
-{
- Assert(BufferIsValid(buffer));
- Assert(!BufferIsLocal(buffer));
-
- /*
- * It's very common to look up the same buffer repeatedly. To make that
- * fast, we have a one-entry cache.
- *
- * In contrast to the loop in GetPrivateRefCountEntrySlow(), here it
- * faster to check PrivateRefCountArray[].buffer, as in the case of a hit
- * fewer addresses are computed and fewer cachelines are accessed. Whereas
- * in GetPrivateRefCountEntrySlow()'s case, checking
- * PrivateRefCountArrayKeys saves a lot of memory accesses.
- */
- if (likely(PrivateRefCountEntryLast != -1) &&
- likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
- {
- return &PrivateRefCountArray[PrivateRefCountEntryLast];
- }
-
- /*
- * The code for the cached lookup is small enough to be worth inlining
- * into the caller. In the miss case however, that empirically doesn't
- * seem worth it.
- */
- return GetPrivateRefCountEntrySlow(buffer, do_move);
-}
-
-/*
- * Returns how many times the passed buffer is pinned by this backend.
- *
- * Only works for shared memory buffers!
- */
-static inline int32
-GetPrivateRefCount(Buffer buffer)
-{
- PrivateRefCountEntry *ref;
-
- Assert(BufferIsValid(buffer));
- Assert(!BufferIsLocal(buffer));
-
- /*
- * Not moving the entry - that's ok for the current users, but we might
- * want to change this one day.
- */
- ref = GetPrivateRefCountEntry(buffer, false);
-
- if (ref == NULL)
- return 0;
- return ref->data.refcount;
-}
-
-/*
- * Release resources used to track the reference count of a buffer which we no
- * longer have pinned and don't want to pin again immediately.
- */
-static void
-ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
-{
- Assert(ref->data.refcount == 0);
- Assert(ref->data.lockmode == BUFFER_LOCK_UNLOCK);
-
- if (ref >= &PrivateRefCountArray[0] &&
- ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
- {
- ref->buffer = InvalidBuffer;
- PrivateRefCountArrayKeys[ref - PrivateRefCountArray] = InvalidBuffer;
-
-
- /*
- * Mark the just used entry as reserved - in many scenarios that
- * allows us to avoid ever having to search the array/hash for free
- * entries.
- */
- ReservedRefCountSlot = ref - PrivateRefCountArray;
- }
- else
- {
- bool found;
- Buffer buffer = ref->buffer;
-
- hash_search(PrivateRefCountHash, &buffer, HASH_REMOVE, &found);
- Assert(found);
- Assert(PrivateRefCountOverflowed > 0);
- PrivateRefCountOverflowed--;
- }
-}
-
/*
* BufferIsPinned
* True iff the buffer is pinned (also checks for valid buffer number).
@@ -596,7 +216,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
BufferIsLocal(bufnum) ? \
(LocalRefCount[-(bufnum) - 1] > 0) \
: \
- (GetPrivateRefCount(bufnum) > 0) \
+ (GetSharedBufferEntry(bufnum) != NULL) \
)
@@ -653,7 +273,6 @@ static void RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
RelFileLocator dstlocator,
ForkNumber forkNum, bool permanent);
static void AtProcExit_Buffers(int code, Datum arg);
-static void CheckForBufferLeaks(void);
#ifdef USE_ASSERT_CHECKING
static void AssertNotCatalogBufferLock(Buffer buffer, BufferLockMode mode);
#endif
@@ -812,7 +431,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
Assert(BufferIsValid(recent_buffer));
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
InitBufferTag(&tag, &rlocator, forkNum, blockNum);
if (BufferIsLocal(recent_buffer))
@@ -2115,7 +1733,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -2327,7 +1944,7 @@ retry:
UnlockBufHdr(buf);
LWLockRelease(oldPartitionLock);
/* safety check: should definitely not be our *own* pin */
- if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
+ if (GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) != NULL)
elog(ERROR, "buffer is pinned in InvalidateBuffer");
WaitIO(buf);
goto retry;
@@ -2380,7 +1997,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
LWLock *partition_lock;
BufferTag tag;
- Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+ Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr)) != NULL);
/* have buffer pinned, so it's safe to read tag without lock */
tag = buf_hdr->tag;
@@ -2461,7 +2078,6 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
* Ensure, before we pin a victim buffer, that there's a free refcount
* entry and resource owner slot for the pin.
*/
- ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
/* we return here if a prospective victim buffer gets used concurrently */
@@ -2595,64 +2211,6 @@ again:
return buf;
}
-/*
- * Return the maximum number of buffers that a backend should try to pin once,
- * to avoid exceeding its fair share. This is the highest value that
- * GetAdditionalPinLimit() could ever return. Note that it may be zero on a
- * system with a very small buffer pool relative to max_connections.
- */
-uint32
-GetPinLimit(void)
-{
- return MaxProportionalPins;
-}
-
-/*
- * Return the maximum number of additional buffers that this backend should
- * pin if it wants to stay under the per-backend limit, considering the number
- * of buffers it has already pinned. Unlike LimitAdditionalPins(), the limit
- * return by this function can be zero.
- */
-uint32
-GetAdditionalPinLimit(void)
-{
- uint32 estimated_pins_held;
-
- /*
- * We get the number of "overflowed" pins for free, but don't know the
- * number of pins in PrivateRefCountArray. The cost of calculating that
- * exactly doesn't seem worth it, so just assume the max.
- */
- estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
-
- /* Is this backend already holding more than its fair share? */
- if (estimated_pins_held > MaxProportionalPins)
- return 0;
-
- return MaxProportionalPins - estimated_pins_held;
-}
-
-/*
- * Limit the number of pins a batch operation may additionally acquire, to
- * avoid running out of pinnable buffers.
- *
- * One additional pin is always allowed, on the assumption that the operation
- * requires at least one to make progress.
- */
-void
-LimitAdditionalPins(uint32 *additional_pins)
-{
- uint32 limit;
-
- if (*additional_pins <= 1)
- return;
-
- limit = GetAdditionalPinLimit();
- limit = Max(limit, 1);
- if (limit < *additional_pins)
- *additional_pins = limit;
-}
-
/*
* Logic shared between ExtendBufferedRelBy(), ExtendBufferedRelTo(). Just to
* avoid duplicating the tracing and relpersistence related logic.
@@ -2816,7 +2374,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
/* in case we need to pin an existing buffer below */
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
InitBufferTag(&tag, &BMR_GET_SMGR(bmr)->smgr_rlocator.locator, fork,
first_block + i);
@@ -3188,9 +2745,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
PrivateRefCountEntry *ref;
Assert(!BufferIsLocal(b));
- Assert(ReservedRefCountSlot != -1);
- ref = GetPrivateRefCountEntry(b, true);
+ ref = GetSharedBufferEntry(b);
if (ref == NULL)
{
@@ -3260,8 +2816,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
*/
result = (pg_atomic_read_u64(&buf->state) & BM_VALID) != 0;
- Assert(ref->data.refcount > 0);
- ref->data.refcount++;
+ SharedBufferRefExisting(ref);
ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
}
@@ -3299,7 +2854,7 @@ PinBuffer_Locked(BufferDesc *buf)
* As explained, We don't expect any preexisting pins. That allows us to
* manipulate the PrivateRefCount after releasing the spinlock
*/
- Assert(GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf), false) == NULL);
+ Assert(GetSharedBufferEntry(BufferDescriptorGetBuffer(buf)) == NULL);
/*
* Since we hold the buffer spinlock, we can update the buffer state and
@@ -3376,11 +2931,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
Assert(!BufferIsLocal(b));
/* not moving as we're likely deleting it soon anyway */
- ref = GetPrivateRefCountEntry(b, false);
+ ref = GetSharedBufferEntry(b);
Assert(ref != NULL);
- Assert(ref->data.refcount > 0);
- ref->data.refcount--;
- if (ref->data.refcount == 0)
+
+ if (SharedBufferUnref(ref))
{
uint64 old_buf_state;
@@ -3405,8 +2959,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
/* Support LockBufferForCleanup() */
if (old_buf_state & BM_PIN_COUNT_WAITER)
WakePinCountWaiter(buf);
-
- ForgetPrivateRefCountEntry(ref);
}
}
@@ -3417,10 +2969,7 @@ UnpinBufferNoOwner(BufferDesc *buf)
inline void
TrackNewBufferPin(Buffer buf)
{
- PrivateRefCountEntry *ref;
-
- ref = NewPrivateRefCountEntry(buf);
- ref->data.refcount++;
+ SharedBufferRef(buf);
ResourceOwnerRememberBuffer(CurrentResourceOwner, buf);
@@ -4040,7 +3589,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
BufferTag tag;
/* Make sure we can handle the pin */
- ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
@@ -4104,11 +3652,9 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
void
AtEOXact_Buffers(bool isCommit)
{
- CheckForBufferLeaks();
+ CheckPrivateRefCountLeaks();
AtEOXact_LocalBuffers(isCommit);
-
- Assert(PrivateRefCountOverflowed == 0);
}
/*
@@ -4121,25 +3667,8 @@ AtEOXact_Buffers(bool isCommit)
void
InitBufferManagerAccess(void)
{
- HASHCTL hash_ctl;
-
- /*
- * An advisory limit on the number of pins each backend should hold, based
- * on shared_buffers and the maximum number of connections possible.
- * That's very pessimistic, but outside toy-sized shared_buffers it should
- * allow plenty of pins. LimitAdditionalPins() and
- * GetAdditionalPinLimit() can be used to check the remaining balance.
- */
- MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
-
- memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
- memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
-
- hash_ctl.keysize = sizeof(Buffer);
- hash_ctl.entrysize = sizeof(PrivateRefCountEntry);
- PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
- HASH_ELEM | HASH_BLOBS);
+ InitPrivateRefCount();
/*
* AtProcExit_Buffers needs LWLock access, and thereby has to be called at
@@ -4158,62 +3687,12 @@ AtProcExit_Buffers(int code, Datum arg)
{
UnlockBuffers();
- CheckForBufferLeaks();
+ CheckPrivateRefCountLeaks();
/* localbuf.c needs a chance too */
AtProcExit_LocalBuffers();
}
-/*
- * CheckForBufferLeaks - ensure this backend holds no buffer pins
- *
- * As of PostgreSQL 8.0, buffer pins should get released by the
- * ResourceOwner mechanism. This routine is just a debugging
- * cross-check that no pins remain.
- */
-static void
-CheckForBufferLeaks(void)
-{
-#ifdef USE_ASSERT_CHECKING
- int RefCountErrors = 0;
- PrivateRefCountEntry *res;
- int i;
- char *s;
-
- /* check the array */
- for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
- {
- if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
- {
- res = &PrivateRefCountArray[i];
-
- s = DebugPrintBufferRefcount(res->buffer);
- elog(WARNING, "buffer refcount leak: %s", s);
- pfree(s);
-
- RefCountErrors++;
- }
- }
-
- /* if necessary search the hash */
- if (PrivateRefCountOverflowed)
- {
- HASH_SEQ_STATUS hstat;
-
- hash_seq_init(&hstat, PrivateRefCountHash);
- while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
- {
- s = DebugPrintBufferRefcount(res->buffer);
- elog(WARNING, "buffer refcount leak: %s", s);
- pfree(s);
- RefCountErrors++;
- }
- }
-
- Assert(RefCountErrors == 0);
-#endif
-}
-
#ifdef USE_ASSERT_CHECKING
/*
* Check for exclusive-locked catalog buffers. This is the core of
@@ -4235,33 +3714,20 @@ CheckForBufferLeaks(void)
void
AssertBufferLocksPermitCatalogRead(void)
{
+ PrivateRefCountIterator *iter;
PrivateRefCountEntry *res;
- /* check the array */
- for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+ iter = InitPrivateRefCountIterator();
+ while ((res = GetNextPrivateRefCountEntry(iter)) != NULL)
{
- if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
- {
- res = &PrivateRefCountArray[i];
-
- if (res->buffer == InvalidBuffer)
- continue;
-
- AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
- }
- }
+ Buffer buf = SharedBufferGetBuffer(res);
- /* if necessary search the hash */
- if (PrivateRefCountOverflowed)
- {
- HASH_SEQ_STATUS hstat;
+ if (buf == InvalidBuffer)
+ continue;
- hash_seq_init(&hstat, PrivateRefCountHash);
- while ((res = (PrivateRefCountEntry *) hash_seq_search(&hstat)) != NULL)
- {
- AssertNotCatalogBufferLock(res->buffer, res->data.lockmode);
- }
+ AssertNotCatalogBufferLock(buf, SharedBufferGetLockMode(res));
}
+ FreePrivateRefCountIterator(iter);
}
static void
@@ -4315,8 +3781,10 @@ DebugPrintBufferRefcount(Buffer buffer)
}
else
{
+ PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
buf = GetBufferDescriptor(buffer - 1);
- loccount = GetPrivateRefCount(buffer);
+ loccount = ref ? SharedBufferRefCount(ref) : 0;
backend = INVALID_PROC_NUMBER;
}
@@ -5102,7 +4570,6 @@ FlushRelationBuffers(Relation rel)
error_context_stack = &errcallback;
/* Make sure we can handle the pin */
- ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
/*
@@ -5138,7 +4605,6 @@ FlushRelationBuffers(Relation rel)
continue;
/* Make sure we can handle the pin */
- ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
buf_state = LockBufHdr(bufHdr);
@@ -5233,7 +4699,6 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
continue;
/* Make sure we can handle the pin */
- ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
buf_state = LockBufHdr(bufHdr);
@@ -5459,7 +4924,6 @@ FlushDatabaseBuffers(Oid dbid)
continue;
/* Make sure we can handle the pin */
- ReservePrivateRefCountEntry();
ResourceOwnerEnlarge(CurrentResourceOwner);
buf_state = LockBufHdr(bufHdr);
@@ -5534,17 +4998,18 @@ UnlockReleaseBuffer(Buffer buffer)
void
IncrBufferRefCount(Buffer buffer)
{
- Assert(BufferIsPinned(buffer));
ResourceOwnerEnlarge(CurrentResourceOwner);
if (BufferIsLocal(buffer))
+ {
+ Assert(LocalRefCount[-buffer - 1] > 0);
LocalRefCount[-buffer - 1]++;
+ }
else
{
- PrivateRefCountEntry *ref;
+ PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
- ref = GetPrivateRefCountEntry(buffer, true);
Assert(ref != NULL);
- ref->data.refcount++;
+ SharedBufferRefExisting(ref);
}
ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);
}
@@ -5580,7 +5045,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
bufHdr = GetBufferDescriptor(buffer - 1);
- Assert(GetPrivateRefCount(buffer) > 0);
+ Assert(GetSharedBufferEntry(buffer) != NULL);
/* here, either share or exclusive lock is OK */
Assert(BufferIsLockedByMe(buffer));
@@ -5763,12 +5228,12 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
* Get reference to the refcount entry before we hold the lock, it seems
* better to do before holding the lock.
*/
- entry = GetPrivateRefCountEntry(buffer, true);
+ entry = GetSharedBufferEntry(buffer);
/*
* We better not already hold a lock on the buffer.
*/
- Assert(entry->data.lockmode == BUFFER_LOCK_UNLOCK);
+ Assert(SharedBufferGetLockMode(entry) == BUFFER_LOCK_UNLOCK);
/*
* Lock out cancel/die interrupts until we exit the code section protected
@@ -5857,7 +5322,7 @@ BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
}
/* Remember that we now hold this lock */
- entry->data.lockmode = mode;
+ SharedBufferSetLockMode(entry, mode);
/*
* Fix the process wait semaphore's count for any absorbed wakeups.
@@ -5908,7 +5373,7 @@ BufferLockUnlock(Buffer buffer, BufferDesc *buf_hdr)
static bool
BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
{
- PrivateRefCountEntry *entry = GetPrivateRefCountEntry(buffer, true);
+ PrivateRefCountEntry *entry = GetSharedBufferEntry(buffer);
bool mustwait;
/*
@@ -5916,7 +5381,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
* already has locked, return false, independent of the existing and
* desired lock level.
*/
- if (entry->data.lockmode != BUFFER_LOCK_UNLOCK)
+ if (SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK)
return false;
/*
@@ -5936,7 +5401,7 @@ BufferLockConditional(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
}
else
{
- entry->data.lockmode = mode;
+ SharedBufferSetLockMode(entry, mode);
}
return !mustwait;
@@ -6146,11 +5611,11 @@ BufferLockDisownInternal(Buffer buffer, BufferDesc *buf_hdr)
BufferLockMode mode;
PrivateRefCountEntry *ref;
- ref = GetPrivateRefCountEntry(buffer, false);
+ ref = GetSharedBufferEntry(buffer);
if (ref == NULL)
elog(ERROR, "lock %d is not held", buffer);
- mode = ref->data.lockmode;
- ref->data.lockmode = BUFFER_LOCK_UNLOCK;
+ mode = SharedBufferGetLockMode(ref);
+ SharedBufferSetLockMode(ref, BUFFER_LOCK_UNLOCK);
return mode;
}
@@ -6384,12 +5849,12 @@ static bool
BufferLockHeldByMeInMode(BufferDesc *buf_hdr, BufferLockMode mode)
{
PrivateRefCountEntry *entry =
- GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+ GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
if (!entry)
return false;
else
- return entry->data.lockmode == mode;
+ return SharedBufferGetLockMode(entry) == mode;
}
/*
@@ -6402,12 +5867,12 @@ static bool
BufferLockHeldByMe(BufferDesc *buf_hdr)
{
PrivateRefCountEntry *entry =
- GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+ GetSharedBufferEntry(BufferDescriptorGetBuffer(buf_hdr));
if (!entry)
return false;
else
- return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
+ return SharedBufferGetLockMode(entry) != BUFFER_LOCK_UNLOCK;
}
/*
@@ -6503,9 +5968,13 @@ CheckBufferIsPinnedOnce(Buffer buffer)
}
else
{
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ {
+ PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+ int32 refcount = ref ? SharedBufferRefCount(ref) : 0;
+
+ if (refcount != 1)
+ elog(ERROR, "incorrect local pin count: %d", refcount);
+ }
}
}
@@ -6686,7 +6155,7 @@ HoldingBufferPinThatDelaysRecovery(void)
if (bufid < 0)
return false;
- if (GetPrivateRefCount(bufid + 1) > 0)
+ if (GetSharedBufferEntry(bufid + 1) != NULL)
return true;
return false;
@@ -6721,8 +6190,12 @@ ConditionalLockBufferForCleanup(Buffer buffer)
}
/* There should be exactly one local pin */
- refcount = GetPrivateRefCount(buffer);
- Assert(refcount);
+ {
+ PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+ refcount = ref ? SharedBufferRefCount(ref) : 0;
+ Assert(refcount);
+ }
if (refcount != 1)
return false;
@@ -6776,8 +6249,12 @@ IsBufferCleanupOK(Buffer buffer)
}
/* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- return false;
+ {
+ PrivateRefCountEntry *ref = GetSharedBufferEntry(buffer);
+
+ if (!ref || SharedBufferRefCount(ref) != 1)
+ return false;
+ }
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -7447,7 +6924,7 @@ ResOwnerReleaseBuffer(Datum res)
{
PrivateRefCountEntry *ref;
- ref = GetPrivateRefCountEntry(buffer, false);
+ ref = GetSharedBufferEntry(buffer);
/* not having a private refcount would imply resowner corruption */
Assert(ref != NULL);
@@ -7456,7 +6933,7 @@ ResOwnerReleaseBuffer(Datum res)
* If the buffer was locked at the time of the resowner release,
* release the lock now. This should only happen after errors.
*/
- if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
+ if (SharedBufferGetLockMode(ref) != BUFFER_LOCK_UNLOCK)
{
BufferDesc *buf = GetBufferDescriptor(buffer - 1);
@@ -7549,7 +7026,6 @@ EvictUnpinnedBuffer(Buffer buf, bool *buffer_flushed)
/* Make sure we can pin the buffer. */
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
desc = GetBufferDescriptor(buf - 1);
LockBufHdr(desc);
@@ -7590,7 +7066,6 @@ EvictAllUnpinnedBuffers(int32 *buffers_evicted, int32 *buffers_flushed,
continue;
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
LockBufHdr(desc);
@@ -7644,7 +7119,6 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted,
/* Make sure we can pin the buffer. */
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
buf_state = LockBufHdr(desc);
@@ -7736,7 +7210,6 @@ MarkDirtyUnpinnedBuffer(Buffer buf, bool *buffer_already_dirty)
/* Make sure we can pin the buffer. */
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
desc = GetBufferDescriptor(buf - 1);
LockBufHdr(desc);
@@ -7789,7 +7262,6 @@ MarkDirtyRelUnpinnedBuffers(Relation rel,
/* Make sure we can pin the buffer. */
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
buf_state = LockBufHdr(desc);
@@ -7841,7 +7313,6 @@ MarkDirtyAllUnpinnedBuffers(int32 *buffers_dirtied,
continue;
ResourceOwnerEnlarge(CurrentResourceOwner);
- ReservePrivateRefCountEntry();
LockBufHdr(desc);
diff --git a/src/include/storage/buf_refcount.h b/src/include/storage/buf_refcount.h
new file mode 100644
index 00000000000..842760ad2ee
--- /dev/null
+++ b/src/include/storage/buf_refcount.h
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * buf_refcount.h
+ * Backend-private buffer refcount tracking
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/buf_refcount.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BUF_REFCOUNT_H
+#define BUF_REFCOUNT_H
+
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+
+/* Opaque handle to a private refcount entry */
+typedef struct PrivateRefCountEntry PrivateRefCountEntry;
+
+/* Initialization */
+extern void InitPrivateRefCount(void);
+
+/* Pure lookup */
+extern PrivateRefCountEntry *GetSharedBufferEntry(Buffer buffer);
+
+/* Reference counting - complex operations */
+extern PrivateRefCountEntry *SharedBufferRef(Buffer buffer);
+extern void SharedBufferRefExisting(PrivateRefCountEntry *ref);
+extern bool SharedBufferUnref(PrivateRefCountEntry *ref);
+
+/* Accessors */
+extern int32 SharedBufferRefCount(PrivateRefCountEntry *ref);
+extern BufferLockMode SharedBufferGetLockMode(PrivateRefCountEntry *ref);
+extern void SharedBufferSetLockMode(PrivateRefCountEntry *ref, BufferLockMode mode);
+extern Buffer SharedBufferGetBuffer(PrivateRefCountEntry *ref);
+
+/* Pin limiting */
+extern uint32 GetPinLimit(void);
+extern uint32 GetAdditionalPinLimit(void);
+extern void LimitAdditionalPins(uint32 *additional_pins);
+
+/* Leak checking */
+extern void CheckPrivateRefCountLeaks(void);
+
+/*
+ * Iterator for walking all private refcount entries.
+ * Used by assertion checking code in bufmgr.c.
+ */
+typedef struct PrivateRefCountIterator PrivateRefCountIterator;
+
+extern PrivateRefCountIterator *InitPrivateRefCountIterator(void);
+extern PrivateRefCountEntry *GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter);
+extern void FreePrivateRefCountIterator(PrivateRefCountIterator *iter);
+
+
+#endif /* BUF_REFCOUNT_H */
--
2.53.0
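
The accessors declared in buf_refcount.h hide how the reference count and
lock mode are stored. As a rough illustration of the packing described in
the cover letter (30 bits of count and 2 bits of lock mode in one 32-bit
word), the Python model below shows one possible layout. The bit positions
and all helper names are assumptions made for illustration; they are not
taken from the patch.

```python
# Hypothetical model of a packed refcount/lockmode word.
# Assumption: the count lives in the low 30 bits, the lock mode in the top 2.
COUNT_BITS = 30
COUNT_MASK = (1 << COUNT_BITS) - 1


def pack_entry(refcount, lockmode):
    """Pack a refcount (0..2^30-1) and a lock mode (0..3) into 32 bits."""
    if not 0 <= refcount <= COUNT_MASK:
        raise ValueError("refcount does not fit in 30 bits")
    if not 0 <= lockmode <= 3:
        raise ValueError("lock mode does not fit in 2 bits")
    return (lockmode << COUNT_BITS) | refcount


def refcount_of(word):
    """Extract the reference count from a packed word."""
    return word & COUNT_MASK


def lockmode_of(word):
    """Extract the lock mode from a packed word."""
    return word >> COUNT_BITS


def set_lockmode(word, lockmode):
    """Return a new word with the lock mode replaced and the count kept."""
    return pack_entry(refcount_of(word), lockmode)
```

With a layout like this, callers only ever go through the accessor
functions, so the representation can change without touching bufmgr.c.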
[application/octet-stream] v1-0001-Benchmark-buffer-pinning.patch (26.6K, 6-v1-0001-Benchmark-buffer-pinning.patch)
From 272fa376cf33c5e3d8fefc3b39678d01925cba1f Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Wed, 4 Mar 2026 15:13:53 +0000
Subject: [PATCH] Benchmark buffer pinning
Introduces a benchmark facility that will be used to establish a baseline
and evaluate each of the subsequent patches. It includes a test module and
a convenience Python script to run the benchmarks and plot the results.
---
.gitignore | 3 +
src/test/modules/test_buffer_pin/Makefile | 18 +
src/test/modules/test_buffer_pin/benchmark.py | 190 ++++++++
.../modules/test_buffer_pin/requirements.txt | 9 +
.../test_buffer_pin/test_buffer_pin--1.0.sql | 88 ++++
.../modules/test_buffer_pin/test_buffer_pin.c | 421 ++++++++++++++++++
.../test_buffer_pin/test_buffer_pin.control | 4 +
7 files changed, 733 insertions(+)
create mode 100644 src/test/modules/test_buffer_pin/Makefile
create mode 100755 src/test/modules/test_buffer_pin/benchmark.py
create mode 100644 src/test/modules/test_buffer_pin/requirements.txt
create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.c
create mode 100644 src/test/modules/test_buffer_pin/test_buffer_pin.control
diff --git a/.gitignore b/.gitignore
index 4e911395fe3..bc7a0314380 100644
--- a/.gitignore
+++ b/.gitignore
@@ -43,3 +43,6 @@ lib*.pc
/Release/
/tmp_install/
/portlock/
+
+# hidden files
+.*
\ No newline at end of file
diff --git a/src/test/modules/test_buffer_pin/Makefile b/src/test/modules/test_buffer_pin/Makefile
new file mode 100644
index 00000000000..2707f84c5b0
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/Makefile
@@ -0,0 +1,18 @@
+MODULE_big = test_buffer_pin
+OBJS = test_buffer_pin.o
+
+EXTENSION = test_buffer_pin
+DATA = test_buffer_pin--1.0.sql
+
+PGFILEDESC = "test_buffer_pin - buffer pinning benchmarks"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_buffer_pin
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_buffer_pin/benchmark.py b/src/test/modules/test_buffer_pin/benchmark.py
new file mode 100755
index 00000000000..0cf795a50c1
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/benchmark.py
@@ -0,0 +1,190 @@
+#!/usr/bin/env python3
+"""
+Buffer pinning benchmark - runs SQL benchmarks and saves results.
+
+Usage:
+ python3 benchmark.py [options]
+
+Examples:
+ python3 benchmark.py --port 5434 --name pg18-baseline --title "PostgreSQL 18 Baseline"
+ python3 benchmark.py --port 5434 --name patch-0005 --title "With Patch 0005"
+"""
+
+import argparse
+import numpy as np
+import pandas as pd
+from sqlalchemy import create_engine, text
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+
+def geomspace_int(start, stop, num):
+ """Generate geometrically spaced integers, pushing duplicates forward."""
+ raw = np.geomspace(start, stop, num)
+ result = []
+ for v in raw:
+ candidate = max(int(round(v)), result[-1] + 1 if result else 1)
+ result.append(candidate)
+ return result
+
+
+def ensure_test_table(conn, table_name, min_blocks):
+ """Create a table with enough blocks for the benchmark."""
+ rows_needed = min_blocks * 226 + 10000
+ print(f"Creating table {table_name} with ~{min_blocks} blocks...")
+ conn.execute(text(f"DROP TABLE IF EXISTS {table_name}"))
+ conn.execute(text(f"CREATE UNLOGGED TABLE {table_name} AS SELECT generate_series(1, {rows_needed}) AS id"))
+ result = conn.execute(text(f"SELECT pg_relation_size('{table_name}') / 8192 as blocks")).fetchone()
+ assert result[0] >= min_blocks, f"Table {table_name} has {result[0]} blocks, expected at least {min_blocks}"
+
+def run_benchmark(conn, func_template, buffer_counts):
+ """Run a benchmark function across all buffer counts and patterns in one query."""
+ array_str = ", ".join(str(n) for n in buffer_counts)
+ sql = text(f"""
+ SELECT
+ CASE WHEN r.random THEN 'random' ELSE 'sequential' END as pattern,
+ b.num_buffers,
+ percentile_cont(0.5) WITHIN GROUP (ORDER BY bench.per_op_ns) as median_ns
+ FROM
+ unnest(ARRAY[false, true]) AS r(random),
+ unnest(ARRAY[{array_str}]) AS b(num_buffers),
+ LATERAL {func_template} AS bench
+ WHERE bench.iteration > 0
+ GROUP BY r.random, b.num_buffers
+ ORDER BY r.random, b.num_buffers
+ """)
+ return [dict(row._mapping) for row in conn.execute(sql)]
+
+
+def save_results(results_dir, name, results_dict, title):
+ """Save benchmark results to CSV and return the combined DataFrame."""
+ results_dir = Path(results_dir)
+ results_dir.mkdir(parents=True, exist_ok=True)
+
+ all_data = []
+ for bench_name, data in results_dict.items():
+ df = pd.DataFrame(data)
+ df['benchmark'] = bench_name
+ all_data.append(df)
+
+ combined_df = pd.concat(all_data, ignore_index=True)
+
+ csv_path = results_dir / f"{name}.csv"
+ combined_df.to_csv(csv_path, index=False)
+ print(f"Saved CSV: {csv_path}")
+ return combined_df
+
+
+def plot_results(results_dir, name, results_dict, title):
+ """Generate SVG plot from benchmark results."""
+ results_dir = Path(results_dir)
+
+ fig, ax = plt.subplots(figsize=(10, 6))
+
+ benchmarks = [
+ ('read', 'blue', 'o', 'ReadBuffer/ReleaseBuffer'),
+ ('pinning', 'green', 's', 'IncrBufferRefCount/ReleaseBuffer'),
+ ('locking', 'purple', 'd', 'LockBuffer/UnlockBuffer'),
+ ('resowner', 'red', '^', 'ResourceOwner Remember/Forget'),
+ ]
+
+ for bench_name, color, marker, label in benchmarks:
+ if bench_name not in results_dict:
+ continue
+ df = pd.DataFrame(results_dict[bench_name])
+
+ seq_data = df[df['pattern'] == 'sequential']
+ ax.plot(seq_data['num_buffers'], seq_data['median_ns'],
+ color=color, linestyle='-', marker=marker,
+ linewidth=2, markersize=6, label=f'{label} (seq)')
+
+ rand_data = df[df['pattern'] == 'random']
+ ax.plot(rand_data['num_buffers'], rand_data['median_ns'],
+ color=color, linestyle='--', marker=marker,
+ linewidth=2, markersize=6, label=f'{label} (rand)')
+
+ ax.set_xlabel('Number of Buffers (sliding window)')
+ ax.set_ylabel('Time per operation (ns)')
+ ax.set_title(title)
+ ax.set_ylim(0, None)
+ ax.grid(True, alpha=0.3)
+ ax.set_xscale('log', base=2)
+ ax.legend(fontsize=8, loc='upper left')
+
+ plt.tight_layout()
+ svg_path = results_dir / f"{name}.svg"
+ plt.savefig(svg_path, format='svg')
+ plt.close()
+ print(f"Saved SVG: {svg_path}")
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Buffer pinning benchmark')
+ parser.add_argument('--iterations', '-n', type=int, default=10000,
+ help='Number of operations per sample (default: 10000)')
+ parser.add_argument('--samples', type=int, default=50,
+ help='Number of samples per data point (default: 50)')
+ parser.add_argument('--port', type=int, default=5432,
+ help='PostgreSQL port (default: 5432)')
+ parser.add_argument('--points', type=int, default=50,
+ help='Number of data points (default: 50)')
+ parser.add_argument('--max-dist', type=int, default=500,
+ help='Maximum buffer count to test (default: 500)')
+ parser.add_argument('--name', default='benchmark',
+ help='Name for output files (default: benchmark)')
+ parser.add_argument('--title', default='PostgreSQL Buffer Pinning Performance',
+ help='Title for the plot')
+ parser.add_argument('--results-dir', default=None,
+ help='Directory for results (default: results/ next to this script)')
+ parser.add_argument('--table', default='bench_large',
+ help='Table name to use for benchmarks (default: bench_large)')
+ args = parser.parse_args()
+
+ if args.results_dir is None:
+ args.results_dir = Path(__file__).parent / 'results'
+
+ engine = create_engine(f'postgresql://localhost:{args.port}/postgres')
+ buffer_counts = geomspace_int(1, args.max_dist, args.points)
+
+ min_blocks_needed = args.max_dist + 100
+
+ with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
+ conn.execute(text("CREATE EXTENSION IF NOT EXISTS test_buffer_pin"))
+ ensure_test_table(conn, args.table, min_blocks_needed)
+
+
+ n, s = args.iterations, args.samples
+ print(f"\nRunning benchmarks: {n} iterations, {s} samples, max {args.max_dist} buffers")
+
+ results = {}
+
+ print(" Running: bench_pinning...")
+ results['pinning'] = run_benchmark(conn,
+ f"bench_pinning('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+ buffer_counts)
+
+ print(" Running: bench_locking...")
+ results['locking'] = run_benchmark(conn,
+ f"bench_locking('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+ buffer_counts)
+
+ print(" Running: bench_prefetch_pipeline...")
+ results['read'] = run_benchmark(conn,
+ f"bench_prefetch_pipeline('{args.table}', b.num_buffers, {n}, {s}, r.random)",
+ buffer_counts)
+
+ print(" Running: bench_resowner...")
+ results['resowner'] = run_benchmark(conn,
+ f"bench_resowner(b.num_buffers, {n}, {s}, r.random)",
+ buffer_counts)
+
+ save_results(args.results_dir, args.name, results, args.title)
+ plot_results(args.results_dir, args.name, results, args.title)
+
+ print(f"\nDone! Results saved to {args.results_dir}/{args.name}.*")
+
+
+if __name__ == '__main__':
+ main()
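
The buffer counts tested by benchmark.py come from geomspace_int(), which
rounds geometrically spaced values and pushes duplicates forward so that
every data point is distinct. The sketch below reimplements the same idea
with only the standard library (math instead of numpy), so it is an
approximation of the script's helper rather than a copy of it.

```python
def geomspace_int(start, stop, num):
    """Geometrically spaced integers, strictly increasing.

    Pure-Python stand-in for the numpy-based helper: each raw value is
    rounded, and a value that would repeat or go backwards is bumped to
    one more than its predecessor.
    """
    if num == 1:
        return [start]
    ratio = stop / start
    result = []
    for i in range(num):
        raw = start * ratio ** (i / (num - 1))
        floor_value = result[-1] + 1 if result else 1
        result.append(max(int(round(raw)), floor_value))
    return result


print(geomspace_int(1, 16, 5))  # [1, 2, 4, 8, 16]
print(geomspace_int(1, 4, 6))   # [1, 2, 3, 4, 5, 6]
```

The second call shows a side effect worth keeping in mind: when more
points are requested than there are integers in the range, the sequence
runs past the requested maximum.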
diff --git a/src/test/modules/test_buffer_pin/requirements.txt b/src/test/modules/test_buffer_pin/requirements.txt
new file mode 100644
index 00000000000..e73ab569275
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/requirements.txt
@@ -0,0 +1,9 @@
+# Requirements for benchmark.py
+# pip install -r requirements.txt
+# not enforcing versions, as it might simply
+# work with your installed versions
+numpy # tested with 1.24+
+pandas # tested with 2.0+
+sqlalchemy # tested with 2.0+
+matplotlib # tested with 3.7+
+psycopg2-binary # tested with 2.9+
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
new file mode 100644
index 00000000000..2fb5577fac2
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin--1.0.sql
@@ -0,0 +1,88 @@
+/* test_buffer_pin--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_buffer_pin" to load this file. \quit
+
+CREATE FUNCTION bench_prefetch_pipeline(
+ relname text,
+ prefetch_dist int,
+ num_ops int,
+ iterations int,
+ random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_prefetch_pipeline'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_prefetch_pipeline IS
+'Benchmark buffer pinning with sliding window prefetch simulation.
+Arguments:
+ relname - name of the relation to use
+ prefetch_dist - number of buffers to keep pinned (sliding window size)
+ num_ops - number of pin/unpin operations to perform
+ iterations - number of times to repeat the benchmark
+ random_access - if true, access blocks randomly; if false, sequentially
+';
+
+CREATE FUNCTION bench_pinning(
+ relname text,
+ num_buffers int,
+ num_ops int,
+ iterations int,
+ random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_pinning'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_pinning IS
+'Benchmark pure pin/unpin operations without buffer lookup.
+Uses IncrBufferRefCount/ReleaseBuffer on pre-pinned buffers to isolate
+refcount tracking overhead from buffer table lookups and I/O.
+Arguments:
+ relname - name of the relation to use for pinning buffers
+ num_buffers - number of buffers to keep as base pins
+ num_ops - number of pin/unpin pairs to perform
+ iterations - number of times to repeat the benchmark
+ random_access - if true, access buffers randomly; if false, sequentially';
+
+CREATE FUNCTION bench_locking(
+ relname text,
+ num_buffers int,
+ num_ops int,
+ iterations int,
+ random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_locking'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_locking IS
+'Benchmark buffer lock/unlock operations.
+Pre-pins all blocks then times LockBuffer/UnlockBuffer cycles over
+different buffers to measure locking overhead separately from pinning.
+Arguments:
+ relname - name of the relation to use
+ num_buffers - sliding window size for locks
+ num_ops - number of lock/unlock pairs to perform
+ iterations - number of times to repeat the benchmark
+ random_access - if true, access buffers randomly; if false, sequentially';
+
+CREATE FUNCTION bench_resowner(
+ num_buffers int,
+ num_ops int,
+ iterations int,
+ random_access bool
+)
+RETURNS TABLE(iteration int, total_ns bigint, per_op_ns float8)
+AS 'MODULE_PATHNAME', 'bench_resowner'
+LANGUAGE C STRICT;
+
+COMMENT ON FUNCTION bench_resowner IS
+'Benchmark ResourceOwner remember/forget operations only.
+Uses fake resource values - no actual resources are tracked.
+Arguments:
+ num_buffers - number of fake resources to track
+ num_ops - number of remember/forget pairs to perform
+ iterations - number of times to repeat the benchmark
+ random_access - if true, access resources randomly; if false, sequentially';
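
The sliding window described for bench_prefetch_pipeline and bench_pinning
keeps a fixed number of buffers pinned, releasing the oldest pin before
taking a new one. The Python model below mimics that loop shape with plain
lists to show the ramp-up, steady state, and ramp-down. The function name
and the event tuples are made up for illustration, and no PostgreSQL API is
involved.

```python
def simulate_pipeline(block_sequence, num_buffers):
    """Model the pin/release order of a sliding window of num_buffers pins.

    Requires len(block_sequence) >= num_buffers, mirroring the parameter
    check in the C benchmark functions.
    Returns (events, max_pinned, still_pinned).
    """
    num_ops = len(block_sequence)
    if num_buffers < 1 or num_ops < num_buffers:
        raise ValueError("invalid parameters")

    pipeline = [None] * num_buffers  # circular buffer of held pins
    events = []
    pinned = 0
    max_pinned = 0

    for i in range(num_ops + num_buffers):
        if i >= num_buffers:
            # Release the pin taken num_buffers steps ago.
            slot = (i - num_buffers) % num_buffers
            events.append(("release", pipeline[slot]))
            pipeline[slot] = None
            pinned -= 1
        if i < num_ops:
            # Take a new pin into the slot that was just freed.
            pipeline[i % num_buffers] = block_sequence[i]
            events.append(("pin", block_sequence[i]))
            pinned += 1
            max_pinned = max(max_pinned, pinned)

    return events, max_pinned, pinned


events, max_pinned, still_pinned = simulate_pipeline(list(range(10)), 3)
print(max_pinned, still_pinned)  # 3 0
```

The number of simultaneously held pins never exceeds num_buffers, and all
pins are released by the end of the loop, which is what makes num_buffers
the interesting x-axis for the refcount data structure.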
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.c b/src/test/modules/test_buffer_pin/test_buffer_pin.c
new file mode 100644
index 00000000000..d5e59974dcb
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.c
@@ -0,0 +1,421 @@
+/*
+ * test_buffer_pin.c - Buffer pinning benchmark
+ */
+
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "catalog/namespace.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+#include "utils/resowner.h"
+#include "utils/varlena.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(bench_prefetch_pipeline);
+PG_FUNCTION_INFO_V1(bench_pinning);
+PG_FUNCTION_INFO_V1(bench_locking);
+PG_FUNCTION_INFO_V1(bench_resowner);
+
+/* Custom ResourceOwnerDesc for benchmark - does nothing on release */
+static void bench_release_resource(Datum res) { /* no-op */ }
+
+static const ResourceOwnerDesc bench_resowner_desc = {
+ .name = "BenchmarkResource",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_FIRST,
+ .ReleaseResource = bench_release_resource,
+ .DebugPrint = NULL,
+};
+
+/*
+ * Generate an access sequence for benchmarking.
+ * If random_access is true, uses Fisher-Yates shuffle.
+ * Otherwise, generates sequential pattern modulo num_items.
+ */
+static void
+generate_access_sequence(int *sequence, int num_operations, int num_items, bool random_access)
+{
+ for (int i = 0; i < num_operations; i++)
+ sequence[i] = i % num_items;
+
+ if (random_access)
+ {
+ for (int i = num_operations - 1; i > 0; i--)
+ {
+ int j = pg_prng_uint64_range(&pg_global_prng_state, 0, i);
+ int tmp = sequence[i];
+ sequence[i] = sequence[j];
+ sequence[j] = tmp;
+ }
+ }
+}
+
+
+/*
+ * bench_prefetch_pipeline - benchmark ReadBuffer/ReleaseBuffer in a sliding window
+ *
+ * Warms up the cache by reading up to num_operations blocks of the relation.
+ * Precomputes the block sequence in which the buffers will be read.
+ * Then repeatedly times a scan that follows the block sequence,
+ * keeping a fixed distance of `num_buffers` pins (plus ramp up and ramp down).
+ */
+Datum
+bench_prefetch_pipeline(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ text *relname_text = PG_GETARG_TEXT_PP(0);
+ int num_buffers = PG_GETARG_INT32(1);
+ int num_operations = PG_GETARG_INT32(2);
+ int iterations = PG_GETARG_INT32(3);
+ bool random_access = PG_GETARG_BOOL(4);
+
+ Oid relid;
+ Relation rel;
+ BlockNumber nblocks;
+ Buffer *pipeline;
+ int *block_sequence;
+ int iter;
+
+ Tuplestorestate *tupstore;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+
+ if (num_buffers < 1 || num_operations < num_buffers)
+ ereport(ERROR, (errmsg("invalid parameters")));
+
+ InitMaterializedSRF(fcinfo, 0);
+ tupstore = rsinfo->setResult;
+ tupdesc = rsinfo->setDesc;
+
+ relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+ rel = relation_open(relid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks == 0)
+ ereport(ERROR, (errmsg("relation has no blocks")));
+
+ pipeline = palloc0(num_buffers * sizeof(Buffer));
+ block_sequence = palloc(num_operations * sizeof(int));
+
+ /* Warm up */
+ for (int i = 0; i < num_operations && i < (int) nblocks; i++)
+ {
+ Buffer buf = ReadBuffer(rel, i);
+ ReleaseBuffer(buf);
+ }
+ generate_access_sequence(block_sequence, num_operations, nblocks, random_access);
+
+ for (iter = 0; iter < iterations; iter++)
+ {
+ instr_time start_time, end_time;
+ int64 elapsed_ns;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ for (int i = 0; i < num_operations + num_buffers; i++)
+ {
+ if (i >= num_buffers)
+ ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+ if (i < num_operations)
+ pipeline[i % num_buffers] = ReadBuffer(rel, block_sequence[i]);
+ }
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SUBTRACT(end_time, start_time);
+ elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+ values[0] = Int32GetDatum(iter);
+ values[1] = Int64GetDatum(elapsed_ns);
+ values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ pfree(pipeline);
+ pfree(block_sequence);
+ relation_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * bench_pinning - benchmark pure pin/unpin operations
+ *
+ * Pre-reads buffers into cache, then times IncrBufferRefCount/ReleaseBuffer
+ * cycles in a pipelined fashion. This is intended to isolate the part of
+ * ReadBuffer/ReleaseBuffer whose cost depends on the number of pins held.
+ */
+Datum
+bench_pinning(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ text *relname_text = PG_GETARG_TEXT_PP(0);
+ int num_buffers = PG_GETARG_INT32(1);
+ int num_operations = PG_GETARG_INT32(2);
+ int iterations = PG_GETARG_INT32(3);
+ bool random_access = PG_GETARG_BOOL(4);
+
+ Oid relid;
+ Relation rel;
+ BlockNumber nblocks;
+ Buffer *base_buffers;
+ Buffer *pipeline;
+ int *access_sequence;
+ int iter;
+
+ Tuplestorestate *tupstore;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+
+ if (num_buffers < 1 || num_operations < num_buffers)
+ ereport(ERROR, (errmsg("invalid parameters")));
+
+ InitMaterializedSRF(fcinfo, 0);
+ tupstore = rsinfo->setResult;
+ tupdesc = rsinfo->setDesc;
+
+ relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+ rel = relation_open(relid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if ((BlockNumber) num_buffers > nblocks)
+ ereport(ERROR, (errmsg("not enough blocks in relation")));
+
+ base_buffers = palloc(num_buffers * sizeof(Buffer));
+ pipeline = palloc(num_buffers * sizeof(Buffer));
+ access_sequence = palloc(num_operations * sizeof(int));
+
+ generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+ /* Pin the buffers as base pins (keeps them in cache) */
+ for (int i = 0; i < num_buffers; i++)
+ base_buffers[i] = ReadBuffer(rel, i);
+
+ for (iter = 0; iter < iterations; iter++)
+ {
+ instr_time start_time, end_time;
+ int64 elapsed_ns;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ for (int i = 0; i < num_operations + num_buffers; i++)
+ {
+ if (i >= num_buffers)
+ ReleaseBuffer(pipeline[(i - num_buffers) % num_buffers]);
+ if (i < num_operations)
+ {
+ Buffer buf = base_buffers[access_sequence[i]];
+ IncrBufferRefCount(buf);
+ pipeline[i % num_buffers] = buf;
+ }
+ }
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SUBTRACT(end_time, start_time);
+ elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+ values[0] = Int32GetDatum(iter);
+ values[1] = Int64GetDatum(elapsed_ns);
+ values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* Release base pins */
+ for (int i = 0; i < num_buffers; i++)
+ ReleaseBuffer(base_buffers[i]);
+
+ pfree(access_sequence);
+ pfree(pipeline);
+ pfree(base_buffers);
+ relation_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * bench_locking - benchmark buffer lock/unlock operations
+ *
+ * Pre-pins all blocks to keep them in cache, then times LockBuffer/UnlockBuffer
+ * cycles in a pipelined fashion over different buffers. This measures locking
+ * overhead separately from pinning, while accessing many different buffers.
+ */
+Datum
+bench_locking(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ text *relname_text = PG_GETARG_TEXT_PP(0);
+ int num_buffers = PG_GETARG_INT32(1);
+ int num_operations = PG_GETARG_INT32(2);
+ int iterations = PG_GETARG_INT32(3);
+ bool random_access = PG_GETARG_BOOL(4);
+
+ Oid relid;
+ Relation rel;
+ BlockNumber nblocks;
+ Buffer *base_buffers;
+ Buffer *pipeline;
+ int *block_sequence;
+ int iter;
+
+ Tuplestorestate *tupstore;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+
+ if (num_buffers < 1 || num_operations < num_buffers)
+ ereport(ERROR, (errmsg("invalid parameters")));
+
+ InitMaterializedSRF(fcinfo, 0);
+ tupstore = rsinfo->setResult;
+ tupdesc = rsinfo->setDesc;
+
+ relid = RangeVarGetRelid(makeRangeVarFromNameList(textToQualifiedNameList(relname_text)), AccessShareLock, false);
+ rel = relation_open(relid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks == 0)
+ ereport(ERROR, (errmsg("relation has no blocks")));
+
+ base_buffers = palloc(nblocks * sizeof(Buffer));
+ pipeline = palloc(num_buffers * sizeof(Buffer));
+ block_sequence = palloc(num_operations * sizeof(int));
+
+ /* Generate access pattern over all blocks */
+ generate_access_sequence(block_sequence, num_operations, nblocks, random_access);
+
+ /* Pin ALL blocks as base pins to keep them in cache */
+ for (BlockNumber i = 0; i < nblocks; i++)
+ base_buffers[i] = ReadBuffer(rel, i);
+
+ for (iter = 0; iter < iterations; iter++)
+ {
+ instr_time start_time, end_time;
+ int64 elapsed_ns;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ for (int i = 0; i < num_operations + num_buffers; i++)
+ {
+ if (i >= num_buffers)
+ LockBuffer(pipeline[(i - num_buffers) % num_buffers], BUFFER_LOCK_UNLOCK);
+ if (i < num_operations)
+ {
+ Buffer buf = base_buffers[block_sequence[i]];
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ pipeline[i % num_buffers] = buf;
+ }
+ }
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SUBTRACT(end_time, start_time);
+ elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+ values[0] = Int32GetDatum(iter);
+ values[1] = Int64GetDatum(elapsed_ns);
+ values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ /* Release all base pins */
+ for (BlockNumber i = 0; i < nblocks; i++)
+ ReleaseBuffer(base_buffers[i]);
+
+ pfree(block_sequence);
+ pfree(pipeline);
+ pfree(base_buffers);
+ relation_close(rel, AccessShareLock);
+
+ return (Datum) 0;
+}
+
+/*
+ * bench_resowner - benchmark ResourceOwner remember/forget operations
+ *
+ * Uses fake resource values to test pure ResourceOwner tracking overhead
+ * without any actual resource operations. Uses the same pipelined structure
+ * as the other benchmarks.
+ */
+Datum
+bench_resowner(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ int num_buffers = PG_GETARG_INT32(0);
+ int num_operations = PG_GETARG_INT32(1);
+ int iterations = PG_GETARG_INT32(2);
+ bool random_access = PG_GETARG_BOOL(3);
+
+ Datum *resources;
+ Datum *pipeline;
+ int *access_sequence;
+ int iter;
+
+ Tuplestorestate *tupstore;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false, false, false};
+
+ if (num_buffers < 1 || num_operations < num_buffers)
+ ereport(ERROR, (errmsg("invalid parameters")));
+
+ InitMaterializedSRF(fcinfo, 0);
+ tupstore = rsinfo->setResult;
+ tupdesc = rsinfo->setDesc;
+
+ resources = palloc(num_buffers * sizeof(Datum));
+ pipeline = palloc(num_buffers * sizeof(Datum));
+ access_sequence = palloc(num_operations * sizeof(int));
+
+ /* Use fake resource values (pointers to array elements for uniqueness) */
+ for (int i = 0; i < num_buffers; i++)
+ resources[i] = PointerGetDatum(&resources[i]);
+
+ generate_access_sequence(access_sequence, num_operations, num_buffers, random_access);
+
+ for (iter = 0; iter < iterations; iter++)
+ {
+ instr_time start_time, end_time;
+ int64 elapsed_ns;
+
+ INSTR_TIME_SET_CURRENT(start_time);
+
+ for (int i = 0; i < num_operations + num_buffers; i++)
+ {
+ if (i >= num_buffers)
+ ResourceOwnerForget(CurrentResourceOwner,
+ pipeline[(i - num_buffers) % num_buffers],
+ &bench_resowner_desc);
+ if (i < num_operations)
+ {
+ Datum res = resources[access_sequence[i]];
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+ ResourceOwnerRemember(CurrentResourceOwner, res, &bench_resowner_desc);
+ pipeline[i % num_buffers] = res;
+ }
+ }
+
+ INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SUBTRACT(end_time, start_time);
+ elapsed_ns = INSTR_TIME_GET_NANOSEC(end_time);
+
+ values[0] = Int32GetDatum(iter);
+ values[1] = Int64GetDatum(elapsed_ns);
+ values[2] = Float8GetDatum((double) elapsed_ns / num_operations);
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ }
+
+ pfree(access_sequence);
+ pfree(pipeline);
+ pfree(resources);
+
+ return (Datum) 0;
+}
diff --git a/src/test/modules/test_buffer_pin/test_buffer_pin.control b/src/test/modules/test_buffer_pin/test_buffer_pin.control
new file mode 100644
index 00000000000..f0659417127
--- /dev/null
+++ b/src/test/modules/test_buffer_pin/test_buffer_pin.control
@@ -0,0 +1,4 @@
+comment = 'test_buffer_pin - buffer pinning benchmarks'
+default_version = '1.0'
+module_pathname = '$libdir/test_buffer_pin'
+relocatable = true
--
2.53.0
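
The benchmark functions in the patch above (bench_pinning, bench_locking, bench_resowner) share one pipelined loop: each step first releases the item acquired num_buffers steps earlier, then acquires the next item from the access sequence. The sketch below models only that ordering in Python; it is not part of the patch, and the names pipelined_ops and max_outstanding are hypothetical. It assumes num_operations >= num_buffers, which the C functions enforce with their "invalid parameters" check.

```python
def pipelined_ops(access_sequence, num_buffers):
    """Yield ("acquire" | "release", item) in the order the C loops issue them."""
    num_operations = len(access_sequence)
    pipeline = [None] * num_buffers  # ring buffer of currently held items
    for i in range(num_operations + num_buffers):
        if i >= num_buffers:
            # Release whatever was acquired num_buffers steps ago (FIFO).
            yield ("release", pipeline[(i - num_buffers) % num_buffers])
        if i < num_operations:
            item = access_sequence[i]
            yield ("acquire", item)
            pipeline[i % num_buffers] = item


def max_outstanding(events):
    """Return (peak number of simultaneously held items, number still held at the end)."""
    held = 0
    peak = 0
    for kind, _item in events:
        held += 1 if kind == "acquire" else -1
        peak = max(peak, held)
    return peak, held


if __name__ == "__main__":
    events = list(pipelined_ops([0, 1, 2, 3] * 3, num_buffers=4))
    print(max_outstanding(events))
```

In this model the number of outstanding acquisitions ramps up to num_buffers, stays there, and drains to zero in the trailing num_buffers steps, so the timed region always contains exactly num_operations acquire/release pairs.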
[application/octet-stream] v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch (6.5K, 7-v1-0005-REFCOUNT_ARRAY_ENTRIES-0.patch)
From 85abc829474a4b0f40461af169dd0803e2e22c88 Mon Sep 17 00:00:00 2001
From: Alexandre Felipe <[email protected]>
Date: Fri, 6 Mar 2026 17:46:38 +0000
Subject: [PATCH 5/5] REFCOUNT_ARRAY_ENTRIES=0
The simplehash performance is fairly close to that of the direct array.
For few pins we are trading one small for loop for an index calculation
(buffer % N); this could become (buffer & (N-1)) if we restrict the
simple array to power-of-2 sizes.
When more than REFCOUNT_ARRAY_ENTRIES distinct buffers are pinned/unpinned
in a FIFO fashion, this is a strict improvement, as every pin already
requires a hash table operation.
---
src/backend/storage/buffer/buf_refcount.c | 60 +-
src/include/storage/buf_refcount.h | 3 +
2 files changed
diff --git a/src/backend/storage/buffer/buf_refcount.c b/src/backend/storage/buffer/buf_refcount.c
index 29dfb720997..90cb42edbb5 100644
--- a/src/backend/storage/buffer/buf_refcount.c
+++ b/src/backend/storage/buffer/buf_refcount.c
@@ -88,17 +88,18 @@ struct PrivateRefCountIterator
#define SH_DEFINE
#include "lib/simplehash.h"
-/* Private refcount array and keys */
-#define REFCOUNT_ARRAY_ENTRIES 8
+
+/*
+ * Private refcount array and keys
+ * If set to 0, all the code handling the transfers between the array
+ * and the hash table is disabled at compilation time.
+*/
+#define REFCOUNT_ARRAY_ENTRIES 0
+
+#if REFCOUNT_ARRAY_ENTRIES > 0
static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
-/* Overflow hash table for when array is full */
-static refcount_hash *PrivateRefCountHash = NULL;
-
-/* Count of entries that have overflowed into the hash table */
-static int32 PrivateRefCountOverflowed = 0;
-
/* Clock hand for selecting victim when array is full */
static uint32 PrivateRefCountClock = 0;
@@ -107,14 +108,23 @@ static int ReservedRefCountSlot = -1;
/* Cache for last accessed entry */
static int PrivateRefCountEntryLast = -1;
+#endif
/* Advisory limit on the number of pins each backend should hold */
static uint32 MaxProportionalPins = 0;
+/* Hash table (overflow when array used, primary when hash-only) */
+static refcount_hash *PrivateRefCountHash = NULL;
+
+/* Count of entries in the hash table */
+static int32 PrivateRefCountOverflowed = 0;
+
+#if REFCOUNT_ARRAY_ENTRIES > 0
/* Forward declarations */
static void ReservePrivateRefCountEntry(void);
static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
static pg_noinline PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer);
+#endif
/*
* Initialize private refcount tracking for this backend.
@@ -130,12 +140,15 @@ InitPrivateRefCount(void)
* GetAdditionalPinLimit() can be used to check the remaining balance.
*/
MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+#if REFCOUNT_ARRAY_ENTRIES > 0
memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
memset(&PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));
+#endif
PrivateRefCountHash = refcount_create(CurrentMemoryContext, 64, NULL);
}
+#if REFCOUNT_ARRAY_ENTRIES > 0
/*
* Ensure that the PrivateRefCountArray has sufficient space to store one more
* entry.
@@ -233,7 +246,9 @@ NewPrivateRefCountEntry(Buffer buffer)
return res;
}
+#endif /* REFCOUNT_ARRAY_ENTRIES > 0 */
+#if REFCOUNT_ARRAY_ENTRIES > 0
/*
* Slow-path for GetSharedBufferEntry().
*/
@@ -270,6 +285,7 @@ GetPrivateRefCountEntrySlow(Buffer buffer)
res = refcount_lookup(PrivateRefCountHash, buffer);
return res;
}
+#endif /* REFCOUNT_ARRAY_ENTRIES > 0 */
/*
* Return the PrivateRefCount entry for the passed buffer.
@@ -281,6 +297,7 @@ GetSharedBufferEntry(Buffer buffer)
Assert(BufferIsValid(buffer));
Assert(!BufferIsLocal(buffer));
+#if REFCOUNT_ARRAY_ENTRIES > 0
/* Fast path: check one-entry cache */
if (likely(PrivateRefCountEntryLast != -1) &&
likely(PrivateRefCountArray[PrivateRefCountEntryLast].buffer == buffer))
@@ -289,6 +306,10 @@ GetSharedBufferEntry(Buffer buffer)
}
return GetPrivateRefCountEntrySlow(buffer);
+#else
+ /* Hash-only mode: direct lookup */
+ return refcount_lookup(PrivateRefCountHash, buffer);
+#endif
}
/*
@@ -308,10 +329,20 @@ SharedBufferRef(Buffer buffer)
if (ref == NULL)
{
- /* New pin - create entry */
+#if REFCOUNT_ARRAY_ENTRIES > 0
+ /* New pin - create entry in array */
ReservePrivateRefCountEntry();
ref = NewPrivateRefCountEntry(buffer);
ref->data = ONE_PRIVATE_REFERENCE;
+#else
+ /* Hash-only mode: insert directly */
+ bool found;
+
+ ref = refcount_insert(PrivateRefCountHash, buffer, &found);
+ Assert(!found);
+ ref->data = ONE_PRIVATE_REFERENCE;
+ PrivateRefCountOverflowed++;
+#endif
}
else
{
@@ -352,6 +383,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
/* No more references - clean up the entry */
Assert(SharedBufferGetLockMode(ref) == BUFFER_LOCK_UNLOCK);
+#if REFCOUNT_ARRAY_ENTRIES > 0
if (ref >= &PrivateRefCountArray[0] &&
ref < &PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES])
{
@@ -360,6 +392,7 @@ SharedBufferUnref(PrivateRefCountEntry *ref)
ReservedRefCountSlot = ref - PrivateRefCountArray;
}
else
+#endif
{
/* could make slightly more efficient by using the pointer */
refcount_delete(PrivateRefCountHash, ref->buffer);
@@ -409,11 +442,11 @@ CheckPrivateRefCountLeaks(void)
#ifdef USE_ASSERT_CHECKING
int RefCountErrors = 0;
PrivateRefCountEntry *res;
- int i;
char *s;
+#if REFCOUNT_ARRAY_ENTRIES > 0
/* check the array */
- for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
+ for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
{
if (PrivateRefCountArrayKeys[i] != InvalidBuffer)
{
@@ -426,8 +459,9 @@ CheckPrivateRefCountLeaks(void)
RefCountErrors++;
}
}
+#endif
- /* if necessary search the hash */
+ /* search the hash */
if (PrivateRefCountOverflowed)
{
refcount_iterator iter;
@@ -467,6 +501,7 @@ InitPrivateRefCountIterator(void)
PrivateRefCountEntry *
GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
{
+#if REFCOUNT_ARRAY_ENTRIES > 0
/* First iterate through the array */
while (!iter->in_hash && iter->array_index < REFCOUNT_ARRAY_ENTRIES)
{
@@ -475,8 +510,9 @@ GetNextPrivateRefCountEntry(PrivateRefCountIterator *iter)
if (PrivateRefCountArrayKeys[idx] != InvalidBuffer)
return &PrivateRefCountArray[idx];
}
+#endif
- /* Then iterate through the hash if there are overflowed entries */
+ /* Then iterate through the hash if there are entries */
if (!iter->in_hash)
{
iter->in_hash = true;
--
2.53.0
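
The commit message of this patch notes that the index calculation (buffer % N) could become (buffer & (N-1)) if N is restricted to powers of 2. A minimal sketch of that equivalence follows, in Python rather than C and not taken from the patch; slot_mod and slot_mask are hypothetical names, and it assumes non-negative keys, which holds for shared buffer IDs.

```python
def slot_mod(buffer, n):
    """Slot index using the general modulo form."""
    return buffer % n


def slot_mask(buffer, n):
    """Slot index using a bit mask; only valid when n is a power of two."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    return buffer & (n - 1)


if __name__ == "__main__":
    # The two forms agree for every non-negative key when n is a power of two.
    print(all(slot_mod(b, 8) == slot_mask(b, 8) for b in range(1, 1000)))
```

For a size that is not a power of two the mask form is simply wrong (for example 10 % 6 is 4 but 10 & 5 is 0), which is why the restriction on N is needed.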
[text/x-sh] run-all.sh (2.7K, 8-run-all.sh)
#!/bin/bash
set -e
SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
PATCH_DIR="$SCRIPT_DIR/"
REPO_ROOT=$(cd "$SCRIPT_DIR/../.." && pwd)
BUILDS_DIR="$REPO_ROOT/.builds"
DBDATA_DIR="$REPO_ROOT/.dbdata"
RESULTS_DIR="$REPO_ROOT/src/test/modules/test_buffer_pin/results"
BENCHMARK_DIR="$REPO_ROOT/src/test/modules/test_buffer_pin"
mkdir -p "$RESULTS_DIR"
PORT=5434
cd "$REPO_ROOT"
#AM_OPTS="--whitespace=nowarn --3way"
AM_OPTS=""
function clean_repo() {
git reset --hard
# git clean -fd
}
function build_and_test() {
local NAME="$1"
local TITLE="$2"
local PREFIX="$BUILDS_DIR/$NAME"
echo ""
echo "============================================================"
echo "Building: $NAME"
echo "============================================================"
# Configure and build
make distclean >/dev/null 2>&1 || true
./configure --prefix="$PREFIX" --without-icu --without-readline --without-zlib >/dev/null
make -j8 -s
make install -s
# Build and install extension
cd "$BENCHMARK_DIR"
make clean -s 2>/dev/null || true
make PG_CONFIG="$PREFIX/bin/pg_config" USE_PGXS=1 install -s
cd "$REPO_ROOT"
echo ""
echo "============================================================"
echo "Testing: $NAME"
echo "============================================================"
# Stop any running server
"$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" stop 2>/dev/null || true
sleep 1
# Initialize fresh database
rm -rf "$DBDATA_DIR"
"$PREFIX/bin/initdb" -D "$DBDATA_DIR" >/dev/null
# Start server
"$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" -l "$REPO_ROOT/logfile_$NAME" -o "-p $PORT" start
sleep 2
# Run benchmark
mkdir -p "$RESULTS_DIR"
cd "$BENCHMARK_DIR"
python3 benchmark.py --port $PORT --name "$NAME" --title "$TITLE" \
--max-dist 10000 --points 200 --samples 50 --iterations 20000
cd "$REPO_ROOT"
# Stop server
"$PREFIX/bin/pg_ctl" -D "$DBDATA_DIR" stop
echo "Results saved to: $RESULTS_DIR/$NAME.{svg,csv,txt}"
}
# Master with patches applied incrementally
git checkout -B pins origin/master
# clean_repo
# Then apply each numbered patch and benchmark after each
for patch in "$PATCH_DIR"/*.patch; do
[ -f "$patch" ] || continue
# Extract patch number (e.g., 0001-foo.patch or v1-0001-foo.patch -> 1)
patchnum=$(basename "$patch" | sed -E 's/^(v[0-9]+-)?0*([0-9]+)-.*/\2/')
patchname=$(basename "$patch" .patch)
git am $AM_OPTS "$patch"
build_and_test "patch-$patchnum" "Patch $patchnum: $patchname"
done
echo ""
echo "============================================================"
echo "All benchmarks complete!"
echo "Results in: $RESULTS_DIR/"
echo "============================================================"
[text/x-python-script] compare-patches.py (3.4K, 9-compare-patches.py)
#!/usr/bin/env python3
"""Generate comparison charts for patch-*.csv benchmark results."""
import pandas as pd
import matplotlib.pyplot as plt
import glob
import os

RESULTS_DIR = os.path.dirname(os.path.abspath(__file__)) + '/results'

# Find all CSV files in the results directory
csv_files = sorted(glob.glob(f'{RESULTS_DIR}/*.csv'))

# Filter to main steps only
main_steps = [f'patch-{i}' for i in [1, 2, 3, 4, 5]]
csv_files = [f for f in csv_files if any(s in f for s in main_steps)]
print(f"Found {len(csv_files)} step CSV files:")
for f in csv_files:
    print(f"  - {os.path.basename(f)}")

# Load all data as pivoted tables
all_data = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    step_name = os.path.basename(csv_file).replace('.csv', '')
    pivot = df.pivot_table(index=['pattern', 'num_buffers'],
                           columns='benchmark', values='median_ns').reset_index()
    pivot['step'] = step_name
    # Compute derived metrics
    if 'read' in pivot.columns and 'resowner' in pivot.columns:
        pivot['read-resowner'] = pivot['read'] - pivot['resowner']
    all_data.append(pivot)

data = pd.concat(all_data, ignore_index=True)

# Define nice labels
step_labels = {
    'v4-pg18-baseline': 'PostgreSQL 18 Baseline',
    'patch-1': 'Patch 1: Pg19 Baseline',
    'patch-2': 'Patch 2: Refactoring',
    'patch-3': 'Patch 3: Simple hash',
    'patch-4': 'Patch 4: Compact entry',
    'patch-5': 'Patch 5: No array',
}

# Color scheme: one color per key in step_labels (zip stops at the shorter
# input, so a shorter color list would leave the last step without a color)
colors = {k: c for k, c in zip(
    step_labels.keys(),
    ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#a65628']
)}


def plot_benchmark(benchmark_name, title, output_name):
    """Plot comparison for a specific benchmark."""
    if benchmark_name not in data.columns:
        print(f"Skipping {benchmark_name}: column not found")
        return

    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    fig.suptitle(title, fontsize=14, fontweight='bold')

    for idx, pattern in enumerate(['sequential', 'random']):
        ax = axes[idx]
        ax.set_title(f'{pattern.capitalize()} Access Pattern')
        ax.set_xlabel('Number of Buffers')
        ax.set_ylabel('Time (ns)')
        ax.set_xscale('log')
        ax.grid(True, alpha=0.3)

        for step in main_steps:
            subset = data[(data['pattern'] == pattern) & (data['step'] == step)]
            if len(subset) > 0 and benchmark_name in subset.columns:
                ax.plot(subset['num_buffers'], subset[benchmark_name],
                        label=step_labels.get(step, step),
                        color=colors.get(step, 'gray'),
                        linewidth=2, marker='o', markersize=3)

        ax.legend(loc='upper left', fontsize=9)

    plt.tight_layout()
    output_path = f'{RESULTS_DIR}/{output_name}.svg'
    plt.savefig(output_path, format='svg', bbox_inches='tight')
    plt.savefig(output_path.replace('.svg', '.png'), format='png', dpi=150, bbox_inches='tight')
    print(f"Saved: {output_path}")
    plt.close()


# Generate the comparison figures
plot_benchmark('read-resowner', 'Read/Release excluding Resowner', 'compare-read-resowner')
plot_benchmark('read', 'Read/Release Performance Comparison', 'compare-read')
plot_benchmark('resowner', 'ResourceOwner Performance Comparison', 'compare-resowner')
plot_benchmark('pinning', 'Pin/Unpin Performance Comparison', 'compare-pinning')
plot_benchmark('locking', 'Lock/Unlock Performance Comparison', 'compare-locking')
print("\nDone! Generated comparison charts.")
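
The 'read-resowner' metric in compare-patches.py subtracts the resowner timing from the read timing for each (pattern, num_buffers) pair, as an estimate of the read/release cost excluding ResourceOwner bookkeeping. A standard-library sketch of the same derivation follows. The function name and the sample numbers are illustrative only, not measured results. Unlike pivot_table, which averages duplicate rows, this sketch keeps the last value seen per key.

```python
from collections import defaultdict


def derive_read_minus_resowner(rows):
    """rows: iterable of dicts with keys pattern, num_buffers, benchmark, median_ns."""
    table = defaultdict(dict)
    for row in rows:
        key = (row["pattern"], int(row["num_buffers"]))
        table[key][row["benchmark"]] = float(row["median_ns"])

    derived = {}
    for key, by_benchmark in table.items():
        # Only derive the metric when both inputs are present for this key.
        if "read" in by_benchmark and "resowner" in by_benchmark:
            derived[key] = by_benchmark["read"] - by_benchmark["resowner"]
    return derived


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = [
        {"pattern": "sequential", "num_buffers": "8", "benchmark": "read", "median_ns": "50.0"},
        {"pattern": "sequential", "num_buffers": "8", "benchmark": "resowner", "median_ns": "20.0"},
        {"pattern": "random", "num_buffers": "8", "benchmark": "read", "median_ns": "70.0"},
    ]
    print(derive_read_minus_resowner(sample))
```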