public inbox for [email protected]  
help / color / mirror / Atom feed
From: Greg Burd <[email protected]>
To: PostgreSQL Hackers <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Nathan Bossart <[email protected]>
Subject: Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention
Date: Mon, 27 Apr 2026 09:26:12 -0400
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>


On Sat, Apr 25, 2026, at 4:08 PM, Greg Burd wrote:
> Hello hackers,

Hi again, attached is v2:

0001 - unchanged, batches clock-sweep to reduce contention
0002 - changed ComputeClockBatchSize() such that non-NUMA multi-core systems use batches as well and no longer default to batch size 1

Details below...

> A colleague of mine, Jim Mlodgenski, has been poking at NUMA behavior 
> on some of the newer AWS bare-metal instance types (r8i in particular, 
> which exposes 6 NUMA nodes via SNC3 on a 2-socket box), and in the 
> process landed on a very small change to freelist.c that I think is 
> worth showing around.  His patch is attached with some tweaks of my own.
>
> Full disclosure: the exploration that led Jim to this patch idea was 
> done with help from an AI assistant (Kiro); the idea, the benchmarking, 
> and the final shape of the patch are human-driven, but I wanted to be 
> up front about how his investigation started.  Happy to discuss that 
> separately if people want to.
>
> The one-line summary: instead of advancing nextVictimBuffer one buffer 
> at a time via pg_atomic_fetch_add_u32, each backend claims a batch of 
> 64 consecutive buffer IDs from the shared hand and then iterates them 
> privately.  Global sweep order is preserved -- every buffer is still
> visited exactly once per complete pass -- but the atomic contention on 
> that one cache line drops by roughly the batch size.
>
>
> Why this matters
> ----------------
>
> On multi-socket boxes under eviction pressure, every backend that needs 
> a victim buffer ends up CAS'ing the same cache line.  On a single 
> socket, a locked RMW on that cache line stays warm in L1/L2 and 
> completes in ~20ns.  On 2+ sockets, the line bounces over QPI/UPI at 
> ~100-200ns per op, and with hundreds of backends running 
> StrategyGetBuffer() concurrently, the line ping-pongs constantly.  It's 
> a textbook NUMA scalability bottleneck, and once shared_buffers is 
> smaller than the working set and the sweep is running continuously, 
> that single atomic is what you hit in a perf profile (elevated 
> bus-cycles, cache-misses on the cache line holding nextVictimBuffer).
>
> Andres pointed at the same spot in his pgconf.eu 2024 talk, and Tomas 
> called it out in the "Adding basic NUMA awareness" thread [1] -- so 
> this isn't news to anyone who's been looking at this area.  What I 
> think is new is a fix that's just this, without any of the surrounding 
> architectural change.
>
> The framing (credit to Jim): the clock hand is doing two jobs.  It 
> *coordinates* backends so they don't redundantly decrement usage_count 
> on the same buffers and so they eventually visit every buffer in the 
> pool exactly once per pass.  It also *serializes* access to the 
> counter.  Coordination is the part we want.  Serialization is the part 
> that's killing us on bigger NUMA boxes.  Batching keeps the 
> coordination and thins out the serialization.
>
>
> How it works
> ------------
>
> Two per-backend statics, MyBatchPos and MyBatchEnd.  When a backend 
> calls ClockSweepTick() and its local batch is exhausted, it does a 
> single fetch-add of CLOCK_SWEEP_BATCH_SIZE (64) against 
> nextVictimBuffer and now owns that range.  Subsequent ticks just bump 
> the local counter.
>
> Wraparound got a small rewrite.  The original code had the backend that 
> crossed NBuffers drive completePasses++ under the spinlock via a CAS 
> loop.  With batching, multiple backends can each land a fetch-add that 
> returns a value >= NBuffers in the same pass, so the logic now is: 
> whoever sees a start >= NBuffers takes the spinlock, re-reads the 
> counter, and if it's still out of range does a single CAS to wrap it 
> and bumps completePasses.  If somebody else already wrapped, we just 
> release and move on.  StrategySyncStart() still sees a consistent 
> (nextVictimBuffer, completePasses) pair.
>
> The batch size is gated on whether we actually have multiple NUMA 
> nodes.  On a single-socket box the atomic is already socket-local, 
> batching just makes backends skip further ahead than they need to, so 
> we fall back to batch size 1 -- which is bit-for-bit the original 
> behavior.  The guard:
>
>     if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
>         ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE, (uint32) NBuffers);
>     else
>         ClockSweepBatchSize = 1;
>
> Min() against NBuffers covers the small-shared_buffers corner so a 
> batch never wraps the pool multiple times in one claim.

Thinking more about this approach led me to believe that this non-NUMA default is wrong and induces overhead for a very common case.

> Does batching mess up the meaning of usage_count?
> --------------------------------------------------
>
> Short answer: no.  I want to walk through this because it was my first 
> concern too, and I think it's the question that will come up most on 
> review.
>
> The clock sweep's usage_count is an access-frequency approximation 
> measured in units of *complete passes*.  A buffer with usage_count = N 
> survives N passes without a re-pin.  The semantic meaning lives at pass 
> granularity, not at individual-buffer granularity.
>
> What batching changes: intra-pass temporal ordering.  Without batching, 
> with N backends sweeping, decrements are interleaved -- backend A hits 
> B[0], backend B hits B[1], backend C hits B[2].  With batching, backend 
> A hits B[0..63] in a tight local burst, then backend B hits B[64..127], 
> etc.  The 64-buffer chunks are decremented in bursts rather than 
> individually.
>
> Why it doesn't matter:
>
>   1. Every buffer still gets decremented exactly once per complete
>      pass.  The invariant the algorithm actually depends on is
>      untouched.
>
>   2. A buffer's survival window is the time between consecutive
>      passes.  That's milliseconds to seconds under load.  Whether
>      B[0] gets decremented 50us before or 50us after B[63] within
>      the same pass is below the resolution of anything usage_count
>      is trying to measure.
>
>   3. The bgwriter's feedback loop reads (nextVictimBuffer,
>      completePasses, numBufferAllocs) via StrategySyncStart() every
>      ~200ms.  nextVictimBuffer still advances at the same *total*
>      rate (64 per atomic op, but atomic ops happen 1/64 as often).
>      The position it reports can jitter by up to 64 buffers relative
>      to the one-at-a-time case, but BgBufferSync()'s smoothed
>      estimates operate over thousands of buffers per cycle, so the
>      jitter disappears into the averaging.  numBufferAllocs still
>      increments once per allocation.  strategy_delta,
>      smoothed_alloc, smoothed_density, reusable_buffers_est -- all
>      unaffected in any way I can see.
>
> Table form, because it's easier to argue with:
>
>   Property                          | Unpatched      | Batched
>   ----------------------------------+----------------+----------------
>   Buffers visited per pass          | NBuffers       | NBuffers
>   Decrements per buffer per pass    | 1              | 1
>   Eviction threshold                | usage_count==0 | usage_count==0
>   Max survival (passes)             | 6              | 6
>   Decrement ordering within a pass  | interleaved    | chunked
>   bgwriter allocation rate signal   | accurate       | accurate
>   Cross-socket atomic traffic       | 1 per buffer   | 1 per 64
>
> There is one subtle difference worth naming.  When a backend finds a 
> victim at B[5] of its batch, it returns with MyBatchEnd still sitting 
> at B[63].  The next time that backend needs a victim it resumes at 
> B[6], not at wherever the global hand now points.  So the backend 
> drains its batch over multiple StrategyGetBuffer() calls rather than 
> all at once.  Under heavy load, where batches are consumed in 
> microseconds, this is invisible.  Under light load, the implication is 
> that some buffers can sit with slightly stale usage_count for longer 
> than they would have before.  But "light load" means "the sweep is 
> barely moving and nothing wants to evict anyway" -- so the effect
> doesn't show up where it would hurt.
>
> There's also a small positive side-effect: cache locality.  The backend 
> that just touched BufferDescriptor[B[0]] has the adjacent descriptors 
> warm in L1/L2.  Walking B[0..63] locally is cheaper than walking a 
> striped interleaving where each descriptor was last touched by a 
> different core.  I haven't tried to isolate this in perf, but it falls 
> out naturally.
>
>
> Benchmarks
> ----------
>
> Jim ran these; I'm still working on reproducing them locally and will 
> post independent numbers in a follow-up.  All bare metal, Linux, huge 
> pages enabled throughout (more on that below), postmaster pinned to 
> node 0 with `numactl --cpunodebind=0` because otherwise stock TPS 
> varied from 31K to 40K depending on which node the postmaster happened 
> to land on at launch -- worth flagging for anyone trying to reproduce.
>
> Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the 
> working set always spills and the sweep is hot.
>
>   r8i.metal-96xl (384 vCPUs, 2 sockets, 6 NUMA nodes via SNC3):
>
>     pgbench RO:
>       Clients   Stock    Patched   Delta
>       64        31,457   36,353    +16%
>       128       31,678   37,864    +20%
>       256       31,510   37,558    +19%
>       384       31,431   37,464    +19%
>       512       31,329   37,040    +18%
>
>     pgbench RW:
>       Clients   Stock    Patched   Delta
>       64         7,685    7,713     0%
>       128       10,420   10,541    +1%
>       256       12,393   12,463    +1%
>       384       15,317   15,197    -1%
>       512       17,930   17,978     0%
>
>   m6i.metal (128 vCPUs, 2 sockets, Ice Lake):
>     RO +19-20%, RW within noise.
>
>   c8i.metal-48xl (192 vCPUs, 1 socket):
>     Single-socket -> batch_size=1 -> original code path.  No
>     behavioral change.  (I double-checked this one specifically
>     because it's the sanity test for the gate.)
>
>   HammerDB TPC-C on m6i.metal (1000 warehouses):
>     VUs   Stock     Patched   Delta
>     128   358,518   349,787    -2%
>     256   332,098   330,272    -1%
>     384   365,782   377,519    +3%
>     512   370,663   386,526    +4%
>
> No TPC-C regression, which was the thing we were most worried about. An 
> earlier attempt (per-socket partitioned sweep, see below) was -13% on 
> this same workload.
>
> The general shape is: the scaling curve flattens later.  Unpatched, TPS 
> tops out around 128 clients and stays flat up to 512 because backends 
> are spending cycles waiting on the cache line rather than
> doing work.  Patched, the curve keeps rising past the point where 
> unpatched plateaus.
>
> Huge pages caveat: all of the above was run with huge pages on, on 
> large-memory instances (the r8i.96xl has 3TB, so Jim never considered 
> running without them).  We have not characterized the non-huge-pages 
> case.  That's on my list; I don't expect it to change the conclusion, 
> but I shouldn't speak for data I haven't collected.
>
>
> Relationship to Tomas's NUMA series
> -----------------------------------
>
> Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer 
> interleaving across nodes, partitioned freelists, partitioned clock 
> sweep, PGPROC interleaving, and related pieces.  I want to be careful 
> here because I don't think we should frame this patch as competing with 
> that work.
>
> One thing I found striking as I re-read the thread: in the benchmarks 
> Tomas posted later in the series, *most of the benefit comes from 
> partitioning the clock sweep*, and the NUMA memory-placement layer on 
> top sometimes runs slower than partitioning alone.  His own conclusion, 
> quoted roughly: the benefit mostly comes from just partitioning the 
> clock sweep, and it's largely independent of the NUMA stuff; the NUMA 
> partitioning is often slower.
>
> That observation is the thing that makes me think batching is worth 
> considering on its own.  It's going after the same bottleneck Tomas's 
> partitioning addresses, but:
>
>   - without splitting global eviction visibility (which is where
>     cross-partition stealing gets complicated),
>   - without requiring NUMA-aware buffer placement (which has huge
>     page alignment, descriptor-partition-mid-page, and resize
>     complications that are still being worked out in that thread),
>   - without touching PGPROC or bgwriter.
>
> What this patch does *not* do:
>   - place buffers on specific NUMA nodes
>   - partition the freelist
>   - touch PGPROC
>   - add new GUCs
>   - change bgwriter
>
> What this patch *does* do:
>   - target exactly the clock-sweep contention that Tomas's
>     partitioning targets, and reduce it by ~64x, in ~30 lines.
>
> If Tomas's series lands in full, this patch becomes redundant for its 
> primary use case (though even within a partitioned sweep, the 
> per-partition atomic still benefits from batching, so it's arguably a 
> useful primitive either way).  If Tomas's series lands incrementally 
> over several cycles -- which the open items in that thread suggest is 
> the realistic path -- this gets us a real chunk of the multi-socket win 
> now.
>
> This patch is also orthogonal to my earlier thread about removing the 
> freelist entirely [2], but given the proximity to that code Jim agreed 
> that I could propose/steward it here on the list for consideration.
>
>
> Open questions / things I'd like feedback on
> --------------------------------------------
>
> - Batch size.  64 is a round number that worked well in testing, but
>   Nathan raised the reasonable point that on small shared_buffers
>   with high concurrency, a fixed 64 could be unfortunate.  Options:
>   scale with shared_buffers (Min(64, NBuffers / N) for some N), scale
>   with max_connections, keep it fixed but let operators tune it, or
>   make it a function of NUMA node count.  I don't have a strong
>   opinion yet; the Min(batch, NBuffers) cap covers the "obviously
>   wrong" corner but doesn't speak to the "several hundred backends
>   on a few-MB shared_buffers" shape.  Numbers/ideas/proposals welcome.
>
> - NUMA detection.  The gate uses pg_numa_init() /
>   pg_numa_get_max_node().  On systems where libnuma isn't available,
>   or where get_mempolicy is blocked (some container configurations),
>   we fall back to batch size 1.  That's safe but it misses the
>   "single socket, many cores, still benefits from fewer atomics"
>   case.  Might be worth a way to force-enable, or batching on all
>   systems with a smaller batch size when single-socket.  I'd like to
>   measure before deciding.
>
> - Eviction pattern on reads.  Nathan also flagged that with batching,
>   the buffers a backend ends up pinning in one StrategyGetBuffer()
>   call will tend to be contiguous in buffer-id space rather than
>   scattered, which is a different allocation pattern than today.
>   The usage_count analysis above says this is benign, but if anyone
>   has an intuition for a workload where this would be observable
>   (e.g., something that cares about the mapping between buffer-id
>   and relation locality), I'd like to hear it.
>
> - nextVictimBuffer wraparound.  The current code has a mild overflow
>   concern papered over with "highly unlikely and wouldn't be
>   particularly harmful".  With batching this is no worse than before,
>   but if we're already touching this function, it might be worth
>   thinking about whether to tighten it up in the same patch or a
>   follow-up.
>
> - Should the non-NUMA value for this be derived from core counts that
>   imply L1/L2 cache layouts or simply default to 8 rather than 1 to
>   realize some benefit?

So, I'm answering my own question here.  Yes, it should.  Ideas below.

> - Should there be a postgresql.conf setting for this that takes
>   precedence?
>
>
> I'll run the non-huge-pages variant, reproduce the r8i numbers, poke at 
> the small-shared_buffers corner, and post perf stat output showing the 
> atomic/cache-miss deltas over the next few days.  In the meantime, 
> eyeballs and skepticism welcome -- I would especially welcome comments 
> from Andres, who's been in this code recently, and from Tomas, whose 
> series has the most overlap.
>
> I realize that we're past feature freeze and working on release notes 
> for v19, so the chances of merging this are slim to none.  I think this 
> could be considered a "performance bug fix for NUMA systems" in this 
> release, but that is stretching it a bit.  It is a big ask at this 
> stage to land a change like this.
>
> best.
>
> -greg
>
> [1] 
> https://www.postgresql.org/message-id/[email protected]
> [2] 
> https://www.postgresql.org/message-id/[email protected]
> Attachments:
> * v1-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch

ComputeClockBatchSize() has two phases: select a base batch from hardware topology, then cap it to prevent over-claiming.

Phase 1: Base batch from topology

  int ncpus = pg_get_online_cpus();
  int numa_nodes = (pg_numa_init() != -1) ? pg_numa_get_max_node() + 1 : 1;

  if (numa_nodes > 1)        base_batch = 64;
  else if (ncpus > 16)       base_batch = 32;
  else if (ncpus > 8)        base_batch = 16;
  else if (ncpus > 4)        base_batch = 8;
  else                       base_batch = 1;

The reasoning at each tier:

  - NUMA (multi-socket): Atomic ops cross the interconnect (QPI/UPI/Infinity
    Fabric). Round-trip latency is ~100-300ns vs ~10-40ns intra-socket.
    Batch=64 amortizes that heavily.

  - >16 cores, single socket: Still significant L3 contention, many cores
    competing for the same cache line. Batch=32 cuts atomic ops by 32x.

  - 9-16 cores: Moderate contention. Batch=16.

  - 5-8 cores: Light contention. Batch=8.

  - <=4 cores: Almost no contention. Batch=1 (no batching). The overhead of 
    batching logic isn't worth it, and there's a fairness tradeoff - batching
    means one backend "owns" a range of buffers temporarily, which matters 
    more when there are few buffers per backend.

Phase 2: Cap to prevent over-claiming

  max_batch = (MaxBackends > 0)
      ? pool_nbuffers / (2 * MaxBackends)
      : pool_nbuffers / 200;
  if (max_batch < 1)
      max_batch = 1;

  return Min(base_batch, Min(max_batch, pool_nbuffers));

The cap ensures that if every backend simultaneously claims a batch, the total claimed doesn't exceed half the pool:

  batch_size * MaxBackends <= pool_nbuffers / 2

Why half? If all backends claimed the entire pool simultaneously, they'd each be sweeping overlapping ranges, thus wasting work and defeating the purpose. Keeping total claims under 50% of the pool means at any instant, at most half the buffers are "in flight" being evaluated by backends, and the other half are available for normal operation.

For a small dynamic pool (say 4096 buffers with MaxBackends=200), the cap computes to 4096 / 400 = 10, which overrides any larger base_batch. For the default pool with shared_buffers = 8GB (1M buffers) and MaxBackends=200, the cap is 1000000 / 400 = 2500 which is well above the max base_batch of 64, so the base_batch wins.

The pool_nbuffers floor at the end handles the degenerate case of a pool smaller than the batch size.

The Tradeoff

Larger batches reduce atomic contention but increase sweep unevenness, one backend might sweep through "cold" buffers while another's batch happens to land on "hot" ones. The tiered approach balances this: batch aggressively only when the hardware topology makes contention the dominant cost (NUMA, many-core), and stay conservative on small systems where fairness matters more.


I think this is better because:

1. The original patch only batched on multi-socket NUMA systems. The new algorithm also provides atomic contention benefits on large single-socket systems (>16 cores) where L3 cache contention matters.

2. Conservative on small systems: Systems with ≤4 cores get batch_size=1 (original behavior) since batching overhead outweighs contention benefits and fairness matters more.

3. Prevents pathological over-claiming: The cap mechanism prevents scenarios where many backends claim huge batches relative to a small buffer pool.


Based on the algorithm, here's what different systems would get:

        System         CPUs   NUMA      Total RAM   Shr Buf   Batch Size  Atomic Reduction                                                                          
  ==================  ====  ========  ===========  ========  ==========  ================                                                                           
  r8i.metal-96xl       384     multi      3072GB   2457.6GB          64    64x                                                                                      
  m6i.metal            128     multi       512GB    409.6GB          64    64x                                                                                      
  c8i.metal-48xl       192  1 socket       192GB    153.6GB          32    32x                                                                                      
  Large server          64     multi       256GB    204.8GB          64    64x
  Medium server         32  1 socket        64GB     51.2GB          32    32x                                                                                      
  Small server          16  1 socket        32GB     25.6GB          16    16x                                                                                      
  Developer machine      8  1 socket        16GB     12.8GB           8    8x
  Small VM               4  1 socket         4GB      3.2GB           1    no change
  Overloaded VM          8  1 socket         4GB      3.2GB           8    8x

best.

-greg



Attachments:

  [application/octet-stream] v2-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch (8.2K, 2-v2-0001-Reduce-clock-sweep-atomic-contention-by-claiming-.patch)
  download | inline diff:
From bdcf90fbd89a0aec397a3d57224ae732959733f9 Mon Sep 17 00:00:00 2001
From: Greg Burd <[email protected]>
Date: Sat, 25 Apr 2026 15:52:36 -0400
Subject: [PATCH v2 1/2] Reduce clock-sweep atomic contention by claiming
 buffers in batches

StrategyGetBuffer() advances nextVictimBuffer via
pg_atomic_fetch_add_u32(..., 1) on every tick.  On multi-socket
systems the cache line holding the counter has to travel over the
interconnect on each operation, pushing a sweep tick from ~20ns (the
same-socket case) into the ~100-200ns range.  With hundreds of
concurrent backends under eviction pressure, that one cache line
becomes the dominant cost in the sweep, visible as elevated
bus-cycles and cache-misses in perf profiles.

Each backend now claims a range of CLOCK_SWEEP_BATCH_SIZE (64)
consecutive buffer IDs with a single fetch-add and iterates through
them privately.  The sweep still advances through the pool in order,
each buffer is still visited exactly once per complete pass, and
usage_count is still decremented exactly once per buffer per pass;
the meaning of usage_count as "how many complete passes a buffer
survives without a re-pin" is preserved.  What changes is the
temporal ordering of decrements within a single pass, which the
algorithm does not depend on.

Wraparound handling is adjusted: with batching, multiple backends
can each see their fetch-add return a value past NBuffers within
the same pass.  Any such backend takes buffer_strategy_lock,
re-reads the counter, and if it is still out of range wraps it with
a single CAS and increments completePasses.  StrategySyncStart()
continues to see a consistent (nextVictimBuffer, completePasses)
pair.

Batching is only useful when the atomic is actually contended
across nodes, so it is applied only when libnuma reports more than
one node (pg_numa_get_max_node() >= 1); otherwise the batch size
stays at 1 and the code path matches master bit-for-bit.  The batch
is also capped at NBuffers so a claim cannot wrap the pool more
than once.

Co-Authored-by: Jim Mlodgenski <[email protected]>
Co-Authored-by: Greg Burd <[email protected]>
---
 src/backend/storage/buffer/freelist.c | 136 ++++++++++++++++++--------
 1 file changed, 94 insertions(+), 42 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index fdb5bad7910..e86ed1f7da0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -22,6 +22,7 @@
 #include "storage/proc.h"
 #include "storage/shmem.h"
 #include "storage/subsystems.h"
+#include "port/pg_numa.h"
 
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
 
@@ -100,68 +101,101 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
 static void AddBufferToRing(BufferAccessStrategy strategy,
 							BufferDesc *buf);
 
+/*
+ * Number of buffer IDs to claim from the shared clock hand at once.
+ * Larger values reduce contention on the shared atomic.  With a batch
+ * size of 64, concurrent backends sweep non-overlapping chunks of 64
+ * buffers rather than interleaving one buffer at a time.  The global
+ * sweep order is preserved — each buffer is still visited exactly once
+ * per complete pass.
+ */
+#define CLOCK_SWEEP_BATCH_SIZE 64
+
+/*
+ * Per-backend state for batched clock sweep.
+ */
+static uint32 MyBatchPos = 0;	/* next buffer within batch */
+static uint32 MyBatchEnd = 0;	/* one past last buffer in batch */
+
+/*
+ * Effective batch size for the clock sweep, computed once at startup.
+ * On non-NUMA systems (single socket, no libnuma, or containers blocking
+ * get_mempolicy), this is 1 -- the original one-at-a-time behavior.
+ * On multi-node NUMA systems, this is Min(CLOCK_SWEEP_BATCH_SIZE, NBuffers)
+ * to reduce cross-socket atomic contention on nextVictimBuffer.
+ */
+static uint32 ClockSweepBatchSize = 1;
+
+static inline uint32
+EffectiveBatchSize(void)
+{
+	return ClockSweepBatchSize;
+}
+
 /*
  * ClockSweepTick - Helper routine for StrategyGetBuffer()
  *
- * Move the clock hand one buffer ahead of its current position and return the
- * id of the buffer now under the hand.
+ * Return the next buffer to consider for eviction.  Backends claim batches
+ * of consecutive buffer IDs from the shared clock hand, then iterate through
+ * them locally without further atomic operations.  This preserves the global
+ * sweep order while reducing cross-socket contention on the shared counter.
  */
 static inline uint32
 ClockSweepTick(void)
 {
 	uint32		victim;
 
-	/*
-	 * Atomically move hand ahead one buffer - if there's several processes
-	 * doing this, this can lead to buffers being returned slightly out of
-	 * apparent order.
-	 */
-	victim =
-		pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
-	if (victim >= NBuffers)
+	if (MyBatchPos >= MyBatchEnd)
 	{
-		uint32		originalVictim = victim;
-
-		/* always wrap what we look up in BufferDescriptors */
-		victim = victim % NBuffers;
-
 		/*
-		 * If we're the one that just caused a wraparound, force
-		 * completePasses to be incremented while holding the spinlock. We
-		 * need the spinlock so StrategySyncStart() can return a consistent
-		 * value consisting of nextVictimBuffer and completePasses.
+		 * Claim a new batch from the shared clock hand.  This is the only
+		 * atomic operation per batch, reducing contention by the batch size.
 		 */
-		if (victim == 0)
+		uint32		start;
+		uint32		batch_size = EffectiveBatchSize();
+
+		start = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer,
+										batch_size);
+
+		if (start >= (uint32) NBuffers)
 		{
-			uint32		expected;
-			uint32		wrapped;
-			bool		success = false;
+			start = start % NBuffers;
 
-			expected = originalVictim + 1;
+			/*
+			 * If the counter has grown beyond NBuffers, try to wrap it back.
+			 * We must hold the spinlock so StrategySyncStart() can read
+			 * nextVictimBuffer and completePasses consistently.
+			 *
+			 * Multiple backends may enter this section concurrently. After
+			 * acquiring the spinlock, re-read the counter: if another backend
+			 * already wrapped it below NBuffers, we're done.
+			 */
+			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
-			while (!success)
 			{
-				/*
-				 * Acquire the spinlock while increasing completePasses. That
-				 * allows other readers to read nextVictimBuffer and
-				 * completePasses in a consistent manner which is required for
-				 * StrategySyncStart().  In theory delaying the increment
-				 * could lead to an overflow of nextVictimBuffers, but that's
-				 * highly unlikely and wouldn't be particularly harmful.
-				 */
-				SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
-				wrapped = expected % NBuffers;
+				uint32		current;
+				uint32		wrapped;
 
-				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
-														 &expected, wrapped);
-				if (success)
-					StrategyControl->completePasses++;
-				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+				current = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
+				if (current >= (uint32) NBuffers)
+				{
+					wrapped = current % NBuffers;
+					if (pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
+													   &current, wrapped))
+						StrategyControl->completePasses++;
+				}
 			}
+
+			SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 		}
+
+		MyBatchPos = start;
+		MyBatchEnd = start + batch_size;
 	}
+
+	victim = MyBatchPos % NBuffers;
+	MyBatchPos++;
+
 	return victim;
 }
 
@@ -408,6 +442,24 @@ StrategyCtlShmemInit(void *arg)
 
 	/* No pending notification */
 	StrategyControl->bgwprocno = -1;
+
+	/*
+	 * Determine the effective clock-sweep batch size.
+	 *
+	 * On multi-node NUMA systems, claiming batches of buffers from the shared
+	 * clock hand reduces cross-socket contention on the atomic counter.  On
+	 * single-socket systems, batching provides no benefit (the atomic is
+	 * already socket-local) and just causes backends to skip buffers, so we
+	 * use batch size 1 for the original behavior.
+	 *
+	 * pg_numa_init() returns -1 when NUMA is unavailable.
+	 * pg_numa_get_max_node() returns 0 for a single NUMA node.
+	 */
+	if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
+		ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE,
+								  (uint32) NBuffers);
+	else
+		ClockSweepBatchSize = 1;
 }
 
 
-- 
2.50.1 (Apple Git-155)



  [application/octet-stream] v2-0002-Improve-clock-sweep-batch-sizing-with-CPU-aware-a.patch (5.0K, 3-v2-0002-Improve-clock-sweep-batch-sizing-with-CPU-aware-a.patch)
  download | inline diff:
From dce473e1a246d877f0f2313ba7a29a4d0e35dc27 Mon Sep 17 00:00:00 2001
From: Greg Burd <[email protected]>
Date: Mon, 27 Apr 2026 08:25:40 -0400
Subject: [PATCH v2 2/2] Improve clock sweep batch sizing with CPU-aware
 algorithm
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace simple NUMA-only batch sizing with a tiered approach:

- NUMA systems (multi-socket): batch=64 (high interconnect latency)
- Single socket >16 cores: batch=32 (L3 cache contention)
- Single socket 9-16 cores: batch=16 (moderate contention)
- Single socket 5-8 cores: batch=8 (light contention)
- Single socket ≤4 cores: batch=1 (no batching overhead)

Also adds over-claiming protection: batch_size × MaxBackends ≤ pool_size/2
to ensure total claimed buffers stay under 50% of the pool.

This provides atomic contention benefits on large single-socket systems
while maintaining the original behavior on small systems where fairness
matters more than throughput.

Authored-by: Greg Burd <[email protected]>
---
 src/backend/storage/buffer/freelist.c | 91 ++++++++++++++++++++++-----
 1 file changed, 77 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e86ed1f7da0..476d7f420a5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -24,6 +24,10 @@
 #include "storage/subsystems.h"
 #include "port/pg_numa.h"
 
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+
 #define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
 
 
@@ -61,6 +65,8 @@ static BufferStrategyControl *StrategyControl = NULL;
 
 static void StrategyCtlShmemRequest(void *arg);
 static void StrategyCtlShmemInit(void *arg);
+static int	pg_get_online_cpus(void);
+static uint32 ComputeClockBatchSize(int pool_nbuffers);
 
 const ShmemCallbacks StrategyCtlShmemCallbacks = {
 	.request_fn = StrategyCtlShmemRequest,
@@ -411,6 +417,69 @@ StrategyNotifyBgWriter(int bgwprocno)
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
+/*
+ * pg_get_online_cpus -- get the number of online CPU cores
+ */
+static int
+pg_get_online_cpus(void)
+{
+#ifdef _SC_NPROCESSORS_ONLN
+	long		ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+
+	if (ncpus > 0)
+		return (int) ncpus;
+#endif
+	/* Fallback if sysconf is unavailable or fails */
+	return 1;
+}
+
+/*
+ * ComputeClockBatchSize -- compute the effective clock-sweep batch size
+ *
+ * The function has two phases: select a base batch from hardware topology,
+ * then cap it to prevent over-claiming.
+ *
+ * Phase 1: Base batch from topology
+ * - NUMA (multi-socket): batch=64 (high cross-socket latency)
+ * - >16 cores, single socket: batch=32 (L3 contention)
+ * - 9-16 cores: batch=16 (moderate contention)
+ * - 5-8 cores: batch=8 (light contention)
+ * - <=4 cores: batch=1 (no batching overhead)
+ *
+ * Phase 2: Cap to prevent over-claiming
+ * - Ensure batch_size * MaxBackends <= pool_nbuffers / 2
+ * - Keeps total claims under 50% of the pool
+ */
+static uint32
+ComputeClockBatchSize(int pool_nbuffers)
+{
+	int			ncpus = pg_get_online_cpus();
+	int			numa_nodes = (pg_numa_init() != -1) ? pg_numa_get_max_node() + 1 : 1;
+	uint32		base_batch;
+	uint32		max_batch;
+
+	/* Phase 1: Base batch from topology */
+	if (numa_nodes > 1)
+		base_batch = 64;
+	else if (ncpus > 16)
+		base_batch = 32;
+	else if (ncpus > 8)
+		base_batch = 16;
+	else if (ncpus > 4)
+		base_batch = 8;
+	else
+		base_batch = 1;
+
+	/* Phase 2: Cap to prevent over-claiming */
+	max_batch = (MaxBackends > 0)
+		? pool_nbuffers / (2 * MaxBackends)
+		: pool_nbuffers / 200;
+	if (max_batch < 1)
+		max_batch = 1;
+
+	return Min(base_batch, Min(max_batch, (uint32) pool_nbuffers));
+}
+
 
 /*
  * StrategyCtlShmemRequest -- request shared memory for the buffer
@@ -444,22 +513,16 @@ StrategyCtlShmemInit(void *arg)
 	StrategyControl->bgwprocno = -1;
 
 	/*
-	 * Determine the effective clock-sweep batch size.
+	 * Compute the effective clock-sweep batch size based on hardware
+	 * topology.
 	 *
-	 * On multi-node NUMA systems, claiming batches of buffers from the shared
-	 * clock hand reduces cross-socket contention on the atomic counter.  On
-	 * single-socket systems, batching provides no benefit (the atomic is
-	 * already socket-local) and just causes backends to skip buffers, so we
-	 * use batch size 1 for the original behavior.
-	 *
-	 * pg_numa_init() returns -1 when NUMA is unavailable.
-	 * pg_numa_get_max_node() returns 0 for a single NUMA node.
+	 * This uses a tiered approach: larger batches on NUMA systems and
+	 * many-core single-socket systems where atomic contention is high,
+	 * smaller batches or no batching on few-core systems where fairness
+	 * matters more. The batch size is also capped to prevent over-claiming
+	 * when there are many backends relative to the buffer pool size.
 	 */
-	if (pg_numa_init() != -1 && pg_numa_get_max_node() >= 1)
-		ClockSweepBatchSize = Min(CLOCK_SWEEP_BATCH_SIZE,
-								  (uint32) NBuffers);
-	else
-		ClockSweepBatchSize = 1;
+	ClockSweepBatchSize = ComputeClockBatchSize(NBuffers);
 }
 
 
-- 
2.50.1 (Apple Git-155)



view thread (4+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox