Date: Sun, 8 Mar 2026 13:09:32 -0400
From: Andres Freund
To: Alexandre Felipe
Cc: PostgreSQL Hackers
Subject: Re: Addressing buffer private reference count scalability issue

Hi,

On 2026-03-08 16:09:07 +0000, Alexandre Felipe wrote:
> 2.
>
> Refactoring reference counting: Before starting to change code and
> potentially breaking things I considered prudent to isolate it to limit the
> damage. This code was part of a 8k+ LOC file.

Not allowing at least the fast paths to be inlined will make the most common
cases measurably slower; I've tried that.

> 3.
>
> Using simplehash: Simply replacing the HTAB for a simplehash, and adding
> a new set of macros SH_ENTRY_EMPTY, SH_MAKE_EMPTY, SH_MAKE_IN_USE.

Yea, we'll need to do that. Peter Geoghegan has a patch for it as well.

> Here I assume that the buffer buffer sequence is independent enough from the
> array size, so I use the buffer as the hash key directly, omitting a hash
> function call.

I doubt that that's good enough. The hash table code relies on the bits being
well mixed, but you won't get that with buffer ids.

> 4.
>
> Compact PrivateRefCountEntry: The original implementation used a 4-byte
> key and 8-byte value. Reference count uses 32 bits, and it is unreasonable
> to expect one backend to pin the same buffer 1 billion times. The lock mode
> uses 32 bits but can only assume 4 values. So I packed them in one single
> uint32, giving 30 bits for count and 2 bits for lock mode. This makes the
> entries 8-byte long, on 64-bit CPUs this represents more than a 1/3
> reduction in memory. This makes the array aligned with the 64-bit words,
> copying one entry can be completed in one instruction, and every entry will
> be aligned.
> 5.

I'm rather sceptical that the overhead of needing to shift and mask is worth
it. I suspect we'll also want to add more state for each entry (e.g. I think
it may be worth getting rid of the io-in-progress resowner).

> REFCOUNT_ARRAY_ENTRIES=0: since the simplehash is basically some array
> lookup, it is worth trying to remove it completely and keep only the hash.
> For small values we would be trading a few branches for a buffer % SIZE,
> for the use case of prefetch where pin/unpin in a FIFO fashion, it will
> save an 8-entry array lookup, and some extra data moves.

I doubt that that's ok. In the vast majority of accesses we will have 0-2
buffers pinned. And even when we have pinned more buffers, it's *exceedingly*
common to access the same entry repeatedly (e.g. pin, lock, unlock, unpin);
adding a few cycles to each of those repeated accesses will quickly show up.

> From 56bfdd6d7296397a7b3f0b282e239fdc86d6580d Mon Sep 17 00:00:00 2001
> From: Alexandre Felipe
> Date: Fri, 6 Mar 2026 17:15:44 +0000
> Subject: [PATCH 4/5] Compact PrivateRefCountEntry
>
> The previous implementation used an 8-bite (64-bit) entry to store
> a uint32 count and an uint32 lock mode. That is fine when we store
> the data separate from the key (buffer). But in the simplehash
> {key, value} are stored together, so each entry is 12-bytes.
> This is somewhat awkward as we have to either pad the entry to 16-bytes,
> or the access will be an alternating aligned/misaligned addreses.
>
> However, we are probably not going to need even 16-bits for the count
> and 2-bits is enough for the lock mode. So we can easily pack these
> int to a single uint32.

I wouldn't want to rely on a 16bit pin counter anyway.

> Incrementing/decrementing the count continue the same, just using
> 4 instead of 1, lock mode access will require one or two additional
> bitwise operations.
>
> No bit-shifts are required.

I don't know how that last sentence can be true, given that:

> -struct PrivateRefCountEntry
> +#define PRIVATEREFCOUNT_LOCKMODE_MASK 0x3
> +#define ONE_PRIVATE_REFERENCE 4 /* 1 << 2 */
> +
> +#define PrivateRefCountGetLockMode(d) ((BufferLockMode)((d) & PRIVATEREFCOUNT_LOCKMODE_MASK))
> +#define PrivateRefCountSetLockMode(d, m) ((d) = ((d) & ~PRIVATEREFCOUNT_LOCKMODE_MASK) | (m))
> +#define PrivateRefCountGetRefCount(d) ((int32)((d) >> 2))
> +#define PrivateRefCountIsZero(d) ((d) < ONE_PRIVATE_REFERENCE)

That involves shifts, at least in PrivateRefCountGetRefCount().

Greetings,

Andres Freund
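
For reference, the packed layout from the quoted patch can be reproduced standalone to see which operations avoid a shift and which cannot. This is only a sketch: the `demo_*` names and the `DemoLockMode` enum are hypothetical stand-ins (not the patch's or PostgreSQL's identifiers), and it assumes the lock mode occupies the low two bits of a `uint32_t`, with the count in the remaining 30 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lock modes; only the "fits in 2 bits" property matters here. */
typedef enum
{
	DEMO_LOCK_NONE = 0,
	DEMO_LOCK_SHARE = 1,
	DEMO_LOCK_SHARE_EXCLUSIVE = 2,
	DEMO_LOCK_EXCLUSIVE = 3
} DemoLockMode;

#define DEMO_LOCKMODE_MASK	UINT32_C(0x3)
#define DEMO_ONE_REFERENCE	UINT32_C(4)		/* 1 << 2, folded at compile time */

/* Lock mode lives in the low two bits: a mask is enough, no shift. */
static inline DemoLockMode
demo_get_lockmode(uint32_t d)
{
	return (DemoLockMode) (d & DEMO_LOCKMODE_MASK);
}

static inline uint32_t
demo_set_lockmode(uint32_t d, DemoLockMode m)
{
	assert((uint32_t) m <= DEMO_LOCKMODE_MASK);
	return (d & ~DEMO_LOCKMODE_MASK) | (uint32_t) m;
}

/* Pin/unpin add or subtract a pre-shifted constant: no shift at runtime. */
static inline uint32_t
demo_pin(uint32_t d)
{
	return d + DEMO_ONE_REFERENCE;
}

static inline uint32_t
demo_unpin(uint32_t d)
{
	assert(d >= DEMO_ONE_REFERENCE);	/* must hold at least one reference */
	return d - DEMO_ONE_REFERENCE;
}

/* Reading the actual count back is where the shift is unavoidable. */
static inline uint32_t
demo_get_refcount(uint32_t d)
{
	return d >> 2;
}

/* "No references left" only needs a comparison against the constant. */
static inline int
demo_is_zero(uint32_t d)
{
	return d < DEMO_ONE_REFERENCE;
}
```

In this layout pin, unpin, the zero check and lock mode reads stay shift-free, while any caller that needs the numeric count pays for the shift in `demo_get_refcount()`. Whether that cost is measurable on the fast path is exactly the open question above; the sketch does not settle it.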