Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w05aN-001bjg-1M for pgsql-hackers@arkaria.postgresql.org; Tue, 10 Mar 2026 22:29:55 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w05aL-006Qoj-0z for pgsql-hackers@arkaria.postgresql.org; Tue, 10 Mar 2026 22:29:53 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w05aK-006Qob-2t for pgsql-hackers@lists.postgresql.org; Tue, 10 Mar 2026 22:29:53 +0000 Received: from fhigh-a4-smtp.messagingengine.com ([103.168.172.155]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1w05aJ-00000001W6h-1Nox for pgsql-hackers@lists.postgresql.org; Tue, 10 Mar 2026 22:29:52 +0000 Received: from phl-compute-03.internal (phl-compute-03.internal [10.202.2.43]) by mailfhigh.phl.internal (Postfix) with ESMTP id EF151140024C; Tue, 10 Mar 2026 18:29:49 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-03.internal (MEProxy); Tue, 10 Mar 2026 18:29:49 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-transfer-encoding:content-type:content-type:date :date:from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to; s=fm1; t=1773181789; x=1773268189; bh=DiOkfYwHaP7YPw81FUIlrRwTt+zRo9CnwLjaV0+Czzs=; b= uHsIxWa1MKDucd39ebpXSOIGoJ/pmqcnlcV3foprgGV8OT5AMXF/aOkYb3SD2geV H4SlDpevhyll79SqNGM78XUbjrA0HGyPFIk1+jp1jdr3ZaAcWNX7nizLJC7U0u/o k4D9VNwV7uc/6jqDo8iHaXaAmllW8qDBo9aDWYE3cdAIWHPlruw8TG0hVzEZW3Hj Qfz4gHo9sxjMxKucK9x2cjEinMrzWWL2X9cDUtRyqRhhuPP2AAn8h2OFC1Xim6lP h6tui1PQ+rYkd7DiSH9d2gMsDYgitg9j0le0HBMsc6pMbqGYPSPbIVkUIDjTHk5h IiuijFICgRD+L7YHTiEu9g== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t=1773181789; x= 1773268189; bh=DiOkfYwHaP7YPw81FUIlrRwTt+zRo9CnwLjaV0+Czzs=; b=d KS+zvHxZ7o/WCdc7USbsxXZJ2p4v8/FNRjjkJK/nt7Y7hYfm/YVNXK3xDNexp6ml xZ/vPsqVC2lN7t4M/a/8hBD9sIlqJ9E3hbB/f9ycIlHxLpOHPtHP3Of3vAvhw54X gft/v78v0DlMXUsW91tNYTLLB/H//geYp51A5y/ljj4r/nMQiUG8GqKtyn1agm9R 0b6RdDPXpGzq55hrE5Rt+gNInnwGyR1oLt+gL+aFxpw9/0G/Ol0Wvw3WmQa56vgn iyealIMaTO7VlKVlZzkQYRDwLDBb/ozg7XUxKVei+yCMUKPvjFxWEc89lbe8q8fT 1i54Vtu9ANYeFfw3XAXkg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvkedvvddvucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggugfgjsehtkefstddttdejnecuhfhrohhmpeetnhgurhgv shcuhfhrvghunhguuceorghnughrvghssegrnhgrrhgriigvlhdruggvqeenucggtffrrg htthgvrhhnpedtleelvdfgjedvffeiueekfeeuleffhfegfffhgfffkeevueehieehhfei gffhvdenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpe grnhgurhgvshesrghnrghrrgiivghlrdguvgdpnhgspghrtghpthhtohepuddupdhmohgu vgepshhmthhpohhuthdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhope hknhhiiihhnhhikhesghgrrhhrvghtrdhruhdprhgtphhtthhopegshigrvhhuiiekudes ghhmrghilhdrtghomhdprhgtphhtthhopeguihhlihhpsggrlhgruhhtsehgmhgrihhlrd gtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghilhdrtgho mhdprhgtphhtthhopehordgrlhgvgigrnhgurhgvrdhfvghlihhpvgesghhmrghilhdrtg homhdprhgtphhtthhopehrohgsvghrthhmhhgrrghssehgmhgrihhlrdgtohhmpdhrtghp thhtohepthhhohhmrghsrdhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtohepph hgshhqlhdqhhgrtghkvghrsheslhhishhtshdrphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 10 Mar 2026 18:29:49 -0400 (EDT) Date: Tue, 10 Mar 2026 18:29:48 -0400 From: Andres Freund To: Peter Geoghegan Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: References: <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-03-01 15:39:55 -0500, Peter Geoghegan wrote: > On Sat, Feb 28, 2026 at 5:08 PM Andres Freund wrote: > > > I think that it makes perfect sense that the AM table controls this. > > > After all, holding on to buffer pins exists purely for heapam's > > > benefit -- it has nothing to do with index AM considerations. > > > > I think it's right that the table AM controls *when* the pins are released, > > what doesn't seem right that it knows *which* specific buffer being released > > is the right thing and that's done. > > The fact is that the "Index Locking Considerations" section from > "Chapter 63. Index Access Method Interface Definition" in the docs > describes things that only make sense when using heapam, or a table AM > very much like it. Any design derived from the existing one will > likely retain some of that. > > For example, the entire concept of ambuldelete requiring a cleanup > lock is unrelated to index AM considerations. If (say) nbtree only > MVCC snapshot plain index scans and bitmap index scans, then there'd > be no need for btbulkdelete to acquire cleanup locks, and no need for > any scan to ever hold on to a buffer pin. This seems pretty unrelated to my concern. I have a problem with the fact that heapam knows which specific resources need to be held (&released) to prevent concurrency issues during an index only scan. I am *NOT* concerned with there needing to be a pin and heapam triggering the release of that resource. All I am saying is that instead of /* * It's safe to drop the batch's buffer pin as soon as we've resolved the * visibility status of all of its items */ if (allbatchitemvisible && scan->MVCCScan) { ... ReleaseBuffer(batch->buf); batch->buf = InvalidBuffer; } it should be something like /* * It's safe to drop the batch's buffer pin as soon as we've resolved the * visibility status of all of its items */ if (allbatchitemvisible && scan->MVCCScan) { ... index_batchscan_release(scan, batch); } where, unfortunately, index_batchscan_release() has to call back into the indexam to release the buffer pin. > To me it seems natural to make such buffer pins (those that specifically > exist to form an interlock against unsafe concurrent TID recycling by > VACUUM) the responsibility of the table AM -- that's the only place where it > actually matters. That's the main reason the AM table releases the batch's > buffer pin in the current design of index prefetching (though another > important factor is nestloop anti-joins with inner index-only scans, as I > detail below). To me it doesn't seem acceptable to require that index AMs need to have exactly one buffer that needs to be pinned to prevent dangerous reuse. An index AM should be free to a) Prevent reuse by using a resource other than a buffer pin That's not possible in the patch, because we unconditionally do ReleaseBuffer(batch->buf) b) Require multiple buffers to be pinned to prevent reuse That's not possible in the patch, because exactly one buffer is released. > Here are some points I imagine we agree on already: > > * It is definitely a good idea to drop index page buffer pins during > or right after amgetbatch returns. Doing so avoids unintended > interactions with the read stream. Sure. > * As a practical matter it makes sense to be maximally strict, by > requiring (at least during prefetching) that the index AM immediately > releases any buffer pins it acquires (acquires from an `amgetbatch` > call originating from the read stream callback). That way it simply > isn't possible for the read stream to be affected by these buffer pins > in the first place. IOW while a paranoid policy might not be strictly > necessary, it ends up being simpler that way. Sure. > * It is necessary to avoid bulk-loading a batch's visibility > information immediately (before prefetching begins) to prevent new > regressions in things like nestloop anti-joins with an inner > index-only scan. My microbenchmarking shows this clearly. It's a very plausible way, at least. I can think of other ways to prevent regressions, but I don't really see a reason to go for them. > Specifically for index-only scans, the only way to drop a batch/index > page pin at the earliest possible opportunity like this is to bulk > save visibility information for the batch's items before dropping the > index page pin. This avoids concurrent TID recycling races with VACUUM > that could otherwise lead to wrong answers. At the same time, > index-only scans get the usual strong guarantee about not holding on > to buffer pins that could adversely affect the read stream in > unpredictable ways (just like plain index scans). Right. > Here's where you might disagree with me, or at least where I need > greater clarity to actually understand your position: > > I think it follows from what I said that some piece of code (either in > the index AM or the table AM) must be responsible for saving > visibility info for a batch using the visibility map, and only then > releasing the associated buffer pin. I chose to do this in the table > AM, partly because heapam needs the flexibility to *not* release the > batch/index buffer pin right away. ISTM that heapam needs the > authority to choose whether or not to bulk load all visibility info > from each batch immediately. This is necessary for the first batch > returned during an index-only scan with (say) a LIMIT 5, where loading > all of the visibility information immediately would significantly hurt > performance. Agreed. > How should I resolve the tension among all these goals? Is your > concern that heapam knows about buffer pins specifically, as opposed > to an abstract interlock concept? Yes. > > What if the index AM is one outside of > > postgres' buffer pool? > > Then it also cannot support WAL-logging, I think. Where does that > leave the LSN trick? I don't think WAL-logging requires using the buffer pool. It might have at some point, but these days you can just create an rmgr of your own and manage everything that way. Yes, dealing with FPIs will be a tad more work, but it's entirely doable. > > That helps with multiple pins for the same buffer, but not having to pin > > multiple *different* buffers. Sure, the index AM could just hold onto the > > additional buffers until the scan ends or such, but that's not really an > > option. > > Why is it not an option? If we care about releasing one pin soon, why should we not care about the timely release of other pins? > Is this purely because of the potential to > disrupt the read stream's management of the backend's buffer pin > limit? I'm not particularly bothered by a small number of extra buffer pins, particularly for AMs other than nbtree. They won't cause issues in any real world setups. If nbtree kept a slightly larger number of pins around, I'd expect *some* problems, due to us running our tests with very small number of shared buffers. Imagine the fun of occasionally running out of pins, depending on when cache invalidations arrive and trigger catalog accesses. Greetings, Andres Freund