Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w54as-002rhV-2x for pgsql-hackers@arkaria.postgresql.org; Tue, 24 Mar 2026 16:27:03 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w54ar-007rq8-1R for pgsql-hackers@arkaria.postgresql.org; Tue, 24 Mar 2026 16:27:01 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w54aq-007rpz-1y for pgsql-hackers@lists.postgresql.org; Tue, 24 Mar 2026 16:27:01 +0000 Received: from fout-b4-smtp.messagingengine.com ([202.12.124.147]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1w54an-00000000uEo-2D1K for pgsql-hackers@lists.postgresql.org; Tue, 24 Mar 2026 16:27:00 +0000 Received: from phl-compute-03.internal (phl-compute-03.internal [10.202.2.43]) by mailfout.stl.internal (Postfix) with ESMTP id 8F03D1D000AB; Tue, 24 Mar 2026 12:26:55 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-03.internal (MEProxy); Tue, 24 Mar 2026 12:26:55 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm1; t=1774369615; x=1774456015; bh=6wyiBkdOgI RKbOAMk06Zy37aQG3EC3/GspKm2KcyhJs=; b=XEhEycBcjjp8vLCqSUufRXHTcm 5LX5i1sL2YICQ0AYO9IWB9YKpvTaoeLmlkPU3Ucbeq18DAoFUWxtsrXaLzYTzmgY l7lmhsDbv6kNwS3PQxhLKm7Ej/ZNFjLay6RjnBzpC/3G794a9J85co4bSeHIf0qL GyeUJzY3SL3B4kpAjQK3g2oku3vtv1WVWryYsGdbkCCi1QBpynr6kxw/A7uo6hRl TxjfIodCJtig4nEKaLoC6QwVbWQ5P4st18TXEgDB9ceh7oxJ1kZox8OiI/ycTpL1 VmPKzy6uD5L5QOTjX1nlJUkBNW99vwQ2UpsV+IHgf+ewN8kzAS+Lxprp9muA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t= 1774369615; x=1774456015; bh=6wyiBkdOgIRKbOAMk06Zy37aQG3EC3/GspK m2KcyhJs=; b=kbJ0+xWVZUbOszLvAAIgXSGf5nY3seJKbDuj7uufsHlvXt6MnL4 xW2VClSk4V1YZRFXqkmEhqkiYGjmiHM+qMqkornruD6UDe5Y0TQvkBjFIQNWwXqq A8ERIw43aB6LZel8GF4gEClNT/5/0DG/kb0mobdJTWKU3gjEw6bIfAvGrsK5WmcM SVlDShrMpadK7WNdsbG0zElxisw947I5L3rhjHVelxoOtwB9Yhel1UpxcTv7YQ3b +kib8aQD6wFbFUGmS5IuK2Q/JHEO9/b+UwYtP5q8Ch9VOKItZJNVOaIEjk5bih8M nVdWUrOWgDjLJ7ak/8b8GDqvVdXssd++Qyg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgdefvddvtdeiucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghs ucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrth htvghrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefh ueegnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprg hnughrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopeduuddpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepkh hnihiihhhnihhksehgrghrrhgvthdrrhhupdhrtghpthhtohepsgihrghvuhiikedusehg mhgrihhlrdgtohhmpdhrtghpthhtohepughilhhiphgsrghlrghuthesghhmrghilhdrtg homhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhm pdhrtghpthhtohepohdrrghlvgigrghnughrvgdrfhgvlhhiphgvsehgmhgrihhlrdgtoh hmpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphht thhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehpgh hsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 24 Mar 2026 12:26:53 -0400 (EDT) Date: Tue, 24 Mar 2026 12:26:53 -0400 From: Andres Freund To: Peter Geoghegan Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-03-21 19:01:44 -0400, Peter Geoghegan wrote: > I can immediately act on most of what you've said here. I'm planning > to commit the first patch (and maybe the hash index fake LSN patch) in > the next couple of days, ahead of posting a new v17 of the patch set. > You can expect any item I reply to with "fixed" or similar to be in > that version. Other items might not be addressed in v17 -- generally > because I require more context or feedback to act. Cool. > > > This is preparatory work for an upcoming commit that will need xs_blk > > > to manage buffer pin transfers between the scan and the executor slot. > > > > A subsequent commit adds an earlier ExecClearTuple(slot) to make the buffer > > refcount decrement cheaper (due to hitting the one-element cache in > > bufmgr.c). I wonder if it's worth pulling that into this commit? Mostly to > > make that larger commit smaller. > > Is it really worth doing that without also doing the > xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff? We need batches > to do the latter. I see perf benefits from it alone, yes. > v17 will do it the other way around: it'll delay adding the > ExecClearTuple(slot) as well as the > xs_lastinblock/ExecStorePinnedBufferHeapTuple stuff until a *later* > commit. That seems more logical to me. Shrugh, also works. > > The commit message doesn't mention that this affects ammarkpos/amrestrpos. > > It does not. But FWIW there's a prominent "Note" about it in the SGML docs. Approximately nobody looking at the commit to see what they need to change will see that... > > > + #define HEAP_BATCH_VIS_CHECKED 0x01 /* checked item in VM? */ > > > + #define HEAP_BATCH_VIS_ALL_VISIBLE 0x02 /* block is known all-visible? */ > > > > So we only 2 out of 8 bits. I guess it's probably not worth doing bit shift > > stuff. But still curious if you tried? > > No, I didn't try. I doubt saving space is worthwhile, since we'll need > a relatively large allocation for batches used during index-only scans > regardless. Seems fine to not care for now. But, FWIW, the motivating reason wouldn't be to to really save memory, it'd be to make it more likely the data fits into a higher level of the cache. > > > +typedef struct IndexScanBatchData > > > +{ > > > + XLogRecPtr lsn; /* index page's LSN */ > > > > Still doubtful this should be in generic infra (together with > > indexam_util_batch_unlock()). > > We're now advertising that indexam_util_batch_unlock is optional, and > that index AMs can go there own way when needed. Fair enough. > > > + > > > + /* > > > + * heap blocks fetched counts (incremented by index_getnext_slot calls > > > + * within table AMs, though only during index-only scans) > > > + */ > > > + uint64 nheapfetches; > > > } IndexScanInstrumentation; > > > > s/heap/table/ for anything new imo. > > Usually I'd agree, but here we're using the same name as the one > presented in EXPLAIN ANALYZE. IMV this should match that. (Maybe the > EXPLAIN ANALYZE output should change, and this field name along with > it, but that's another discussion entirely.) I think if we add it into more and more places it'll get harder and harder to eventually fix... > > > + * Note on Memory Ordering Effects > > > + * ------------------------------- > > > ISTM that moving this comment here is hiding it too much, I'm pretty sure > > there are other AMs using visibilitymap out there. Perhaps we should just > > move it to the top of visibilitymap.c? > > I think that the most natural place for it in visibilitymap.c is just > above visibilitymap_get_status. They'll be moved over there in the > next revision. Agreed. > > > +static pg_attribute_hot IndexScanBatch > > > +heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction, > > > + IndexScanBatch priorBatch, BatchRingItemPos *pos) > > > +{ > > > > +static pg_attribute_hot pg_attribute_always_inline ItemPointer > > > +heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan, > > > + ScanDirection direction, bool *all_visible) > > > +{ > > > A lot of this also seems like it should be generic code. Where it actually > > starts to be heapam specific is in heapam_batch_return_tid() - and there's two > > calls to that. So maybe there should be generic helper for the bulk of > > heapam_batch_getnext_tid() that's called by the heapam version, which either > > returns a NULL upwards or does a single heapam_batch_return_tid(). > > I agree that it might very well make sense to break down > heapam_batch_getnext and heapam_batch_getnext_tid into inline > functions that allow for some amount of code reuse. Code reuse and making it easier for other AMs to adapt this... > But I'm not going to have time to fix that before the next revision (too > much performance validation is required). Fair. > > I also suspect it'd be worth creating a new heapam.c file for this new > > code. heapam_index.c or such. > > I had thought about that myself. How would that be structured, in > terms of the commits? I'm imagining something like heapam_iscan.c or heapam_indexfetch.c or such. I'd introduce it by adding it in a commit that moves heap_hot_search_buffer(), and heapam_index_fetch_{begin,reset,end}() into it. Moving heap_hot_search_buffer() into the same file will be nice because it'll allow partial inlining of it into some really performance sensitive functions. > It's pretty clear that we'd have to move some existing heapam_handler.c code > as part of this whole process (e.g., heapam_index_fetch_tuple). > I think that it would likely be easiest if we added an extra commit, > right after "heapam: Track heap block in IndexFetchHeapData using > xs_blk" and right before "Add interfaces that enable index > prefetching". This would be a strictly mechanical commit that moved > the code to the new file without adding anything else. Obviously agreed. I think we should move the seq/tid scan stuff into its own file too, but that's obviously a separate thread. > > > @@ -281,7 +280,23 @@ index_beginscan(Relation heapRelation, > > > */ > > > scan->heapRelation = heapRelation; > > > scan->xs_snapshot = snapshot; > > > + scan->MVCCScan = IsMVCCLikeSnapshot(snapshot); > > > > Elsewhere you have an xxx comment about making sure the snapshot is > > pushed/registered - seems it should be here, not there... > > That comment was copied from index_getnext_tid (it's still there in > the patch/hasn't been moved). This was based on the theory that it > made just as much sense in the new code paths, which generally won't > call index_getnext_tid. And, there are numerous other identical > comments, including in procarray.c. Ah, right. > I'm fine with consolidating all this, but I'd prefer it if you gave me > guidance on what to do with all of them. I guess we just need to fix this crap one of these days :( So just ignore what I said on this, I don't see a great solution immediately. > > Any reason this isn't in index_beginscan_internal(), given both > > index_beginscan() and index_beginscan_parallel() need it? I realize you'd > > need to add arguments to index_beginscan_internal(), but I don't see a problem > > with that. Alternatively a helper for this seems like a possibility too. > > Fixed, by moving much more of the initialization done by each variant > (index_beginscan, index_beginscan_bitmap, index_beginscan_parallel) > into index_beginscan_internal itself. Nice. Haven't checked out your new version yet - are you doing that as a separate commit? > > Hm. Would it be worth using a much wider ring position to avoid this kind of > > danger? > > I don't think so. It worked that way a few months ago, but I felt > using a ring buffer with frequently overflowing offsets would be > easier to test. Note that we reset all offsets to 0 to handle a change > in scan direction, so the offsets aren't entirely immutable. Well, a 64bit one would never overflow :) But, jokes aside, fair enough, either there should never be overflows or they should be frequent. > > Couldn't it be that markBatch is one behind scanBatch, in which case wouldn't > > need to throw out all the prefetched batches (in the future commits that do > > prefetching)? > > We can very often use the "scanBatch == markBatch" happy path, since > in practice merge joins rarely need to restore a markPos that's very > far behind scanPos. This is made more likely by some of the nbtree > work from the past 8 years, such as suffix truncation and > deduplication. Equal index tuples naturally avoid spanning multiple > index leaf pages unless it's absolutely unavoidable because there are > too many to fit on one page. > > My understanding is that it'd be rather likely that, if the > > markPos is not the current batch, it'd be in the last batch. > > I agree: cases that cannot use the "scanBatch == markBatch" happy path > are very likely to restore a markBatch for the batch that was most > recently removed from the head of the ring buffer. However, I'd much > rather avoid a special case where we don't remove markBatch from the > batch ring buffer, just in case we restore the mark (which probably > won't happen, since most marks taken are never restored). I get why > you'd ask about this (naturally I thought of it myself), but I just > can't see it paying for itself. > > Consider the costs. The complexity stems from it breaking the > "scanBatch == headBatch" invariant. That would affect pausing. When > the read stream callback pauses, it expects the scan code to consume > all remaining items from scanBatch and then remove scanBatch from the > ring buffer, allowing the read stream to be resumed in passing. Like a > cross-batch restore, pausing is itself a hard-to-hit code path; > testing the intersection of those 2 things would be very tricky. > (There are other reasons why I want to keep this invariant, which I > won't detail now.) > > I also don't see much benefit. We're already comfortably beating the > master branch with merge joins that heavily use mark/restore, due to > improved buffer pin management (particularly with index-only scans). > The need to re-read index pages in these cases is nothing new. We > reset the read stream whenever we restore a mark (even in the happy > path), so there's relatively little risk of the distance to get out of > hand afterwards (the index pages are very likely to still be in > shared_buffers each time). Fair. I did see some large number of index IO page accesses that look like it could be avoided by looking back one batch, but that was an intentional torture query that I can't see as being realistic. > > Do we have decent test coverage of queries changing scan direction? Seems like > > something that could quite easily have bugs. > > You're right to be concerned about potential bugs here. And about the > existing test coverage. > > While the regression tests have limited coverage, my personal test > suite covers many variations. Some of these come from my work on the > nbtree projects included in Postgres 17 and 18. Others were derived > from a stress-testing suite developed by Tomas that generates > cursor-based queries that randomly scroll back and forth, verifying > that the patch agrees with master at every step. > > Improving the core regression tests in this area would be difficult. A > test case like this must focus on one specific problematic page > transition (or a small series of related transitions). Obviously, that > depends on the data being laid out precisely, which depends on many > implementation details. And we usually need quite a bit of data to > tickle the implementation in just the right/wrong way. I don't think we necessarily need the coverage of your full torture test suite in core, but I feel some basic sanity tests really ought to be in the core tests. There's very little, from what I can tell. Even just making sure we have coverage for a index [only] scans going forward and backward, and the same for merge & nestloop joins would be quite a win. > > > +void > > > +tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch) > > > +{ > > > Any reason to not use indexam_util_batch_release() to implement the release? > > Seems nicer to not have two copies of this code. > > Yes: We don't always want to use the batch cache, even when it's > available. In particular, we don't want to do that during simple point > queries (like those from pgbench SELECT), when batches are freed only > as the scan ends. > IOW, when we're called through index_batchscan_end (specifically, when > we're called through index_batchscan_reset when it is called by > index_batchscan_end) we don't want to use the cache. It seemed to make > sense to implement this in a way that didn't require any special > handling from within indexam_util_batch_release (since it's not a > concern of index AMs). I am not really following. Right now there's two close copies of this code: void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch) ... * Try to store caller's batch in this amgetbatch scan's cache of * previously released batches first */ for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++) { if (scan->batchcache[i] == NULL) { /* found empty slot, we're done */ scan->batchcache[i] = batch; return; } } /* * Failed to find a free slot for this batch. We'll just free it * ourselves. */ if (batch->deadItems) pfree(batch->deadItems); pfree(batch_alloc_base(batch, scan)); void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch) { ... /* * Use cache, just like indexam_util_batch_release does it (unless scan is * shutting down) */ if (scan->xs_heapfetch) { for (int i = 0; i < INDEX_SCAN_CACHE_BATCHES; i++) { if (scan->batchcache[i] == NULL) { /* found empty slot, we're done */ scan->batchcache[i] = batch; return; } } } if (batch->deadItems) pfree(batch->deadItems); pfree(batch_alloc_base(batch, scan)); Yes, one of them has the additional "if (scan->xs_heapfetch)" condition, but that hardly seems like a real problem preventing sharing the code. I don't think that "if (scan->xs_heapfetch)" is a particularly obvious signifier of the scan being terminated, FWIW. I'd probably either add a parameter to tableam_util_free_batch() signifying whether caching is desired or add an internal helper with that parameter that's then used internally in indexbatch.c. Greetings, Andres Freund