Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w3kHC-001Vfp-1H for pgsql-hackers@arkaria.postgresql.org; Sat, 21 Mar 2026 00:33:15 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w3kHA-008YJf-21 for pgsql-hackers@arkaria.postgresql.org; Sat, 21 Mar 2026 00:33:13 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w3kH9-008YJW-1C for pgsql-hackers@lists.postgresql.org; Sat, 21 Mar 2026 00:33:12 +0000 Received: from fhigh-b8-smtp.messagingengine.com ([202.12.124.159]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1w3kH5-00000000GQi-1lNv for pgsql-hackers@lists.postgresql.org; Sat, 21 Mar 2026 00:33:11 +0000 Received: from phl-compute-05.internal (phl-compute-05.internal [10.202.2.45]) by mailfhigh.stl.internal (Postfix) with ESMTP id 6CACB7A012C; Fri, 20 Mar 2026 20:33:04 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-05.internal (MEProxy); Fri, 20 Mar 2026 20:33:04 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm1; t=1774053184; x=1774139584; bh=nU3BlLu4pd R14U6qWb+TmMM87+RVjGgTfB5H4ClqCIU=; b=wJWVSu+V/1DSxm3JObGVQwZPNe JEkve0ivrxb6jAXkqpvgRxkx+lvXrWLaesNAVuc/7HIL3Ao4xHpZjMLwIGhpenUV JCz/GU3E6zpOyCWxpHMIykPfRG5HlhTtWJg+bch1WcYCSkj4P7eQck+n3BxR9Mml epT3u0p/ydX5AphvW/z8BuusJA0miNc//lfp8VoELM77uC60fkzQQEeFj16bUdt3 mSSQJgLnXyOfHCz57Igg1C7jLNcDXCXZJ5cGwUQB/cdzktvNwKRSa6EOsMSJKYnD fUFBtPOEQwLEHyj4MKrOGdbPqxUULiHB/hfZpGlV9uhAueFLNkA8Vdhh5Paw== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t= 1774053184; x=1774139584; bh=nU3BlLu4pdR14U6qWb+TmMM87+RVjGgTfB5 H4ClqCIU=; b=qblTqy+7yh1FEQdVD2OTd+c6JXEYh5AptzG7DAIKtNepxTjKCjG vN8tXjm7jsjGT/0MPI2tRb4qJ8BkFT5yg77kPN/tdIYPT2yu/mBzSl9D7KygNQxd Bq5Ud5b6mH/BFEW0cxuMpkhISNrOl1XYdZY1xK5WlQlrMHPe0PP5fR3pDlzlDTon 94AZKJFNwFYEz9MZ8FC1OAFNtrPqgTZ5trvmcXsRZiQ8lV1myUpTjE/3ThyNqr1I c6njUwhRAXVTUJ7T5cGZaNaM+yR02ES5gYXtD+ubmr3HaTZAhe1eqrLF4yI/0UFG g5luAX1RxIG9Ck8xvi81yu6w9DMkZ8+XaZg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgdefuddufeejucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghs ucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrth htvghrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefh ueegnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprg hnughrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopeduuddpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepkh hnihiihhhnihhksehgrghrrhgvthdrrhhupdhrtghpthhtohepsgihrghvuhiikedusehg mhgrihhlrdgtohhmpdhrtghpthhtohepughilhhiphgsrghlrghuthesghhmrghilhdrtg homhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhm pdhrtghpthhtohepohdrrghlvgigrghnughrvgdrfhgvlhhiphgvsehgmhgrihhlrdgtoh hmpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphht thhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehpgh hsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Fri, 20 Mar 2026 20:33:02 -0400 (EDT) Date: Fri, 20 Mar 2026 20:33:02 -0400 From: Andres Freund To: Peter Geoghegan Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-03-19 13:44:16 -0400, Peter Geoghegan wrote: > The other notable change aims to improve performance for cached > workloads through more aggressive specialization of callback routines > like heapam_index_getnext_slot. There are now a total of 4 versions: > > 1. One for index-only scans + amgetbatch. > 2. One for one for index-only scans + amgettuple. > 3. One for plain index scans + amgetbatch. > 4. One for plain index scans + amgettuple. > > I've also aggressively used pg_attribute_always_inline (without this > specialization, performance ends up worse rather than better). > > There might be some concerns about the distributed costs of increased > code size. I'm not particularly concerned about that - I think it actually might often *increase* the icache hit ratio, because without this specialization every scan has to jump over a fair bit of code not used during that scan. > From 8bcd4eb22a6479302fecaf5088c1f03f43f1f407 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan > Date: Tue, 10 Mar 2026 14:22:25 -0400 > Subject: [PATCH v16 01/17] Make IndexScanInstrumentation a pointer in executor > scan nodes. > > Change the IndexScanInstrumentation fields in IndexScanState, > IndexOnlyScanState, and BitmapIndexScanState from inline structs to > pointers. This avoids adding additional overhead as new fields are > added to IndexScanInstrumentation, at least in the common case where the > instrumentation isn't used (i.e. when the executor node isn't being run > through an EXPLAIN ANALYZE). LGTM > From 997822891f393a13be22c1be2d2d0094a485fddf Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan > Date: Tue, 10 Mar 2026 14:40:35 -0400 > Subject: [PATCH v16 02/17] heapam: Track heap block in IndexFetchHeapData > using xs_blk > > Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that > tracks which heap block is currently pinned in xs_cbuf. > > heapam_index_fetch_tuple now uses xs_blk to determine when buffer > switching is needed, replacing the previous approach that compared > buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call. > The new approach skips the buffer-switching path entirely when the next > TID is on the same heap page, which also means that heap_page_prune_opt > is called exactly once per block (when the block is first pinned). I think that was already almost always the case, due to the if (prev_buf != hscan->xs_cbuf) > This is preparatory work for an upcoming commit that will need xs_blk > to manage buffer pin transfers between the scan and the executor slot. A subsequent commit adds an earlier ExecClearTuple(slot) to make the buffer refcount decrement cheaper (due to hitting the one-element cache in bufmgr.c). I wonder if it's worth pulling that into this commit? Mostly to make that larger commit smaller. > From c83aef317ba452b7631feee10062d92f33a64930 Mon Sep 17 00:00:00 2001 > From: Peter Geoghegan > Date: Tue, 9 Sep 2025 19:50:03 -0400 > Subject: [PATCH v16 03/17] Add interfaces that enable index prefetching. > > -/* mark current scan position */ > -typedef void (*ammarkpos_function) (IndexScanDesc scan); > - > -/* restore marked scan position */ > -typedef void (*amrestrpos_function) (IndexScanDesc scan); > +/* invalidate index AM state that independently tracks scan's position */ > +typedef void (*amposreset_function) (IndexScanDesc scan, > + IndexScanBatch batch); > > /* > * Callback function signatures - for parallel index scans. > @@ -309,10 +320,12 @@ typedef struct IndexAmRoutine > ambeginscan_function ambeginscan; > amrescan_function amrescan; > amgettuple_function amgettuple; /* can be NULL */ > + amgetbatch_function amgetbatch; /* can be NULL */ > + amkillitemsbatch_function amkillitemsbatch; /* can be NULL */ > + amunguardbatch_function amunguardbatch; /* can be NULL */ > amgetbitmap_function amgetbitmap; /* can be NULL */ > amendscan_function amendscan; > - ammarkpos_function ammarkpos; /* can be NULL */ > - amrestrpos_function amrestrpos; /* can be NULL */ > + amposreset_function amposreset; /* can be NULL */ > > /* interface functions to support parallel index scans */ > amestimateparallelscan_function amestimateparallelscan; /* can be NULL */ The commit message doesn't mention that this affects ammarkpos/amrestrpos. Does that imply that index AMs that want to support mark/restore need to switch away from amgettuple? Probably fine, but worth mentioning, I think. > diff --git a/src/include/access/genam.h b/src/include/access/genam.h > index 1a27bf060..6b5c48e5e 100644 > --- a/src/include/access/genam.h > +++ b/src/include/access/genam.h > @@ -96,6 +96,7 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state); > > +/* > + * amgetbatch utilities called by indexam.c (in indexbatch.c) > + */ > +extern void index_batchscan_init(IndexScanDesc scan); > +extern void index_batchscan_reset(IndexScanDesc scan); > +extern void index_batchscan_end(IndexScanDesc scan); > +extern void index_batchscan_mark_pos(IndexScanDesc scan); > +extern void index_batchscan_restore_pos(IndexScanDesc scan); > + > +/* > + * amgetbatch utilities called by table AMs (in indexbatch.c) > + */ > +extern void tableam_util_batch_dirchange(IndexScanDesc scan); > +extern void tableam_util_kill_scanpositem(IndexScanDesc scan); > +extern void tableam_util_free_batch(IndexScanDesc scan, IndexScanBatch batch); > +extern void tableam_util_unguard_batch(IndexScanDesc scan, IndexScanBatch batch); > + > +/* > + * amgetbatch utilities called by index AMs (in indexbatch.c) > + */ > +extern void indexam_util_batch_unlock(IndexScanDesc scan, IndexScanBatch batch, > + Buffer buf); > +extern IndexScanBatch indexam_util_batch_alloc(IndexScanDesc scan); > +extern void indexam_util_batch_release(IndexScanDesc scan, IndexScanBatch batch); > + > #endif /* GENAM_H */ I wonder if these should be in a separate header, a reasonably large number of files include genam.h due to systable_beginscan() etc, and those should not need to know about this stuff. > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > index 9bcf74862..55c9069d6 100644 > --- a/src/include/access/heapam.h > +++ b/src/include/access/heapam.h > @@ -129,10 +129,37 @@ typedef struct IndexFetchHeapData > Buffer xs_cbuf; > BlockNumber xs_blk; > > - /* Current heap block's corresponding page in the visibility map */ > - Buffer xs_vmbuffer; > + /* For index-only scans that must access the visibility map */ > + Buffer xs_vmbuffer; /* visibility map buffer */ > + int xs_vm_items; /* # items to resolve visibility info for */ > + bool xs_lastinblock; /* last TID on this block in current batch? */ > + > } IndexFetchHeapData; FWIW, Melanie is going to use xs_vmbuffer for on-access pruning soon, maybe worth not phrasing it as specific to IOSs as it will need to be revised anyway. > + typedef struct HeapBatchData > + { > + uint8 *visInfo; /* per-item visibility flags, or NULL */ > + } HeapBatchData; > + > + /* > + * Per-item visibility flags stored in HeapBatchData.visInfo array > + */ > + #define HEAP_BATCH_VIS_CHECKED 0x01 /* checked item in VM? */ > + #define HEAP_BATCH_VIS_ALL_VISIBLE 0x02 /* block is known all-visible? */ So we only 2 out of 8 bits. I guess it's probably not worth doing bit shift stuff. But still curious if you tried? > diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h > index da7503c57..74af9efc3 100644 > --- a/src/include/access/nbtree.h > +++ b/src/include/access/nbtree.h > @@ -924,111 +924,27 @@ typedef struct BTVacuumPostingData > > + * Access the btree-private per-batch data from an IndexScanBatch pointer. > + * This follows the standard convention for index AM opaque state: it can be > + * found at a fixed negative offset from the IndexScanBatch pointer. > */ > +static inline BTBatchData * > +BTBatchGetData(IndexScanBatch batch) > { > + return (BTBatchData *) ((char *) batch - MAXALIGN(sizeof(BTBatchData))); > +} I think it maybe worth threading this through a generic helper macro taking IndexScanDesc, IndexScanBatch, and the type of the private patch data that does the pointer math and also can assert that iscan->batch_index_opaque_size == MAXALIGN(sizeof(BTBatchData)) > +/* > + * Data about one batch of items returned by (and passed to) amgetbatch during > + * index scans. > + * > + * Each batch allocation has the following memory layout: > + * > + * [table AM opaque area] <- at -(batch_table_offset) from batch ptr > + * [index AM opaque area] <- at -(batch_index_opaque_size) from batch ptr > + * [IndexScanBatchData] <- the returned pointer > + * [items[maxitemsbatch]] > + * [table AM trailing data] <- e.g. per-item visibility flags > + * [currTuples workspace] <- sized by index AM (batch_tuples_workspace) > + * > + * The AM-specific opaque areas are accessed via accessor functions defined by > + * each table AM and index AM that supports the batch interfaces. > + */ > +typedef struct IndexScanBatchData > +{ > + XLogRecPtr lsn; /* index page's LSN */ Still doubtful this should be in generic infra (together with indexam_util_batch_unlock()). > +/* > + * Maximum number of batches (leaf pages) we can keep in memory. We need a > + * minimum of two, since we'll only consider releasing one batch when another > + * is read. > + * > + * The current maximum of 64 batches is somewhat of an arbitrary limit. Very > + * few scans ever get near to this limit in practice. > + */ > +#define INDEX_SCAN_MAX_BATCHES 64 > +#define INDEX_SCAN_CACHE_BATCHES 2 Just curious: How did you end up with INDEX_SCAN_CACHE_BATCHES == 2? > +/* > + * State used by table AMs to manage an index scan that uses the amgetbatch > + * interface. Scans use a ring buffer of batches returned by amgetbatch. > + * > + * Batches are kept in the order that they were returned in by amgetbatch, > + * since that is the same order that table_index_getnext_slot will return > + * matches in. However, table AMs are free to fetch table tuples in whatever > + * order is most convenient/efficient -- provided that such reordering cannot > + * affect the order that table_index_getnext_slot later returns tuples in. > + */ > +typedef struct BatchRingBuffer > +{ > + /* current positions in batches[] for scan */ > + BatchRingItemPos scanPos; /* scan's read position */ > + BatchRingItemPos markPos; /* mark/restore position */ > + > + IndexScanBatch markBatch; > + > + /* > + * headBatch is an index to the earliest still-valid batch in 'batches'. > + * In practice this must be the scan's current scanPos batch (scanBatch). > + */ > + uint8 headBatch; > + > + /* > + * nextBatch is an index to the next empty batch slot in 'batches'. This > + * is only actually usable when the scan is !index_scan_batch_full(). > + */ > + uint8 nextBatch; Is nextBatch basically the tail? The "only actually usable" bit is a bit confusing, given that index_scan_batch_full() is implemented by (nextBatch - headBatch) == INDEX_SCAN_MAX_BATCHES Maybe worth having a static assert somewhere to ensure that INDEX_SCAN_MAX_BATCHES isn't set to a value bigger than what fits into a uint8? > +/* > + * Append given batch to scan's batch ring buffer. > + */ > +static inline void > +index_scan_batch_append(IndexScanDescData *scan, IndexScanBatch batch) > +{ > + BatchRingBuffer *ringbuf = &scan->batchringbuf; > + uint8 nextBatch = ringbuf->nextBatch; > + > + ringbuf->batches[nextBatch & (INDEX_SCAN_MAX_BATCHES - 1)] = batch; > + ringbuf->nextBatch++; > +} Perhaps worth an assert ensuring that there's space in the ring buffer? > diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h > index 8847d7f94..b5b8f509a 100644 > --- a/src/include/executor/instrument_node.h > +++ b/src/include/executor/instrument_node.h > @@ -48,6 +48,12 @@ typedef struct IndexScanInstrumentation > { > /* Index search count (incremented with pgstat_count_index_scan call) */ > uint64 nsearches; > + > + /* > + * heap blocks fetched counts (incremented by index_getnext_slot calls > + * within table AMs, though only during index-only scans) > + */ > + uint64 nheapfetches; > } IndexScanInstrumentation; s/heap/table/ for anything new imo. > +/* > + * heapam_batch_resolve_visibility > + * Obtain visibility information for a TID from caller's batch. I really dislike these pointless restatements of the function name and the weird indentation of the function "title" that's indented differently across files and in the file. Since they're not in use in heapam_handler.c can we please not introduce them now? > + * > + * Called during index-only scans. We always check the visibility of caller's > + * item (an offset into caller's batch->items[] array). We might also set > + * visibility info for other items from caller's batch more proactively when > + * that makes sense. > + * > + * We keep two competing considerations in balance when determining whether to > + * check additional items: the need to keep the cost of visibility map access > + * under control when most items will never be returned by the scan anyway > + * (important for inner index scans of anti-joins and semi-joins), and the > + * need to not hold onto index leaf pages for too long. > + * > + * Note on Memory Ordering Effects > + * ------------------------------- > + * > + * visibilitymap_get_status does not lock the visibility map buffer, and > + * therefore the result we read here could be slightly stale. However, it > + * can't be stale enough to matter. > + * > + * We need to detect clearing a VM bit due to an insert right away, because > + * the tuple is present in the index page but not visible. The reading of the > + * TID by this scan (using a shared lock on the index buffer) is serialized > + * with the insert of the TID into the index (using an exclusive lock on the > + * index buffer). Because the VM bit is cleared before updating the index, > + * and locking/unlocking of the index page acts as a full memory barrier, we > + * are sure to see the cleared bit if we see a recently-inserted TID. > + * > + * Deletes do not update the index page (only VACUUM will clear out the TID), > + * so the clearing of the VM bit by a delete is not serialized with this test > + * below, and we may see a value that is significantly stale. However, we > + * don't care about the delete right away, because the tuple is still visible > + * until the deleting transaction commits or the statement ends (if it's our > + * transaction). In either case, the lock on the VM buffer will have been > + * released (acting as a write barrier) after clearing the bit. And for us to > + * have a snapshot that includes the deleting transaction (making the tuple > + * invisible), we must have acquired ProcArrayLock after that time, acting as > + * a read barrier. > + * > + * It's worth going through this complexity to avoid needing to lock the VM > + * buffer, which could cause significant contention. > + */ ISTM that moving this comment here is hiding it too much, I'm pretty sure there are other AMs using visibilitymap out there. Perhaps we should just move it to the top of visibilitymap.c? > +static void > +heapam_batch_resolve_visibility(IndexScanDesc scan, ScanDirection direction, > + IndexScanBatch batch, HeapBatchData *hbatch, > + BatchRingItemPos *pos) > +{ > + IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan->xs_heapfetch; > + int posItem = pos->item; > + int noSetItem, > + step; > + bool allbatchitemvisible; > + BlockNumber curvmheapblkno = InvalidBlockNumber; > + uint8 curvmheapblkflags = 0; > + > + Assert(hbatch == heap_batch_data(batch, scan)); > + > + /* > + * We better still have the TID recycling interlock (generally a pin on > + * the batch's index page) -- amunguardbatch has not been called yet > + */ > + Assert(!scan->batchImmediateUnguard); > + > + /* Determine the range of items to set visibility for */ > + if (ScanDirectionIsForward(direction)) > + { > + noSetItem = Min(batch->lastItem + 1, posItem + hscan->xs_vm_items); > + allbatchitemvisible = noSetItem > batch->lastItem && > + (posItem == batch->firstItem || > + (hbatch->visInfo[batch->firstItem] & HEAP_BATCH_VIS_CHECKED)); > + step = 1; > + } > + else > + { > + noSetItem = Max(batch->firstItem - 1, posItem - hscan->xs_vm_items); > + allbatchitemvisible = noSetItem < batch->firstItem && > + (posItem == batch->lastItem || > + (hbatch->visInfo[batch->lastItem] & HEAP_BATCH_VIS_CHECKED)); > + step = -1; > + } > + > + /* > + * Set visibility info for a range of items, in scan order. > + * > + * noSetItem is the first item (in the given scan direction) that won't be > + * set during this call. noSetItem often points to just past the end of > + * (or just before the start of) the batch's 'items' array. > + * > + * We iterate this way to avoid the need for 2 direction-specific loops, > + * since this is a hot code path that's sensitive to code size increases. > + */ I'm surprised this is better than two loops, tbh. It makes the actual looping code more complicated, with higher register pressure. > +/* ---------------- > + * heapam_batch_getnext - get the next batch of TIDs from a scan > + * > + * Called when we need to load the next batch of index entries to process in > + * the given direction. Caller passes us a batch and a batch position, which > + * has just been used to read all items from the batch in the direction passed > + * by caller. > + * > + * Returns the next batch to be processed by the index scan, or NULL when > + * there are no more matches in the given scan direction. Does not advance > + * caller's batch position; that is left up to caller. > + * > + * This is also where batches are appended to the scan's ring buffer. We > + * don't free any batches here, though; that is also left up to caller. > + * ---------------- > + */ > +static pg_attribute_hot IndexScanBatch > +heapam_batch_getnext(IndexScanDesc scan, ScanDirection direction, > + IndexScanBatch priorBatch, BatchRingItemPos *pos) > +{ Kinda seems nothing (except the one assert below) in here is heap specific? I.e. that every table AM will need pretty exactly this? Seems like it ought to be somewhere where it can be reused. Could be a static inline in an indexbatch.h file or such if inlining were desirable. > +/* ---------------- > + * heapam_batch_getnext_tid - get next TID from batch ring buffer > + * > + * Get the next TID from the scan's batch ring buffer, when moving in the > + * given scan direction. > + * ---------------- > + */ > +static pg_attribute_hot pg_attribute_always_inline ItemPointer > +heapam_batch_getnext_tid(IndexScanDesc scan, IndexFetchHeapData *hscan, > + ScanDirection direction, bool *all_visible) > +{ > + BatchRingBuffer *batchringbuf = &scan->batchringbuf; > + BatchRingItemPos *scanPos = &batchringbuf->scanPos; > + IndexScanBatch scanBatch = NULL; > + > + Assert(!scanPos->valid || batchringbuf->headBatch == scanPos->batch); > + Assert(scanPos->valid || index_scan_batch_count(scan) == 0); > + Assert(all_visible == NULL || scan->xs_want_itup); > + > + /* > + * Check if there's an existing loaded scanBatch for us to return the next > + * matching item's TID/index tuple from > + */ > + if (scanPos->valid) > + { > + /* > + * scanPos is valid, so scanBatch must already be loaded in batch ring > + * buffer. We rely on that here. > + */ > + Assert(batchringbuf->headBatch == scanPos->batch); > + > + scanBatch = index_scan_batch(scan, scanPos->batch); > + > + if (index_scan_pos_advance(direction, scanBatch, scanPos)) > + return heapam_batch_return_tid(scan, hscan, direction, > + scanBatch, scanPos, > + all_visible); > + } > + > + /* > + * Either ran out of items from our existing scanBatch, or it hasn't been > + * loaded yet (because this is the first call here for the entire scan). > + * Try to advance scanBatch to the next batch (or get the first batch). > + */ > + scanBatch = heapam_batch_getnext(scan, direction, scanBatch, scanPos); > + > + if (!scanBatch) > + { > + /* > + * We're done; no more batches in the current scan direction. > + * > + * Note: scanPos is generally still valid at this point. The scan > + * might still back up in the other direction. > + */ > + return NULL; > + } > + > + /* > + * Advanced scanBatch. Now position scanPos to the start of new > + * scanBatch. > + */ > + index_scan_pos_nextbatch(direction, scanBatch, scanPos); > + Assert(index_scan_batch(scan, scanPos->batch) == scanBatch); > + > + /* > + * Remove the head batch from the batch ring buffer (except when this new > + * scanBatch is our only one) > + */ > + if (batchringbuf->headBatch != scanPos->batch) > + { > + IndexScanBatch headBatch = index_scan_batch(scan, > + batchringbuf->headBatch); > + > + /* free obsolescent head batch (unless it is scan's markBatch) */ > + tableam_util_free_batch(scan, headBatch); > + > + /* Remove the batch from the ring buffer */ > + batchringbuf->headBatch++; > + } > + > + /* In practice scanBatch will always be the ring buffer's headBatch */ > + Assert(batchringbuf->headBatch == scanPos->batch); > + > + return heapam_batch_return_tid(scan, hscan, direction, scanBatch, scanPos, > + all_visible); > +} > + A lot of this also seems like it should be generic code. Where it actually starts to be heapam specific is in heapam_batch_return_tid() - and there's two calls to that. So maybe there should be generic helper for the bulk of heapam_batch_getnext_tid() that's called by the heapam version, which either returns a NULL upwards or does a single heapam_batch_return_tid(). > +/* ---------------- > + * heapam_index_fetch_heap - get the scan's next heap tuple > + * > + * The result is a visible heap tuple associated with the index TID most > + * recently fetched by our caller in scan->xs_heaptid, or NULL if no more > + * matching tuples exist. (There can be more than one matching tuple because > + * of HOT chains, although when using an MVCC snapshot it should be impossible > + * for more than one such tuple to exist.) > + * > + * On success, the buffer containing the heap tup is pinned. The pin must be > + * dropped elsewhere. > + * ---------------- > + */ > +static pg_attribute_hot pg_attribute_always_inline bool