Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vwPoB-005PhD-1A for pgsql-hackers@arkaria.postgresql.org; Sat, 28 Feb 2026 19:17:00 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vwPo9-00BHcD-0j for pgsql-hackers@arkaria.postgresql.org; Sat, 28 Feb 2026 19:16:57 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vwPo8-00BHc5-1t for pgsql-hackers@lists.postgresql.org; Sat, 28 Feb 2026 19:16:56 +0000 Received: from mail-wm1-x333.google.com ([2a00:1450:4864:20::333]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vwPo3-00000001pHE-0DDi for pgsql-hackers@lists.postgresql.org; Sat, 28 Feb 2026 19:16:55 +0000 Received: by mail-wm1-x333.google.com with SMTP id 5b1f17b1804b1-483770e0b25so27417825e9.0 for ; Sat, 28 Feb 2026 11:16:51 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1772306209; cv=none; d=google.com; s=arc-20240605; b=HL/la5phHcTOV5YXe9IIa2dMV0DrRWgwHM6iFFfyxpVeN8KZeaaTfZsPTlbTcACIbI CkRvVVeXdkHJKJt1CH4XbeJkQM7pRtEvQTyLMuDD2HxHomevgwZpJKkIIcgpLghuok4H x6OI5YlcCET76lkAszZxP9hmBIZYaEoTm6UEvSSkfIxVJRu8f9J3HYjS0IExuXYKlDGg d+2pDlAQWt9j4fhBGHnbRRe5mqek/fRD7B5MB+svetvXNUyZTbBm/P7t91flaNn3KaW6 hzr8r1oDYKL+gSHanaii8byEes89TSEwWWbW4diwKlXVE3GvhjCS3diL86fArJBZecSn i/rA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=raNn1YkEtsrW/Kb7EDZcyYaJpGL1t51IXsq6OreVl5I=; fh=LK45qHILvhxIXGnQmKOn+5hrWo8LDFlO6YKDn/xACyQ=; b=Ab6vXuH1TVwtIFo96jk5iAaLNTQieeg2j6gq9QkBqGUQD6GLTr3lqznR60EQqhkgxM oafU0SysoAWi1BtIUuz6PzX7qOdNeIeqwAqYQBa8dr63FjgggrPDTzZlkx0WBiXI1Tuz lHnJBVP023mT0sQLW9nZTDdCL0WBlCpX9Tapa5opUnz6gtERF0WqzkF0F+3y3sr1XxUA y/ctz6cuez8qRdVIpRVpnNW+JrKwL6HUc2iNCs1hzLUvlPFpcrQGoXPBv2UAxik4+A1c 6iKBihgXaPfX634MtY8kkF2T+gjZS/JZROC112wZKS8LUpUVKDx9BQYEEsAqMOADuTvj PyuA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1772306209; x=1772911009; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=raNn1YkEtsrW/Kb7EDZcyYaJpGL1t51IXsq6OreVl5I=; b=zXj/13kl2MwcdE8tc7x7dxwQE59vTLmbzkdC93lSA3bEvM70Dy/V3rIgHFe3h/N3VB aasc4ZChr8X3c02l0e1kVWTVOjc5qKvD08obHVqlP4isCXTUBL9dWM/K+WVsy6GYkelm YOwUrZNKP0gxeosHKrwPliVclJo9aZkf7ThjvZrrQtnuxPsKxu6ss+CyIgQ7dM/K7vaP 2OmykG2OtSapY4QW5HDkS4DEEY59K6HGnxiKlL7eCSi1RERoGFpSpd4eSz36+zZ5HzrM QOeaPHPueIJRD2bSicblHUxXVGtHSDalMZw3TI9MZyBDHfyhKPbybB7IZfxga51IXyhs zrog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1772306209; x=1772911009; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=raNn1YkEtsrW/Kb7EDZcyYaJpGL1t51IXsq6OreVl5I=; b=XiBxSDGR20P/QRZWRkk7bI0gZtiz+T07hn2yu3GcXlR7nmFzAU/GWOPQvR1NAdtVi8 uR4qyT6nHWp0Hn7TL+pE3rgZ1qAcJx4gDSHOWUB8IAB2srF63grK/ae4Pr85K9cGaf7y ulwu5dU8l5an6QoYApLP0E2/SfeCiHA56sZ3PYFucNdTz/R0gR1vmboqDq/Dw4T+0n88 mYhS2eAPHewgKBmoaO2S8h4XkDyNiXa0vcX3YQET4neEdkDv7vIScKdgE2c8MDouX4q4 YXpVoATT/H5bS22+uQjGayHLtYZXU1+5E5qydpJ1+eo8ZLcBVN0DX86inRulKqr4Lbd0 6kqQ== X-Forwarded-Encrypted: i=1; AJvYcCW/0AAAcYrdhcln5DHUWgfdb4j0v+on9r6jJbR4Y/nikKPRQpHf+gt8s0hvt5D+nwy2Rfw1PiZpBCDpeouq@lists.postgresql.org X-Gm-Message-State: AOJu0YyZFb1f/f62VVjwNCVftndKSy5wYGVXL13xjcqyrP6S3lGFd/9w VsisbVa6duW9EMnY5+TIqdb6ohO9gy9/oX/5pok/fb0zdSPxPqcCt4S6qgWSbrWvzboraFCwgwf BP5ZI9MS3pZiWSLqiZ+25OQ4h0P2GQt5MZvyKKtNZWQ== X-Gm-Gg: ATEYQzw8Tu5VBiFIY/G+R4ZRT8T+xam5AlpHvJQyAHFa6A1+pQKM1q76K+fyZSVGtQw 3nnsUbxtoCP+xHUIiTfYluuKOQqjVhz4gzaTmScZ4p1sf/bZ09X4fEcKEDEIqJkKAbkGg5YG6Rg LKHv7aDVJrCEXAWclseY8t9hKuZcIjuwZFSyAi1KomAmxkdLer89cqod2PBHj2N+ujh9ltbYkIq nmyts10bJuIB8wbwb4BFMJrYRa6WpK77nT67byiLoxQlLIV1PD34R93SRsI2HuX5WfwDH6f0ENr zRqo7RM= X-Received: by 2002:a05:600c:444d:b0:482:eec4:772 with SMTP id 5b1f17b1804b1-483c9c2d742mr107328135e9.32.1772306208397; Sat, 28 Feb 2026 11:16:48 -0800 (PST) MIME-Version: 1.0 References: <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl> <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep> In-Reply-To: From: Peter Geoghegan Date: Sat, 28 Feb 2026 14:16:22 -0500 X-Gm-Features: AaiRm50n-yQuS1SJw3LknHweoflLT9MvaodoaDT-3H60Uxqpp6Wf0AZBB_Jq2_g Message-ID: Subject: Re: index prefetching To: Andres Freund Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Fri, Feb 27, 2026 at 6:52=E2=80=AFPM Andres Freund = wrote: > I know you just moved this code, but isn't it kinda bogus to use if > (RelationIsPermanent()) here? If the fake LSN is used on a durable page,= it > wouldn't be safe to just use XLogAssignLSN(), as it'd not emit an FPI for= the > page. > > All the gist callers use if (!RelationNeedsWAL()) to protect calls to thi= s, so > it's not a bug, but still seems somewhat odd. Will fix. > I'd probably not list the AMs here, because as of this commit, it'd not b= e > true yet. I'll fix this too. > Other than these points, I think it's probably worth pushing this soon? Great. I think that it makes sense to push the first 2 patches together. I'll work on doing that towards the end of the next week (so around Friday March 6). > BufferGetLSNAtomic()'s comment about not needing a lock if hint bits aren= 't > used looks a bit bogus to me with this use-case - it makes it sound like = we > don't rely on the correctness of the return value if hint bits aren't > enabled. But that's bogus, we do. The relevant difference afaict is that > without WAL logging of hint bits we will never set the LSN of a page that= is > read locked, and therefore any caller that has at least a read lock doesn= 't > need GetLSNAtomic() unless XLogHintBitIsNeeded(). Do you think this needs to be fixed in the first patch of this series? > > diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtr= ee/README > > index 53d4a61dc..cb921ca2e 100644 > > --- a/src/backend/access/nbtree/README > > +++ b/src/backend/access/nbtree/README > > @@ -485,9 +485,8 @@ We handle this kill_prior_tuple race condition by h= aving affected index > > scans conservatively assume that any change to the leaf page at all > > implies that it was reached by btbulkdelete in the interim period when= no > > buffer pin was held. This is implemented by not setting any LP_DEAD b= its > > -on the leaf page at all when the page's LSN has changed. (That won't = work > > -with an unlogged index, so for now we don't ever apply the "don't hold > > -onto pin" optimization there.) > > +on the leaf page at all when the page's LSN has changed. (This is why= we > > +implement "fake" LSNs for unlogged index relations.) > > > > Does the reference to btbulkdelete actually cover all the relevant cases? > Seems like a e.g. a concurrent _bt_delete_or_dedup_one_page() would also = be a > problem? It's a bit complicated, though the exact details don't seem particularly important to me. Currently, it is barely possible for a query to call _bt_readpage to read a page's items, and subsequently successfully set some LP_DEAD bits on that same page even when a concurrent deduplication pass occurred on the same page in the interim. In practice this can only happen during a !dropPin nbtree index scan -- so it cannot ever happen during most index scans. So, to answer your question: no, it isn't strictly necessary to give up on setting LP_DEAD bits from within _bt_killitems when there has been a concurrent _bt_delete_or_dedup_one_page call. I think this might work unreliably, its possibility is more an artefact of how things worked before the dropPin behavior was added back in 2015. The amgetbatch interface *requires* that all amgetbatch index AMs use nbtree's dropPin behavior for all scans (though we no longer call it dropPin; it's just how amgetbatch operates). I think that it's still the case that we only really need to worry about concurrently TID recycing hazards of the kind that can be avoided by requiring btbulkdelete to acquire a cleanup lock. > One thing that's somewhat sad about this logic is that the > latestlsn =3D BufferGetLSNAtomic(buf); > Assert(so->currPos.lsn <=3D latestlsn); > if (so->currPos.lsn !=3D latestlsn) > > check will trigger even if all the changes were done due to hint bit > FPIs. E.g. due to another backend hinting other index entries on the same= page > as dirty... But anyway, that's not the "fault" of your change. Yeah, that is suboptimal. But it's worked that way for most index scans for over 10 years now. I complained about one aspect of how this could be ineffective back in 2017: https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZ= n9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d The problem is greatly ameliorated by the fact that the deletion process opportunistically checks TIDs from index tuples not marked LP_DEAD when it is cheap to do so in passing (when the TID points to a heap block we need to access anyway). IIRC, when the regression tests are run, the ntbree tuple deletion process deletes many more non-LP_DEAD-marked index tuples than LP_DEAD marked tuples (obviously this is workload dependent). > This is somewhat orthogonal and just quick idea: > > I wonder if the worry about the overhead of BufferGetLSNAtomic() could be > addressed by adding a ReleaseBuffer() version that returns the current LS= N. I'm open to anything that can help. > Have you done any testing to see if the added BufferGetLSNAtomic() calls > introduce a measurable overhead when using unlogged tables? Yes. It definitely matters. That's why I've been treating that patch of Andreas Karlsson's to make BufferGetLSNAtomic use an atomic op as a dependency (though without posting it to this thread). My testing seems to show that the patch almost eliminates the problem altogether. This is particularly important for avoiding regressions in hash index scans. It would be good to get the BufferGetLSNAtomic atomic op patch out of the way soon. Since it's clearly independently useful. > If so, could the > BufferGetLSNAtomic() be skipped if there currently the scan doesn't curre= ntly > have any killed items? Not really, no. There's no way to know whether the page will have any index tuples that can be LP_DEAD marked. We could perhaps invent heuristics that try to predict it based on past trends, but that seems rather unappealing. Especially considering that there are good ways of making BufferGetLSNAtomic a lot cheaper. > This is a huge change. Is there a chance we can break it up into more > manageable chunks? I'm open to any ideas about that (like the one you go on to suggest about introducing IndexFetchHeapData.xs_blk in its own commit). > A first note: Seems like this needs to update doc/src/sgml/indexam.sgml? I arbitrarily decided to put all of the sgml doc updates into the later prefetching commit. Purely because it was hard to avoid talking about prefetching in the docs. I can put minimal sgml doc updates in the amgetbatch commit. > > +/* > > + * Location of a BatchMatchingItem that appears in a IndexScanBatch re= turned > > + * by (and subsequently passed to) an amgetbatch routine > > + */ > > +typedef struct BatchRingItemPos > > +{ > > + /* Position references a valid BatchRingBuffer.batches[] entry? *= / > > + bool valid; > > + > > + /* BatchRingBuffer.batches[]-wise index to relevant IndexScanBatc= h */ > > + uint8 batch; > > + > > + /* IndexScanBatch.items[]-wise index to relevant BatchMatchingIte= m */ > > + int item; > > +} BatchRingItemPos; > > Does item actually need to be a 32bit int? This way there's a hole and > presumably IndexScanBatch.items[] has a limited size? We don't allocate more than one of these BatchRingItemPos structs anywhere, and always do so on the stack. I think that we rely on them being signed within _bt_readpage, when it generates its return value, here: """ return (newbatch->firstItem <=3D newbatch->lastItem); """ During a forwards scan _bt_readpage that returns no items, we'll return false when "newbatch->lastItem =3D itemIndex - 1;" makes newbatch->lastItem have the value -1 (note that newbatch->firstItem is always 0 during a _bt_readpage in the forwards scan direction). > > +/* > > + * Matching item returned by amgetbatch (in returned IndexScanBatch) d= uring an > > + * index scan. Used by table AM to locate relevant matching table tup= le. > > + */ > > +typedef struct BatchMatchingItem > > +{ > > + ItemPointerData heapTid; /* TID of referenced heap item */ > > If we're generalizing this, I'd be inclined to rename it away from heap t= o > table. WFM. > > + BlockNumber currPage; /* Index page with matching items= */ > > + BlockNumber prevPage; /* currPage's left link */ > > + BlockNumber nextPage; /* currPage's right link */ > > Hm. Are these, and some of the later fields, generic enough for all index > types? There IIRC are index types out there that don't use buffers or blo= cks > in that sense. > > I wonder if individual index AMs need a way to influence what is stored p= er > batch. How strongly do you feel about it? Many months ago, this struct provided a way for index AMs to provide their own state in each batch. But I ultimately concluded that the overhead of that approach was a problem. It's not as if there's no way that the index AM can manage its own state that describes the progress of the scan. For example, nbtree still manages its own array key state (and must provide its own amposreset function to provide indexbatch.c with a way to invalidate that state, such as when the scan direction changes). Index AMs aren't obligated to use fields like currPage and nextPage; they are explicitly private state. So I think it's perfectly possible for an index AM that doesn't use buffers or pages to use the amgetbatch interface, while fully managing its own progress state. It would be straightforward with an index type that didn't support scanning backwards. Even if the scan did support scrolling in either direction, it could map currPage to some kind of private state entry. > > + Buffer buf; /* currPage buf (invalid = means unpinned) */ > > + XLogRecPtr lsn; /* currPage's LSN */ > > + > > + /* scan direction when the index page was read */ > > + ScanDirection dir; > > + > > + /* > > + * knownEndLeft and knownEndRight are used by table AMs to track = whether > > + * there may be matching index entries to the left and right of c= urrPage, > > + * respectively. This helps them to avoid repeatedly calling amg= etbatch. > > + */ > > + bool knownEndLeft; > > + bool knownEndRight; > > If I understand correctly, these mean that that we'd reach the end of the= scan > in one direction? If so, why does it use left/right instead of > forward/backward? The names are based on the existing, similar moreLeft and moreRight fields used by both nbtree and hash. I'm not attached to any of these names. Feel free to suggest alternatives. > > + /* > > + * Matching items state for this batch. Output by index AM for t= able AM. > > + * > > + * The items array is always ordered in index order (ie, increasi= ng > > + * indexoffset). When scanning backwards it is convenient for in= dex AMs > > + * to fill the array back-to-front, so we start at the last slot = and fill > > + * downwards. Hence they need a first-valid-entry and a last-val= id-entry > > + * counter. > > + */ > > + int firstItem; /* first valid in= dex in items[] */ > > + int lastItem; /* last valid ind= ex in items[] */ > > I'd make these unsigned. They must be the same type as BatchRingItemPos.item offsets. So I could probably make them int16, but it wouldn't be easy to make them unsigned due to the aforementioned _bt_readpage issue. > Not sure what "indexoffset" really means in a generic abstraction. It means an offset into the currTuples buffer at which the table AM can find an IndexTuple to return, during an index-only scan. Can you suggest an alternative terminology and/or presentation? > > + /* info about killed items if any (killedItems is NULL if never u= sed) */ > > + int numKilled; /* number of curr= ently stored items */ > > + int *killedItems; /* indexes of killed item= s */ > > Signedness again. Another case where the underlying type used must be the same type as BatchRingItemPos.item offsets. > I'd specify what is being indexed by killedItems. Will fix. > What size is killedItems? In practice it is always IndexScanDescData.maxitemsbatch, when actually allocated. IOW, for nbtree it is MaxTIDsPerBTreePage (1358 items), while for hash it is MaxIndexTuplesPerPage (408 items). > I wonder if "killed" is really the right term in the generic code, it's > unclear whether it refers to table or index items. And the past tense is = a bit > odd too, since nothing may yet been "killed". Will come up with a better name for the next revision. > > + /* > > + * If we are doing an index-only scan, this array stores the per-= item > > + * visibility flags BATCH_VIS_CHECKED and BATCH_VIS_ALL_VISIBLE > > + */ > > + uint8 *visInfo; /* per-item visibility flags, or = NULL */ > > Why just for index only? Seems like it'd be rather useful to batch visib= ility > checks for plain index scans too? Obviously that doesn't have to be > implemented immediately, but I'd probably not preclude it in generic > infrastructure. Whether or not we allocate visInfo space from the generic indexbatch.c routines is determined by IndexScanDesc.xs_want_itup, though. > This seems pretty heap specific. I guess you didn't put it in > IndexFetchHeapData because it'd be somewhat complicated to per-batch stat= e in > there? Yes. And because we really want to keep the visInfo space right next to the corresponding BatchMatchingItem/items[] state. Allocating and releasing all batch state together is important. > Perhaps my earlier suggestion about having "private" per-batch state for = the > indexam is required for table AMs too. I guess we could store an offset = to > the "tableam private" and "indexam private" data in each batch. I think that would be painful. Especially if the goal is to allocate visInfo in memory that is exclusively managed by heapam. > > + /* > > + * If we are doing an index-only scan, this is the tuple storage = workspace > > + * for the matching tuples (tuples referenced by items[]). It is= of size > > + * BLCKSZ, so it can hold as much as a full page's worth of tuple= s. > > + * currTuples points into the trailing portion of this allocation= , just > > + * past items[]. It is NULL for plain index scans. > > + */ > > This was a bit hard to understand before, but in generic code it seems li= ke it > is too specific to the way it was used in nbtree. How so? > > + > > +/* > > + * Maximum number of batches (leaf pages) we can keep in memory. We n= eed a > > + * minimum of two, since we'll only consider releasing one batch when = another > > + * is read. > > + */ > > +#define INDEX_SCAN_MAX_BATCHES 2 > > +#define INDEX_SCAN_CACHE_BATCHES 2 > > Hm. These limits are increased in a later patch adding prefetching to > heapam. I guess you kept it low until then to avoid some overhead? Perha= ps > that's indicative that hardcoding it isn't great? The fact that I only increase INDEX_SCAN_MAX_BATCHES in the later prefetching patch was a fairly arbitrary choice. > Hm, it's somewhat confusing that we have this now, but still have > ->index_fetch_tuple() etc. I don't see a way around that. Not as long as there's a need to support table_index_fetch_tuple_check, which is fundamentally about passing a TID owned by the caller. > Don't really have an opinion about this, but it's not clear to me why thi= s > patch changed this? Really? You actually suggested it (albeit in a completely different context). It just felt like a good idea to make the allocation dynamic, now that I'm adding more stuff to the instrumentation struct. I can change it back, if you prefer. > If we want to change it, could we do that in a prereq patch? Sure. > This seems independent too. Let's just introduce xs_blk and do this chang= e > separately? The we can get it out of the way? Will do. The next revision will put these changes into their own patch. > > + * Note on Memory Ordering Effects > > + * ------------------------------- > > + * > > + * visibilitymap_get_status does not lock the visibility map buffer, a= nd > > + * therefore the result we read here could be slightly stale. However= , it > > + * can't be stale enough to matter. > > + * > > + * We need to detect clearing a VM bit due to an insert right away, be= cause > > + * the tuple is present in the index page but not visible. The readin= g of the > > + * TID by this scan (using a shared lock on the index buffer) is seria= lized > > + * with the insert of the TID into the index (using an exclusive lock = on the > > + * index buffer). Because the VM bit is cleared before updating the i= ndex, > > + * and locking/unlocking of the index page acts as a full memory barri= er, we > > + * are sure to see the cleared bit if we see a recently-inserted TID. > > I don't think the index page locking on the inserter side relevant - that= side > of the memory barrier paring is provided solely by visibilitymap_set(), w= hich > uses an exclusive lock and thus also provides a barrier (so does the heap > buffer locking). These comments were moved from nodeIndexScan.c. > > + * Deletes do not update the index page (only VACUUM will clear out th= e TID), > > Well, not just, right? We can only prune when inserting into the index? Again, these are direct copies of the existing comments. > > + * so the clearing of the VM bit by a delete is not serialized with th= is test > > + * below, and we may see a value that is significantly stale. However= , we > > + * don't care about the delete right away, because the tuple is still = visible > > + * until the deleting transaction commits or the statement ends (if it= 's our > > + * transaction). In either case, the lock on the VM buffer will have = been > > + * released (acting as a write barrier) after clearing the bit. And f= or us to > > + * have a snapshot that includes the deleting transaction (making the = tuple > > + * invisible), we must have acquired ProcArrayLock after that time, ac= ting as > > + * a read barrier. > > + * > > + * It's worth going through this complexity to avoid needing to lock t= he VM > > + * buffer, which could cause significant contention. > > + */ > > If we are concerned about read side ordering, it seems a memory barrier o= n the > read side would suffice (and would not cause contention). > > There's a much more recent memory barrier already than the ProcArrayLock > during snapshot acquisition - the pinning of the visibilitymap is a barri= er. > > Which is good, because we should really remove the ProcArrayLock acquisit= ion > when we are able to reuse snapshots... And it'd be nasty if that introduc= ed a > hard to hit memory ordering issue. Once again, these are direct copies of the existing comments. Can you suggest a replacement comment block that comprehensively addresses all your concerns? I think that these are all independent issues. As far as I know there's no reason to think that the rules in this area have changed (or need to change) due to our structural/layering revisions. > > +static void > > +heapam_batch_resolve_visibility(IndexScanDesc scan, IndexScanBatch bat= ch, > > + BatchRing= ItemPos *pos) > > +{ > > FWIW, for me this is not inlined into the caller, which means that there'= s a > function call even for the case where the visibility information would al= ready > be cached. > I'd at least mark this branch likely, so that the compiler can see that i= t'd > make sense to move this check into the callsites. At least on gcc that se= ems > to suffice for the hot path to not need the function call. I'll have something like that in the next revision. > Hm. It'll be pretty common for the block number of two consecutive items = to be > on the same page. The comments here suggest that the VM lookup is a rele= vant > performance factor, so why not check if the last VM_ALL_VISIBLE() was for= the > same block? > > While visibilitymap_get_status() isn't particularly expensive, it still i= s at > least two external function calls (visibilitymap_get_status() and > BufferGetBlockNumber()) and some nontrivial math. > > > Heh, indeed: A quick hacky implementation makes > SELECT count(*) FROM pgbench_accounts > about 16% faster. I'm aware of several ways to speed up the bulk VM lookups, but I held off because it didn't seem particularly crucial to this patch. For example, Matthias's bug fix for GiST index-only scans actually adds bulk VM lookups that use SIMD instructions. Seems like independent work? I can add a patch like this if you think that it's important. > > + /* > > + * It's safe to drop the batch's buffer pin as soon as we've reso= lved the > > + * visibility status of all of its items > > + */ > > + if (allbatchitemvisible && scan->MVCCScan) > > + { > > + Assert(batch->visInfo[batch->firstItem] & BATCH_VIS_CHECK= ED); > > + Assert(batch->visInfo[batch->lastItem] & BATCH_VIS_CHECKE= D); > > + > > + ReleaseBuffer(batch->buf); > > + batch->buf =3D InvalidBuffer; > > + } > > It doesn't seem like it can be right that heapam releases an index buffer > directly here. I think that it makes perfect sense that the AM table controls this. After all, holding on to buffer pins exists purely for heapam's benefit -- it has nothing to do with index AM considerations. (Actually, it kinda does with the _bt_killitems safety stuff, but IMV that's just a table AM problem disguised as an index AM problem. And one that can be fixed in a table-AM-agnostic way by always using the LSN trick.) The TID recycling hazards we're worried about here are inescapably tied to how heapam VACUUM sets LP_DEAD stub line pointers in heap pages to LP_UNUSED. It's entirely possible that another table AM will unconditionally free each batch's buffer pin (which is close to what heapam actually does). Or that such a table AM will have a completely different policy. > What, e.g., if the indexam requires multiple buffers to be > pinned for correctness? I agree that's not unusual. In fact, we often require this within hash index scans. That is handled in the later hash index patch by calling IncrBufferRefCount against a to-be-returned batch's buffer just before it is returned, as needed. This seems like a fairly elegant solution. It allows the index AM to hold its own pin reference, which can be released either before or after the table AM's reference, without any special care. > > +static pg_attribute_hot bool > > +heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction, > > + TupleTableSlot *slot) > > +{ > It might be worth putting this function into a static inline with a > usebatchring argument, and calling it with a constant true/false from the > top-level of heapam_index_getnext_slot(). Right now the code is a fair b= it > less tight due to both branches being supported inside the loop. > > Or even just storing scan->usebatchring in a local var (and passing it to > index_fetch_heap()), so the compiler at least can realize that it won't c= hange > between iterations. I experimented with these ideas yesterday, but for whatever reason didn't see much of any win. I'll revisit it, though. > > + tid =3D index_getnext_tid(scan, direction= ); > > + > > + if (tid !=3D NULL && scan->xs_want_itup) > > + scan->xs_visible =3D VM_ALL_VISIB= LE(scan->heapRelation, > > + = ItemPointerGetBlockNumber(tid), > > + = &hscan->vmbuf); > > I'm a bit confused about the need for xs_visible (and also xs_heaptid, bu= t > that's old). Afaict it's just used to store very transient information t= hat > could just be passed around upwards? Yes. I agree that that's a little weird. > Storing and then shortly after reading data can lead to increased latency= due > to store-forwarding. And reading from memory is more expensive than just = doing > it via the register. And the compiler can't optimize the memory writes ou= t in > this kinda code, because it won't be able to proof it's never read later. Again, I didn't see much success when I experimented with this yesterday. But it could be a question of needing the right query to see any benefit. > > + else > > + { > > + /* > > + * We didn't access the heap, so we'll ne= ed to take a > > + * predicate lock explicitly, as if we ha= d. For now we do > > + * that at page level. > > + */ > > + PredicateLockPage(hscan->xs_base.rel, > > + ItemPoi= nterGetBlockNumber(tid), > > + scan->x= s_snapshot); > > + } > > FWIW, this does show up in profiles here (partially due to overhead of th= e > call, partially due to it being annoyingly expensive to get the block fro= m a > tid). I've seen that too. > Seems we could amortise this from once-per-tuple to once-per-block for > the pretty common case of there being multiple tids for the same block in= a > row? It's not obvious that it'd be a win, since there's no convenient choke point. We can't use IndexFetchHeapData.xs_blk for this, since that is only set when the item was not all-visible (whereas this code path is only used when it is). So we'd need to add a new field to IndexFetchHeapData, just for this. We'd need to call ItemPointerGetBlockNumber, to check if there's been any change in this "all-visible heap block" field. You might think that we could piggy-back some of this work on the read stream callback. But this code is only used in the happy path, where tuples are all-visible -- where we don't expect the read stream callback to be called. The most promising approach might also be the simplest: We could gate the call to PredicateLockPage using something similar to "MySerializableXact =3D=3D InvalidSerializableXact)". Maybe it'd be possible to store that information only once, in IndexFetchHeapData, or maybe in IndexScanDesc? > > + /* > > + * Return matching index tuple now set in scan->x= s_itup (or return > > + * matching heap tuple now set in scan->xs_hitup)= . > > + * > > + * Note: we won't usually have fetched a heap tup= le into caller's > > + * table slot. This is per the table_index_getne= xt_slot contract > > + * for scan->xs_want_itup callers. > > + */ > > + return true; > > That contract is ... not very clear to me. table_index_getnext_slot() sa= ys: > > * Fetch the next tuple from an index scan into `slot`, scanning in the > * specified direction. Returns true if a tuple was found, false otherwis= e. > * > * The index scan should have been started via table_index_fetch_begin(). > * Callers must check scan->xs_recheck and recheck scan keys if required. > * > * Index-only scan callers (that pass xs_want_itup=3Dtrue to index_begins= can) > * can consume index tuple results by examining IndexScanDescData fields = such > * as xs_itup and xs_hitup. The table AM won't usually fetch a heap tupl= e > * into the provided slot in the case of xs_want_itup=3Dtrue callers. > > > I don't know what "won't usually" here really means and waht the caller i= s > supposed to do with that. Most table_index_getnext_slot callers don't need to do anything about this. But nodeIndexonlyscan.c and similar callers (basically all index-only scan style callers) expect to examine certain IndexScanDesc fields. They want to examine IndexScanDesc.xs_itup for the "slot that was set by table_index_getnext_slot", and other similar items. This is admittedly a bit odd. These callers do really need a table slot -- that's required when heap accesses turn out to be needed on the heapam side. But they don't really use the table slot directly themselves. Maybe it would make sense to invent a new table_index_getnext_slot-like function, used only by the IndexScanDesc.xs_itup callers. Passing the same table slot again and again wouldn't be required with this function, which would be clearer (the underlying heapam.c code would still require a table slot, but we wouldn't be pretending to return heap tuples using that slot anymore). > It's decidedly not free to store a tuple in a slot, so it seems a bit sil= ly > that we do so unnecessarily if we have to verify tuple visibility. We're not actually doing so for these callers (IndexScanDescData.xs_want_itup callers), except when we need to do heap fetches. > > From 34ad62ef8f43550ccc2c0b2e2c41f30205d37716 Mon Sep 17 00:00:00 2001 > > From: Peter Geoghegan > > Date: Thu, 12 Feb 2026 14:06:47 -0500 > > Subject: [PATCH v11 04/12] Don't allocate _bt_search stack > > IIC we've talked about this one for a while. Seems we could pull it ahead= of > the larger 03? Not pointlessly allocating memory only needed for modific= ation > during read-only searching seems like a hard-to-argue about improvement. Will do. Maybe I can commit this one right after the earlier patches. > > Subject: [PATCH v11 05/12] Limit get_actual_variable_range to scan one = index > > leaf page. > > I skipped looking at this, as it seems like a separate axis. I have marke= d it > as something to look at in the other thread though. It is at least somewhat related to the IndexScanDescData.xs_want_itup callers I just went into. > > From 3fbfb87b6f3ca3c4246c1d3145b452b25f398421 Mon Sep 17 00:00:00 2001 > > From: Thomas Munro > > Date: Tue, 13 Jan 2026 20:44:14 +0100 > > Subject: [PATCH v11 06/12] Introduce read_stream_{pause,resume,yield}()= . > > > > Note: I (pgeoghegan) have extended read_stream_yield to support yieldin= g > > that avoids affecting the io_combine_limit mechanism. > > I doubt that goes far enough into fixing the problem with yielding > early. Submitting a pending read before yielding makes sense, but it does= n't > really help with actually issuing IO early enough to actually get benefit= s > from AIO. I agree. > > + /* > > + * Finally, remember new scan direction. > > + * > > + * Note: we needed to set xs_read_stream_dir to NoMovementScanDir= ection > > + * momentarily to avoid spuriously prefetching more blocks from w= ithin the > > + * read stream callback. Once we return, the read stream can be = used to > > + * fetch blocks in the opposite scan direction. > > I don't follow, if we are resetting, nothing should look at the > xs_read_stream_dir. Ah, I see that you elsewhere have a comment saying = this > is just needed due to a bug. I fixed that bug in the past few days. It'll be in the next revision. > Minor nit: I find it pretty odd that we check for yielding if xs_yield_c= heck > is false. > Where is that 4 coming from? It doesn't seem like having processed more t= han 4 > batches is a particularly good proxy for anything? > I don't really understand what logic behind yielding is. (...Various other well-justified criticisms of yielding omitted...) I like your idea of fixing the problems in this area (in the most recent email, that I have yet to begin to respond to) by inventing a new READ_STREAM_SLOW_START, and using it selectively, as indicated by a hint passed in by the executor. Since I'm going to pursue that approach next, there isn't much point in responding to your specific points about yielding now. > What's the workload showing the need for yielding most extremely? Is that= the > merge join case we talked about before? I'd like to experiment with a few > other approaches. Yes. > Ran out of time & energy for the moment, more later. Thanks for the review! -- Peter Geoghegan