Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vwnab-00Dpw0-0J for pgsql-hackers@arkaria.postgresql.org; Sun, 01 Mar 2026 20:40:33 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vwnaZ-00E50p-2H for pgsql-hackers@arkaria.postgresql.org; Sun, 01 Mar 2026 20:40:31 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vwnaZ-00E50h-0j for pgsql-hackers@lists.postgresql.org; Sun, 01 Mar 2026 20:40:31 +0000 Received: from mail-wm1-x32c.google.com ([2a00:1450:4864:20::32c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vwnaV-00000001v3p-3dOQ for pgsql-hackers@lists.postgresql.org; Sun, 01 Mar 2026 20:40:30 +0000 Received: by mail-wm1-x32c.google.com with SMTP id 5b1f17b1804b1-48379a42f76so30057135e9.0 for ; Sun, 01 Mar 2026 12:40:27 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1772397622; cv=none; d=google.com; s=arc-20240605; b=Et+0GAOvuOjc0ftoGg2z3GitRicnLiQdWKKjnpU1F96pnLeTbiePYBW9d6VTGg8zxG CPm7S0bT4ZhZdjfNUvGem1nfCHGdY4XgLl7iIaEJc1+kVvOS/m65j5McV/RP5SNMwO+6 Y5mJ+jPI3pi2WlInyA6uCcVpCcgpKwIsu0E7NKXHWVLDRGo9tldOxwNqLDFTAi8rXD1m LkERye8fqUC8SAO3BWwUqrXy5D/6yWgp1E8fo6lqhbtaeSdhWS2U4O1LWT6zBDqClHuW g/hOEeRK4Xyil1cLhTQnUS6PJm7s1E47FqnN2w1AAGkLet3YaOZXUGKQPhZvkTut9Hdv XPfQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=7tb+wzUV5U1RSBdPBKtaYJ2s1wql6jFCcDasWR8QT9s=; fh=zHC7Lf0pHdviX9xGwzsJQkpFZIPS3Ixj2opVoz8B/Fs=; b=IdZC3NeNjZCxMpp/wkN3V4uKB3bQbmqNooLMPO9P1TgFBLb0ZfDbbf/aGJRtQhRol1 hSKJ44oqHTCgRPOc5W/AqiBAS2QwlJdfVRr/k6k+ZcLE6XHgoUNqr6YPLsy5ojDlVXxc crfbU/9Hb0Wk6rcaWr6B3enTixwUnOyRlx02lesOc/P1oxJk2atFOetuWLckZ4nfNNoo G89ylX4KFmGJ43abEFBpoVzMJ7nOZ2sAETCUGfZkuPpco389hA7L7xHZD0er288P3SQS DyCouF6tOWp0byVK2YGljKorsCarHTKxtpxNFnqRbuQrBbSfPHCaWopirMVwqdyup2nI /6og==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1772397622; x=1773002422; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=7tb+wzUV5U1RSBdPBKtaYJ2s1wql6jFCcDasWR8QT9s=; b=N7sCZ5h09bqEU8uW/Pg8yvYnnKUg2YdShFHVP3WDRd7RIbd5NYo1uEDUEDu1b/Zjd6 IYJWn7SBOgCf1UrNNixB72i0W+dUJZeBT+nTH+GNruVQnFIAPG7zVwfVotOqYBbJuK2m PwIV7tU1nMgdDC5ohS95i/Ed67BpazCuKB+Zsn9vEU44kzlQDa/JkRvVkGkvmjuGcwBY GJqYn85Vq4TYMPg3zvHFWsRpXN6O9nyUvhZxmjygqYAPpSAQOkHFGlBjzAmtkK4M+G5j rnrhkgtHzJTQ17ITPFsSCBwV9cnsd6Ztf36x7R+mey5v2X9H3Blrlahy7mAViWsnI8B+ 9/cQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1772397622; x=1773002422; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=7tb+wzUV5U1RSBdPBKtaYJ2s1wql6jFCcDasWR8QT9s=; b=cT2hlA2JleO+mxMJsa7aNUNI+FMbS9r8eb1lhv/10uQb0+LQFidIrykIaEX+GJonJP d8++V0qQXnIRYjJly9X7X8WpeFd/E1BrQ8tUt0c6MRCKng0YDNZndbH1aC1y9Sp3OnvX EkfkUQUEfcVypYF+/0cOqH+TWE626onp116RUWrojT3UJpf2yc0gZxQNUPG/0WGxRoJv yPdjSGlI5gZ5uwinqCcmjR1u2Ctaf+krVxKHXhrxS6q8T0o8OKO93TMKDqMnDItriJku k725m1AkQ7fnxzz1Er3iZ1PzZj3hy9nPv26vLKNEFIxYUUc14d3u1ObZM7f623NXvDox chVQ== X-Forwarded-Encrypted: i=1; AJvYcCUKWLcyR54/IYehULOfOihjPHMRFBBsj+WzRbV8isHeeSFgPEOs+6TmStTr2pZXHyoGvTSawPc3OtBNdg2k@lists.postgresql.org X-Gm-Message-State: AOJu0YyZvVlDu5tG8DR9DTdK6U0PgfQE7bxzBYR+tv4w/aZ/wxXZcPmB DXe89OwRGJeha7tFVjktLw9rKPfxWLdX2bnPVv0EZIk+gHiH4EDatvazERufhyzlWvNH/i632Wx aNDtoantpzArofWsAMxHbz07FZVqHzxgwtjJXfdRcAg== X-Gm-Gg: ATEYQzxA7dZVhGfduiYx3XMtVMhaji896U5o2BiLl8xb9XXq624lIPZGdHHr23DNx/K oKkjkEe4Qwt4zPKykqIbZ4M+WM/FWkqIeB/WNGJrZowpFfAzQMKXjMGRA+If16y+3eoyMfaqTcT WuB9YQs2vXrNQbYtIXGLptBvms864/TLKLDNgItP/Y8mRxEmyumCdr9iNy08uyyntoc/FGoW65N GUA1USrY3UyS4UmCwgZ647aTIYptUCoo7SS829ZuQ0O7u0yiift1SrDspJOYUy0OVzJw/Aakg09 Kthu6lA= X-Received: by 2002:a05:600c:828c:b0:483:8f0f:36fe with SMTP id 5b1f17b1804b1-483c9bb0c9fmr171998155e9.1.1772397621936; Sun, 01 Mar 2026 12:40:21 -0800 (PST) MIME-Version: 1.0 References: <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep> In-Reply-To: From: Peter Geoghegan Date: Sun, 1 Mar 2026 15:39:55 -0500 X-Gm-Features: AaiRm533-tND3Qn6O9JXXS9EssUo1t1z9osHT1rTPwKdZ-vModuBLf3KkYxYFTI Message-ID: Subject: Re: index prefetching To: Andres Freund Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sat, Feb 28, 2026 at 5:08=E2=80=AFPM Andres Freund = wrote: > The indexscan stats changes, if we want them, are also pretty verbose and > pretty independent. Then I'll remove them from the next revision. We can re-review that decision at some point in the weeks ahead. > It'd be great if the move of indexonly visibility check handling from > nodeIndexonlyscan.c into the table AM could be split out, but it might be= too > churn-y. But getting it out of the way first would be nice. That seems harder, but I'll see what I can do. > This actually reminds me that it's not entirely clear to me what firstIte= m, > lastItem mean: > > /* > * Matching items state for this batch. Output by index AM for t= able AM. > * > * The items array is always ordered in index order (ie, increasi= ng > * indexoffset). When scanning backwards it is convenient for in= dex AMs > * to fill the array back-to-front, so we start at the last slot = and fill > * downwards. Hence they need a first-valid-entry and a last-val= id-entry > * counter. > */ > int firstItem; /* first valid in= dex in items[] */ > int lastItem; /* last valid ind= ex in items[] */ > > I think it should at very least state that -1 is used to indicate there a= re no > items. It's a bit awkward because by definition it isn't possible for a batch returned by amgetbatch to contain no items. amgetbatch always returns a batch for the next index page that has at least one matching item (or it returns NULL, indicating that the scan has completely run out of matching items). But I'll add a note about some index AMs needing these to be signed to represent no-matching-items internally. > I take it that if firstItem =3D=3D lastItem, there's exactly one item? That's right. > > Many months ago, this struct provided a way for index AMs to provide > > their own state in each batch. But I ultimately concluded that the > > overhead of that approach was a problem. > > With the approach that I hand waved at, namely to use one allocation for = both > the AM private state and the generic state, that should not be a real iss= ue? > There should be maybe 2-3 cheap instructions to translate? I've expanded= on > that further down. I'll work towards that, then. > > The names are based on the existing, similar moreLeft and moreRight > > fields used by both nbtree and hash. I'm not attached to any of these > > names. Feel free to suggest alternatives. > > knownEnd{Forward, Backward} or isEnd{Forward,Bckward} seems like it'd do = the > trick? WFM. > /* > * If we are doing an index-only scan, these are the tuple storag= e > * workspaces for the matching tuples (tuples referenced by items= []). Each > * is of size BLCKSZ, so it can hold as much as a full page's wor= th of > * tuples. > */ > char *currTuples; /* tuple storage for items[] */ > BatchMatchingItem items[FLEXIBLE_ARRAY_MEMBER]; /* matching items= */ > > I previously read the "these are" to reference currTuples and items, but = I > think the plural is actually just an anachronism, from when currTuples wa= s > followed by markTuples? > > It says that it's the tuple storage for items[], but I think that may jus= t be > trying to say that it's the tuple storage corresponding the items in ->it= ems[]. You're right. Will fix. > I agree it shouldn't be exclusively managed by heapam and that it needs t= o be > stored together with the batch. But it could be a size determined at the= time > of index_beginscan() as part of table_index_fetch_begin(heapRelation), e.= g. a > field in the returned IndexFetchTableData. > > I'm basically thinking that each batch would be stored like this: > > [table am specific data] > [index am specific data] > struct IndexScanBatchData > [variable amount of items] > > With the offset from the *IndexScanBatchData to the index/table data stor= ed > inside IndexScanDesc. > > Accessing the heap specific portion of an IndexScanBatch could then be > something like > ((char*) batch - iscan->batch_table_offset) > (wrapped in a helper, of course) > > The index AM conversion could be made even cheaper with the above, becaus= e for > the index AM the offset would typically be known at compile time > (i.e. something like MAXALIGN(sizeof(BTIndexBatchData))). That seems reasonable. I'll at least work towards that for the next revision (might make sense to produce a revision without those changes to fix bit rot from committing earlier patches). > Hm. It looks like the only remaining users (_bt_check_unique(), > unique_key_recheck()) don't care about actually getting the tuple, they j= ust > want to check if it exists. > > For those the API is actually unnecessarily expensive, we don't need to r= eturn > the tuple, therefore don't need to create a slot. So once we get the mai= n > stuff in, I think we should evolve the API to be a bit more catered > specifically to this usecase. I think it'll be a lot less confusing that= way. Noted. > To be clear, I was just suggesting to elide repeated VM lookups for index > entries that all point to the same table page. Not to actually do any bat= ching > or such. It's just four more lines or so. I prototyped this. You were right; it's a no-brainer. It'll definitely be in the next version. > > I think that it makes perfect sense that the AM table controls this. > > After all, holding on to buffer pins exists purely for heapam's > > benefit -- it has nothing to do with index AM considerations. > > I think it's right that the table AM controls *when* the pins are release= d, > what doesn't seem right that it knows *which* specific buffer being relea= sed > is the right thing and that's done. The fact is that the "Index Locking Considerations" section from "Chapter 63. Index Access Method Interface Definition" in the docs describes things that only make sense when using heapam, or a table AM very much like it. Any design derived from the existing one will likely retain some of that. For example, the entire concept of ambuldelete requiring a cleanup lock is unrelated to index AM considerations. If (say) nbtree only MVCC snapshot plain index scans and bitmap index scans, then there'd be no need for btbulkdelete to acquire cleanup locks, and no need for any scan to ever hold on to a buffer pin. To me it seems natural to make such buffer pins (those that specifically exist to form an interlock against unsafe concurrent TID recycling by VACUUM) the responsibility of the table AM -- that's the only place where it actually matters. That's the main reason the AM table releases the batch's buffer pin in the current design of index prefetching (though another important factor is nestloop anti-joins with inner index-only scans, as I detail below). Here are some points I imagine we agree on already: * It is definitely a good idea to drop index page buffer pins during or right after amgetbatch returns. Doing so avoids unintended interactions with the read stream. * As a practical matter it makes sense to be maximally strict, by requiring (at least during prefetching) that the index AM immediately releases any buffer pins it acquires (acquires from an `amgetbatch` call originating from the read stream callback). That way it simply isn't possible for the read stream to be affected by these buffer pins in the first place. IOW while a paranoid policy might not be strictly necessary, it ends up being simpler that way. * It is necessary to avoid bulk-loading a batch's visibility information immediately (before prefetching begins) to prevent new regressions in things like nestloop anti-joins with an inner index-only scan. My microbenchmarking shows this clearly. Specifically for index-only scans, the only way to drop a batch/index page pin at the earliest possible opportunity like this is to bulk save visibility information for the batch's items before dropping the index page pin. This avoids concurrent TID recycling races with VACUUM that could otherwise lead to wrong answers. At the same time, index-only scans get the usual strong guarantee about not holding on to buffer pins that could adversely affect the read stream in unpredictable ways (just like plain index scans). Note that I'm only discussing pins held by a batch; I'm not discussing other pin types that certain index AMs (not including nbtree) must retain for their own reasons. To me, the "batch pin vs index AM owned pin" distinction is important; I try to be clear about the pin type involved when discussing these issues. Here's where you might disagree with me, or at least where I need greater clarity to actually understand your position: I think it follows from what I said that some piece of code (either in the index AM or the table AM) must be responsible for saving visibility info for a batch using the visibility map, and only then releasing the associated buffer pin. I chose to do this in the table AM, partly because heapam needs the flexibility to *not* release the batch/index buffer pin right away. ISTM that heapam needs the authority to choose whether or not to bulk load all visibility info from each batch immediately. This is necessary for the first batch returned during an index-only scan with (say) a LIMIT 5, where loading all of the visibility information immediately would significantly hurt performance. How should I resolve the tension among all these goals? Is your concern that heapam knows about buffer pins specifically, as opposed to an abstract interlock concept? > What if the index AM is one outside of > postgres' buffer pool? Then it also cannot support WAL-logging, I think. Where does that leave the LSN trick? > That helps with multiple pins for the same buffer, but not having to pin > multiple *different* buffers. Sure, the index AM could just hold onto th= e > additional buffers until the scan ends or such, but that's not really an > option. Why is it not an option? Is this purely because of the potential to disrupt the read stream's management of the backend's buffer pin limit? The problem with that position is that it just isn't compatible with how some non-nbtree index AMs work and have always worked. Most notably, the hash index AM feels entitled to hold on to up to 2 extra buffer pins for its own reasons. If hash cannot continue doing that, and has no reasonable alternative, doesn't that mean that hash is fundamentally incompatible with prefetching? No matter how we design the abstractions? I believe hash indexes aren't generally important, so I certainly wouldn't mind dropping hash index support for Postgres 19. Having such strict rules about index AMs holding onto buffer pins won't affect nbtree at all, but a maximally strict rule about it seems pretty severe to me. Can't we just invent a way for ambeginscan to advise the table AM that it might need (say) as many as 2 extra buffer pins? My assumption up until now (which the docs do mention) is that it's okay for index AMs to hold on to a small number of buffer pins for their own purposes. Admittedly my understanding of this isn't rigorous -- I guess I figured that some small amount of that had to be permitted, since a number of existing index AMs fundamentally expect it. I meant to ask you to help me put that on a more rigorous footing. > > Again, I didn't see much success when I experimented with this > > yesterday. But it could be a question of needing the right query to > > see any benefit. > > I'd say that even if it were not to improve perf (pretty sure I saw some > though), not introducing fields that have more complicated lifetimes is j= ust > good for human understanding too. I agree that creating a new IndexScanDesc.xs_visible field for transient state like this is ugly. I'll incorporate changes along these lines. Thanks again --=20 Peter Geoghegan