MIME-Version: 1.0
References: <8ec36f51-b863-60e3-20e2-b9c981c5ce5e@enterprisedb.com>
 <CAAKRu_ZPDhNwwFxQwS8NdeTFkycM1c=tNLKdU0J-M6KxCjdEmQ@mail.gmail.com>
 <56176b8d-956c-487e-ab09-310db4581c07@enterprisedb.com> <CAAKRu_ad8cywU4_X+5e4A9Wy_PZmrCx2aoLpdHPBRvb9inGDgQ@mail.gmail.com>
 <4867452a-b853-4813-a6da-9bb06a336f8b@enterprisedb.com> <CAAKRu_Y=go4u99udRaj3k0vuF6EAfWqM86BONhxB4MV9X4FRpQ@mail.gmail.com>
 <CAAKRu_ZPuaO5XwXfM7wowf4sSmPSJ2LT9+zfmD5=LQ=WhV_j=Q@mail.gmail.com>
 <4f5f16ef-df1e-4e09-9b3f-2e0961ab5117@enterprisedb.com> <CAH2-WznsJqDgr_0yUwApgYXi3cRZQbimFkiYRqqXhpMcw4s8ZQ@mail.gmail.com>
 <c332fe83-c187-48ea-8520-458d4738f3a1@enterprisedb.com> <CAH2-WznBuxhvsEgX3mYDjxKhQk9GFdF46vMfE2ugU6SUekHp_A@mail.gmail.com>
 <CAAKRu_ZKp1BCT+V324jENxKTfsetxJwxh309rJGWxebSggPisw@mail.gmail.com>
 <CAH2-Wzkrej9cXjERrA5p8pgD9QfR0LZwCCcgPPu6wiRgFpYVQQ@mail.gmail.com>
 <4736207c-8ea6-40cb-ac52-41af00b58bbc@enterprisedb.com> <CAH2-Wz=gMnsLQph1KM_xxTu-ZFRFqbDbK9tFBPTKcfXB1Z8=og@mail.gmail.com>
 <f0fda1a3-23ef-4cbe-b6d5-48f0bd9432cb@enterprisedb.com>
In-Reply-To: <f0fda1a3-23ef-4cbe-b6d5-48f0bd9432cb@enterprisedb.com>
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 15 Feb 2024 12:53:10 -0500
Message-ID: <CAH2-WzkpNN1+sovB8G=5dVwYW25=J6Qj4V9L7DzD26NTVQWM2w@mail.gmail.com>
Subject: Re: index prefetching
To: Tomas Vondra <tomas.vondra@enterprisedb.com>
Cc: Melanie Plageman <melanieplageman@gmail.com>, Robert Haas <robertmhaas@gmail.com>, 
	Andres Freund <andres@anarazel.de>, 
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>, Georgios <gkokolatos@protonmail.com>, 
	Thomas Munro <thomas.munro@gmail.com>, Konstantin Knizhnik <knizhnik@garret.ru>, 
	Dilip Kumar <dilipbalaut@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAH2-WzkpNN1%2BsovB8G%3D5dVwYW25%3DJ6Qj4V9L7DzD26NTVQWM2w%40mail.gmail.com>
Precedence: bulk

On Thu, Feb 15, 2024 at 12:26=E2=80=AFPM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I may be missing something, but it seems fairly self-evident to me an
> entry at the beginning of an index page won't get prefetched (assuming
> the page-at-a-time thing).

Sure, if the first item on the page is also the first item that we
need the scan to return (having just descended the tree), then it
won't get prefetched under a scheme that sticks with the current
page-at-a-time behavior (at least in v1). Just like when the first
item that we need the scan to return is from the middle of the page,
or more towards the end of the page.

It is of course also true that we can't prefetch the next page's
first item until we actually visit the next page -- clearly that's
suboptimal. Just like we can't prefetch any other, later tuples from
the next page (until such time as we have determined for sure that
there really will be a next page, and have called _bt_readpage for
that next page.)

This is why I don't think that the tuples with lower page offset
numbers are in any way significant here.  The significant part is
whether or not you'll actually need to visit more than one leaf page
in the first place (plus the penalty from not being able to reorder
the work across page boundaries in your initial v1 of prefetching).

> If I understand your point about boundary cases / suffix truncation,
> that helps us by (a) picking the split in a way to minimize a single key
> spanning multiple pages, if possible and (b) increasing the number of
> entries that fit onto a single index page.

More like it makes the boundaries between leaf pages (i.e. high keys)
align with the "natural boundaries of the key space". Simple point
queries should practically never require more than a single leaf page
access as a result. Even somewhat complicated index scans that are
reasonably selective (think tens to low hundreds of matches) don't
tend to need to read more than a single leaf page match, at least with
equality type scan keys for the index qual.

> That's certainly true / helpful, and it makes the "first entry" issue
> much less common. But the issue is still there. Of course, this says
> nothing about the importance of the issue - the impact may easily be so
> small it's not worth worrying about.

Right. And I want to be clear: I'm really *not* sure how much it
matters. I just doubt that it's worth worrying about in v1 -- time
grows short. Although I agree that we should commit a v1 that leaves
the door open to improving matters in this area in v2.

> One case I've been thinking about is sorting using index, where we often
> read large part of the index.

That definitely seems like a case where reordering
work/desynchronization of the heap and index scans might be relatively
important.

> > I think that there is no question that this will need to not
> > completely disable kill_prior_tuple -- I'd be surprised if one single
> > person disagreed with me on this point. There is also a more nuanced
> > way of describing this same restriction, but we don't necessarily need
> > to agree on what exactly that is right now.
> >
>
> Even for the page-at-a-time approach? Or are you talking about the v2?

I meant that the current kill_prior_tuple behavior isn't sacred, and
can be revised in v2, for the benefit of lifting the restriction on
prefetching. But that's going to involve a trade-off of some kind. And
not a particularly simple one.

> Yeah. The basic idea was that by moving this above index AM it will work
> for all indexes automatically - but given the current discussion about
> kill_prior_tuple, locking etc. I'm not sure that's really feasible.
>
> The index AM clearly needs to have more control over this.

Cool. I think that that makes the layering question a lot clearer, then.


--
Peter Geoghegan