Re: index prefetching

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: index prefetching
4+ messages / 3 participants
[nested] [flat]

* Re: index prefetching
@ 2024-02-14 21:45  Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Melanie Plageman @ 2024-02-14 21:45 UTC (permalink / raw)
  To: Peter Geoghegan <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan <[email protected]> wrote:
>
> On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
> <[email protected]> wrote:
> > > Another thing that argues against doing this is that we might not need
> > > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > > We could end up scanning a whole extra leaf page (including all of its
> > > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > > (that's not what happens right now, but it isn't really needed at all
> > > right now).
> > >
> >
> > I'm not quite sure I understand what is "this" that you argue against.
> > Are you saying we should not separate the two scans? If yes, is there a
> > better way to do this?
>
> What I'm concerned about is the difficulty and complexity of any
> design that requires revising "63.4. Index Locking Considerations",
> since that's pretty subtle stuff. In particular, if prefetching
> "de-synchronizes" (to use your term) the index leaf page level scan
> and the heap page scan, then we'll probably have to totally revise the
> basic API.

So, a pin on the index leaf page is sufficient to keep line pointers
from being reused? If we stick to prefetching heap blocks referred to
by index tuples in a single index leaf page, and we keep that page
pinned, will we still have a problem?

> > The LIMIT problem is not very clear to me either. Yes, if we get close
> > to the end of the leaf page, we may need to visit the next leaf page.
> > But that's kinda the whole point of prefetching - reading stuff ahead,
> > and reading too far ahead is an inherent risk. Isn't that a problem we
> > have even without LIMIT? The prefetch distance ramp up is meant to limit
> > the impact.
>
> Right now, the index AM doesn't know anything about LIMIT at all. That
> doesn't matter, since the index AM can only read/scan one full leaf
> page before returning control back to the executor proper. The
> executor proper can just shut down the whole index scan upon finding
> that we've already returned N tuples for a LIMIT N.
>
> We don't do prefetching right now, but we also don't risk reading a
> leaf page that'll just never be needed. Those two things are in
> tension, but I don't think that that's quite the same thing as the
> usual standard prefetching tension/problem. Here there is uncertainty
> about whether what we're prefetching will *ever* be required -- not
> uncertainty about when exactly it'll be required. (Perhaps this
> distinction doesn't mean much to you. I'm just telling you how I think
> about it, in case it helps move the discussion forward.)

I don't think that the LIMIT problem is too different for index scans
than heap scans. We will need some advice from planner to come down to
prevent over-eager prefetching in all cases.

> Another factor that complicates things here is mark/restore
> processing. The design for that has the idea of processing one page at
> a time baked-in. Kinda like with the kill_prior_tuple issue.

Yes, I mentioned this in my earlier email. I think we can resolve
mark/restore by resetting the prefetch and TID queues and restoring
the last used heap TID in the index scan descriptor.

> It's certainly possible that you could figure out various workarounds
> for each of these issues (plus the kill_prior_tuple issue) with a
> prefetching design that "de-synchronizes" the index access and the
> heap access. But it might well be better to extend the existing design
> in a way that just avoids all these problems in the first place. Maybe
> "de-synchronization" really can pay for itself (because the benefits
> will outweigh these costs), but if you go that way then I'd really
> prefer it that way.

Forcing each index access to be synchronous and interleaved with each
table access seems like an unprincipled design constraint. While it is
true that we rely on that in our current implementation (when using
non-MVCC snapshots), it doesn't seem like a principle inherent to
accessing indexes and tables.

> > > I think that it makes sense to put the index AM in control here --
> > > that almost follows from what I said about the index AM API. The index
> > > AM already needs to be in control, in about the same way, to deal with
> > > kill_prior_tuple (plus it helps with the  LIMIT issue I described).
> > >
> >
> > In control how? What would be the control flow - what part would be
> > managed by the index AM?
>
> ISTM that prefetching for an index scan is about the index scan
> itself, first and foremost. The heap accesses are usually the dominant
> cost, of course, but sometimes the index leaf page accesses really do
> make up a significant fraction of the overall cost of the index scan.
> Especially with an expensive index qual. So if you just assume that
> the TIDs returned by the index scan are the only thing that matters,
> you might have a model that's basically correct on average, but is
> occasionally very wrong. That's one reason for "putting the index AM
> in control".

I don't think the fact that it would also be valuable to do index
prefetching is a reason not to do prefetching of heap pages. And,
while it is true that were you to add index interior or leaf page
prefetching, it would impact the heap prefetching, at the end of the
day, the table AM needs some TID or TID-equivalents that whose blocks
it can go fetch. The index AM has to produce something that the table
AM will consume. So, if we add prefetching of heap pages and get the
table AM input right, it shouldn't require a full redesign to add
index page prefetching later.

You could argue that my suggestion to have the index AM manage and
populate a queue of TIDs for use by the table AM puts the index AM in
control. I do think having so many members of the IndexScanDescriptor
which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
interplay between fetching an index tuple and fetching a heap tuple is
confusing and error prone.

> As I said back in June, we should probably be marrying information
> from the index scan with information from the heap. This is something
> that is arguably a modularity violation. But it might just be that you
> really do need to take information from both places to consistently
> make the right trade-off.

Agreed that we are going to need to mix information from both places.

> If you take a look at _bt_killitems(), you'll see that it actually has
> two fairly different strategies for avoiding TID recycling race
> condition issues, applied in each of two different cases:
>
> 1. Cases where we really have held onto a buffer pin, per the index AM
> API -- the "inde AM orthodox" approach.  (The aforementioned issue
> with unlogged indexes exists because with an unlogged index we must
> use approach 1, per the nbtree README section [1]).
>
> 2. Cases where we drop the pin as an optimization (also per [1]), and
> now have to detect the possibility of concurrent modifications by
> VACUUM (that could have led to concurrent TID recycling). We
> conservatively do nothing (don't mark any index tuples LP_DEAD),
> unless the LSN is exactly the same as it was back when the page was
> scanned/read by _bt_readpage().

Re 2: so the LSN could have been changed by some other process (i.e.
not vacuum), so how often in practice is the LSN actually the same as
when the page was scanned/read? Do you think we would catch a
meaningful number of kill prior tuple opportunities if we used an LSN
tracking method like this? Something that let us drop the pin on the
page would obviously be better.

- Melanie






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: index prefetching
@ 2024-02-14 23:06  Peter Geoghegan <[email protected]>
  parent: Melanie Plageman <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Peter Geoghegan @ 2024-02-14 23:06 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
<[email protected]> wrote:
> So, a pin on the index leaf page is sufficient to keep line pointers
> from being reused? If we stick to prefetching heap blocks referred to
> by index tuples in a single index leaf page, and we keep that page
> pinned, will we still have a problem?

That's certainly one way of dealing with it. Obviously, there are
questions about how you do that in a way that consistently avoids
creating new problems.

> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from planner to come down to
> prevent over-eager prefetching in all cases.

I think that I'd rather use information at execution time instead, if
at all possible (perhaps in addition to a hint given by the planner).
But it seems a bit premature to discuss this problem now, except to
say that it might indeed be a problem.

> > It's certainly possible that you could figure out various workarounds
> > for each of these issues (plus the kill_prior_tuple issue) with a
> > prefetching design that "de-synchronizes" the index access and the
> > heap access. But it might well be better to extend the existing design
> > in a way that just avoids all these problems in the first place. Maybe
> > "de-synchronization" really can pay for itself (because the benefits
> > will outweigh these costs), but if you go that way then I'd really
> > prefer it that way.
>
> Forcing each index access to be synchronous and interleaved with each
> table access seems like an unprincipled design constraint. While it is
> true that we rely on that in our current implementation (when using
> non-MVCC snapshots), it doesn't seem like a principle inherent to
> accessing indexes and tables.

There is nothing sacred about the way plain index scans work right now
-- especially the part about buffer pins as an interlock.

If the pin thing really was sacred, then we could never have allowed
nbtree to selectively opt-out in cases where it's possible to provide
an equivalent correctness guarantee without holding onto buffer pins,
which, as I went into, is how it actually works in nbtree's
_bt_killitems() today (see commit 2ed5b87f96 for full details). And so
in principle I have no problem with the idea of revising the basic
definition of plain index scans -- especially if it's to make the
definition more abstract, without fundamentally changing it (e.g., to
make it no longer reference buffer pins, making life easier for
prefetching, while at the same time still implying the same underlying
guarantees sufficient to allow nbtree to mostly work the same way as
today).

All I'm really saying is:

1. The sort of tricks that we can do in nbtree's _bt_killitems() are
quite useful, and ought to be preserved in something like their
current form, even when prefetching is in use.

This seems to push things in the direction of centralizing control of
the process in index scan code. For example, it has to understand that
_bt_killitems() will be called at some regular cadence that is well
defined and sensible from an index point of view.

2. Are you sure that the leaf-page-at-a-time thing is such a huge
hindrance to effective prefetching?

I suppose that it might be much more important than I imagine it is
right now, but it'd be nice to have something a bit more concrete to
go on.

3. Even if it is somewhat important, do you really need to get that
part working in v1?

Tomas' original prototype worked with the leaf-page-at-a-time thing,
and that still seemed like a big improvement to me. While being less
invasive, in effect. If we can agree that something like that
represents a useful step in the right direction (not an evolutionary
dead end), then we can make good incremental progress within a single
release.

> I don't think the fact that it would also be valuable to do index
> prefetching is a reason not to do prefetching of heap pages. And,
> while it is true that were you to add index interior or leaf page
> prefetching, it would impact the heap prefetching, at the end of the
> day, the table AM needs some TID or TID-equivalents that whose blocks
> it can go fetch.

I wasn't really thinking of index page prefetching at all. Just the
cost of applying index quals to read leaf pages that might never
actually need to be read, due to the presence of a LIMIT. That is kind
of a new problem created by eagerly reading (without actually
prefetching) leaf pages.

> You could argue that my suggestion to have the index AM manage and
> populate a queue of TIDs for use by the table AM puts the index AM in
> control. I do think having so many members of the IndexScanDescriptor
> which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
> interplay between fetching an index tuple and fetching a heap tuple is
> confusing and error prone.

But that's kinda how amgettuple is supposed to work -- cursors need it
to work that way. Having some kind of general notion of scan order is
also important to avoid returning duplicate TIDs to the scan. In
contrast, GIN heavily relies on the fact that it only supports bitmap
scans -- that allows it to not have to reason about returning
duplicate TIDs (when dealing with a concurrently merged pending list,
and other stuff like that).

And so nbtree (and basically every other index AM that supports plain
index scans) kinda pretends to process a single tuple at a time, in
some fixed order that's convenient for the scan to work with (that's
how the executor thinks of things). In reality these index AMs
actually process batches consisting of a single leaf page worth of
tuples.

I don't see how the IndexScanDescData side of things makes life any
harder for this patch -- ISTM that you'll always need to pretend to
return one tuple at a time from the index scan, regardless of what
happens under the hood, with pins and whatnot. The page-at-a-time
thing is more or less an implementation detail that's private to index
AMs (albeit in a way that follows certain standard conventions across
index AMs) -- it's a leaky abstraction only due to the interactions
with VACUUM/TID recycle safety.

> Re 2: so the LSN could have been changed by some other process (i.e.
> not vacuum), so how often in practice is the LSN actually the same as
> when the page was scanned/read?

It seems very hard to make generalizations about that sort of thing.

It doesn't help that we now have batching logic inside
_bt_simpledel_pass() that will make up for the problem of not setting
as many LP_DEAD bits as we could in many important cases. (I recall
that that was one factor that allowed the bug that Andres fixed in
commit 90c885cd to go undetected for months. I recall discussing the
issue with Andres around that time.)

> Do you think we would catch a
> meaningful number of kill prior tuple opportunities if we used an LSN
> tracking method like this? Something that let us drop the pin on the
> page would obviously be better.

Quite possibly, yes. But it's hard to say for sure without far more
detailed analysis. Plus you have problems with things like unlogged
indexes not having an LSN to use as a canary condition, which makes it
a bit messy (it's already kind of weird that we treat unlogged indexes
differently here IMV).

-- 
Peter Geoghegan

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: index prefetching
@ 2024-02-15 14:36  Tomas Vondra <[email protected]>
  parent: Peter Geoghegan <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Tomas Vondra @ 2024-02-15 14:36 UTC (permalink / raw)
  To: Peter Geoghegan <[email protected]>; Melanie Plageman <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On 2/15/24 00:06, Peter Geoghegan wrote:
> On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
> <[email protected]> wrote:
>
>> ...
> 
> 2. Are you sure that the leaf-page-at-a-time thing is such a huge
> hindrance to effective prefetching?
> 
> I suppose that it might be much more important than I imagine it is
> right now, but it'd be nice to have something a bit more concrete to
> go on.
> 

This probably depends on which corner cases are considered important.

The page-at-a-time approach essentially means index items at the
beginning of the page won't get prefetched (or vice versa, prefetch
distance drops to 0 when we get to end of index page).

That may be acceptable, considering we can usually fit 200+ index items
on a single page. Even then it limits what effective_io_concurrency
values are sensible, but in my experience quickly diminish past ~32.

> 3. Even if it is somewhat important, do you really need to get that
> part working in v1?
> 
> Tomas' original prototype worked with the leaf-page-at-a-time thing,
> and that still seemed like a big improvement to me. While being less
> invasive, in effect. If we can agree that something like that
> represents a useful step in the right direction (not an evolutionary
> dead end), then we can make good incremental progress within a single
> release.
> 

It certainly was a great improvement, no doubt about that. I dislike the
restriction, but that's partially for aesthetic reasons - it just seems
it'd be nice to not have this.

That being said, I'd be OK with having this restriction if it makes v1
feasible. For me, the big question is whether it'd mean we're stuck with
this restriction forever, or whether there's a viable way to improve
this in v2.

And I don't have answer to that :-( I got completely lost in the ongoing
discussion about the locking implications (which I happily ignored while
working on the PoC patch), layering tensions and questions which part
should be "in control".

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: index prefetching
@ 2024-02-15 16:42  Peter Geoghegan <[email protected]>
  parent: Tomas Vondra <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Peter Geoghegan @ 2024-02-15 16:42 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Melanie Plageman <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra
<[email protected]> wrote:
> On 2/15/24 00:06, Peter Geoghegan wrote:
> > I suppose that it might be much more important than I imagine it is
> > right now, but it'd be nice to have something a bit more concrete to
> > go on.
> >
>
> This probably depends on which corner cases are considered important.
>
> The page-at-a-time approach essentially means index items at the
> beginning of the page won't get prefetched (or vice versa, prefetch
> distance drops to 0 when we get to end of index page).

I don't think that's true. At least not for nbtree scans.

As I went into last year, you'd get the benefit of the work I've done
on "boundary cases" (most recently in commit c9c0589f from just a
couple of months back), which helps us get the most out of suffix
truncation. This maximizes the chances of only having to scan a single
index leaf page in many important cases. So I can see no reason why
index items at the beginning of the page are at any particular
disadvantage (compared to those from the middle or the end of the
page).

Where you might have a problem is cases where it's just inherently
necessary to visit more than a single leaf page, despite the best
efforts of the nbtsplitloc.c logic -- cases where the scan just
inherently needs to return tuples that "straddle the boundary between
two neighboring pages". That isn't a particularly natural restriction,
but it's also not obvious that it's all that much of a disadvantage in
practice.

> It certainly was a great improvement, no doubt about that. I dislike the
> restriction, but that's partially for aesthetic reasons - it just seems
> it'd be nice to not have this.
>
> That being said, I'd be OK with having this restriction if it makes v1
> feasible. For me, the big question is whether it'd mean we're stuck with
> this restriction forever, or whether there's a viable way to improve
> this in v2.

I think that there is no question that this will need to not
completely disable kill_prior_tuple -- I'd be surprised if one single
person disagreed with me on this point. There is also a more nuanced
way of describing this same restriction, but we don't necessarily need
to agree on what exactly that is right now.

> And I don't have answer to that :-( I got completely lost in the ongoing
> discussion about the locking implications (which I happily ignored while
> working on the PoC patch), layering tensions and questions which part
> should be "in control".

Honestly, I always thought that it made sense to do things on the
index AM side. When you went the other way I was surprised. Perhaps I
should have said more about that, sooner, but I'd already said quite a
bit at that point, so...

Anyway, I think that it's pretty clear that "naive desynchronization"
is just not acceptable, because that'll disable kill_prior_tuple
altogether. So you're going to have to do this in a way that more or
less preserves something like the current kill_prior_tuple behavior.
It's going to have some downsides, but those can be managed. They can
be managed from within the index AM itself, a bit like the
_bt_killitems() no-pin stuff does things already.

Obviously this interpretation suggests that doing things at the index
AM level is indeed the right way to go, layering-wise. Does it make
sense to you, though?

-- 
Peter Geoghegan

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2024-02-15 16:42 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-02-14 21:45 Re: index prefetching Melanie Plageman <[email protected]>
2024-02-14 23:06 ` Peter Geoghegan <[email protected]>
2024-02-15 14:36   ` Tomas Vondra <[email protected]>
2024-02-15 16:42     ` Peter Geoghegan <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox