public inbox for [email protected]help / color / mirror / Atom feed
Re: index prefetching 4+ messages / 3 participants [nested] [flat]
* Re: index prefetching @ 2024-02-14 21:45 Melanie Plageman <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Melanie Plageman @ 2024-02-14 21:45 UTC (permalink / raw) To: Peter Geoghegan <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]> On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan <[email protected]> wrote: > > On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra > <[email protected]> wrote: > > > Another thing that argues against doing this is that we might not need > > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > > We could end up scanning a whole extra leaf page (including all of its > > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > > (that's not what happens right now, but it isn't really needed at all > > > right now). > > > > > > > I'm not quite sure I understand what is "this" that you argue against. > > Are you saying we should not separate the two scans? If yes, is there a > > better way to do this? > > What I'm concerned about is the difficulty and complexity of any > design that requires revising "63.4. Index Locking Considerations", > since that's pretty subtle stuff. In particular, if prefetching > "de-synchronizes" (to use your term) the index leaf page level scan > and the heap page scan, then we'll probably have to totally revise the > basic API. So, a pin on the index leaf page is sufficient to keep line pointers from being reused? If we stick to prefetching heap blocks referred to by index tuples in a single index leaf page, and we keep that page pinned, will we still have a problem? > > The LIMIT problem is not very clear to me either. Yes, if we get close > > to the end of the leaf page, we may need to visit the next leaf page. > > But that's kinda the whole point of prefetching - reading stuff ahead, > > and reading too far ahead is an inherent risk. Isn't that a problem we > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > the impact. > > Right now, the index AM doesn't know anything about LIMIT at all. That > doesn't matter, since the index AM can only read/scan one full leaf > page before returning control back to the executor proper. The > executor proper can just shut down the whole index scan upon finding > that we've already returned N tuples for a LIMIT N. > > We don't do prefetching right now, but we also don't risk reading a > leaf page that'll just never be needed. Those two things are in > tension, but I don't think that that's quite the same thing as the > usual standard prefetching tension/problem. Here there is uncertainty > about whether what we're prefetching will *ever* be required -- not > uncertainty about when exactly it'll be required. (Perhaps this > distinction doesn't mean much to you. I'm just telling you how I think > about it, in case it helps move the discussion forward.) I don't think that the LIMIT problem is too different for index scans than heap scans. We will need some advice from planner to come down to prevent over-eager prefetching in all cases. > Another factor that complicates things here is mark/restore > processing. The design for that has the idea of processing one page at > a time baked-in. Kinda like with the kill_prior_tuple issue. Yes, I mentioned this in my earlier email. I think we can resolve mark/restore by resetting the prefetch and TID queues and restoring the last used heap TID in the index scan descriptor. > It's certainly possible that you could figure out various workarounds > for each of these issues (plus the kill_prior_tuple issue) with a > prefetching design that "de-synchronizes" the index access and the > heap access. But it might well be better to extend the existing design > in a way that just avoids all these problems in the first place. Maybe > "de-synchronization" really can pay for itself (because the benefits > will outweigh these costs), but if you go that way then I'd really > prefer it that way. Forcing each index access to be synchronous and interleaved with each table access seems like an unprincipled design constraint. While it is true that we rely on that in our current implementation (when using non-MVCC snapshots), it doesn't seem like a principle inherent to accessing indexes and tables. > > > I think that it makes sense to put the index AM in control here -- > > > that almost follows from what I said about the index AM API. The index > > > AM already needs to be in control, in about the same way, to deal with > > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > > > > In control how? What would be the control flow - what part would be > > managed by the index AM? > > ISTM that prefetching for an index scan is about the index scan > itself, first and foremost. The heap accesses are usually the dominant > cost, of course, but sometimes the index leaf page accesses really do > make up a significant fraction of the overall cost of the index scan. > Especially with an expensive index qual. So if you just assume that > the TIDs returned by the index scan are the only thing that matters, > you might have a model that's basically correct on average, but is > occasionally very wrong. That's one reason for "putting the index AM > in control". I don't think the fact that it would also be valuable to do index prefetching is a reason not to do prefetching of heap pages. And, while it is true that were you to add index interior or leaf page prefetching, it would impact the heap prefetching, at the end of the day, the table AM needs some TID or TID-equivalents that whose blocks it can go fetch. The index AM has to produce something that the table AM will consume. So, if we add prefetching of heap pages and get the table AM input right, it shouldn't require a full redesign to add index page prefetching later. You could argue that my suggestion to have the index AM manage and populate a queue of TIDs for use by the table AM puts the index AM in control. I do think having so many members of the IndexScanDescriptor which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous interplay between fetching an index tuple and fetching a heap tuple is confusing and error prone. > As I said back in June, we should probably be marrying information > from the index scan with information from the heap. This is something > that is arguably a modularity violation. But it might just be that you > really do need to take information from both places to consistently > make the right trade-off. Agreed that we are going to need to mix information from both places. > If you take a look at _bt_killitems(), you'll see that it actually has > two fairly different strategies for avoiding TID recycling race > condition issues, applied in each of two different cases: > > 1. Cases where we really have held onto a buffer pin, per the index AM > API -- the "inde AM orthodox" approach. (The aforementioned issue > with unlogged indexes exists because with an unlogged index we must > use approach 1, per the nbtree README section [1]). > > 2. Cases where we drop the pin as an optimization (also per [1]), and > now have to detect the possibility of concurrent modifications by > VACUUM (that could have led to concurrent TID recycling). We > conservatively do nothing (don't mark any index tuples LP_DEAD), > unless the LSN is exactly the same as it was back when the page was > scanned/read by _bt_readpage(). Re 2: so the LSN could have been changed by some other process (i.e. not vacuum), so how often in practice is the LSN actually the same as when the page was scanned/read? Do you think we would catch a meaningful number of kill prior tuple opportunities if we used an LSN tracking method like this? Something that let us drop the pin on the page would obviously be better. - Melanie ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: index prefetching @ 2024-02-14 23:06 Peter Geoghegan <[email protected]> parent: Melanie Plageman <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Peter Geoghegan @ 2024-02-14 23:06 UTC (permalink / raw) To: Melanie Plageman <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]> On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman <[email protected]> wrote: > So, a pin on the index leaf page is sufficient to keep line pointers > from being reused? If we stick to prefetching heap blocks referred to > by index tuples in a single index leaf page, and we keep that page > pinned, will we still have a problem? That's certainly one way of dealing with it. Obviously, there are questions about how you do that in a way that consistently avoids creating new problems. > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I think that I'd rather use information at execution time instead, if at all possible (perhaps in addition to a hint given by the planner). But it seems a bit premature to discuss this problem now, except to say that it might indeed be a problem. > > It's certainly possible that you could figure out various workarounds > > for each of these issues (plus the kill_prior_tuple issue) with a > > prefetching design that "de-synchronizes" the index access and the > > heap access. But it might well be better to extend the existing design > > in a way that just avoids all these problems in the first place. Maybe > > "de-synchronization" really can pay for itself (because the benefits > > will outweigh these costs), but if you go that way then I'd really > > prefer it that way. > > Forcing each index access to be synchronous and interleaved with each > table access seems like an unprincipled design constraint. While it is > true that we rely on that in our current implementation (when using > non-MVCC snapshots), it doesn't seem like a principle inherent to > accessing indexes and tables. There is nothing sacred about the way plain index scans work right now -- especially the part about buffer pins as an interlock. If the pin thing really was sacred, then we could never have allowed nbtree to selectively opt-out in cases where it's possible to provide an equivalent correctness guarantee without holding onto buffer pins, which, as I went into, is how it actually works in nbtree's _bt_killitems() today (see commit 2ed5b87f96 for full details). And so in principle I have no problem with the idea of revising the basic definition of plain index scans -- especially if it's to make the definition more abstract, without fundamentally changing it (e.g., to make it no longer reference buffer pins, making life easier for prefetching, while at the same time still implying the same underlying guarantees sufficient to allow nbtree to mostly work the same way as today). All I'm really saying is: 1. The sort of tricks that we can do in nbtree's _bt_killitems() are quite useful, and ought to be preserved in something like their current form, even when prefetching is in use. This seems to push things in the direction of centralizing control of the process in index scan code. For example, it has to understand that _bt_killitems() will be called at some regular cadence that is well defined and sensible from an index point of view. 2. Are you sure that the leaf-page-at-a-time thing is such a huge hindrance to effective prefetching? I suppose that it might be much more important than I imagine it is right now, but it'd be nice to have something a bit more concrete to go on. 3. Even if it is somewhat important, do you really need to get that part working in v1? Tomas' original prototype worked with the leaf-page-at-a-time thing, and that still seemed like a big improvement to me. While being less invasive, in effect. If we can agree that something like that represents a useful step in the right direction (not an evolutionary dead end), then we can make good incremental progress within a single release. > I don't think the fact that it would also be valuable to do index > prefetching is a reason not to do prefetching of heap pages. And, > while it is true that were you to add index interior or leaf page > prefetching, it would impact the heap prefetching, at the end of the > day, the table AM needs some TID or TID-equivalents that whose blocks > it can go fetch. I wasn't really thinking of index page prefetching at all. Just the cost of applying index quals to read leaf pages that might never actually need to be read, due to the presence of a LIMIT. That is kind of a new problem created by eagerly reading (without actually prefetching) leaf pages. > You could argue that my suggestion to have the index AM manage and > populate a queue of TIDs for use by the table AM puts the index AM in > control. I do think having so many members of the IndexScanDescriptor > which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous > interplay between fetching an index tuple and fetching a heap tuple is > confusing and error prone. But that's kinda how amgettuple is supposed to work -- cursors need it to work that way. Having some kind of general notion of scan order is also important to avoid returning duplicate TIDs to the scan. In contrast, GIN heavily relies on the fact that it only supports bitmap scans -- that allows it to not have to reason about returning duplicate TIDs (when dealing with a concurrently merged pending list, and other stuff like that). And so nbtree (and basically every other index AM that supports plain index scans) kinda pretends to process a single tuple at a time, in some fixed order that's convenient for the scan to work with (that's how the executor thinks of things). In reality these index AMs actually process batches consisting of a single leaf page worth of tuples. I don't see how the IndexScanDescData side of things makes life any harder for this patch -- ISTM that you'll always need to pretend to return one tuple at a time from the index scan, regardless of what happens under the hood, with pins and whatnot. The page-at-a-time thing is more or less an implementation detail that's private to index AMs (albeit in a way that follows certain standard conventions across index AMs) -- it's a leaky abstraction only due to the interactions with VACUUM/TID recycle safety. > Re 2: so the LSN could have been changed by some other process (i.e. > not vacuum), so how often in practice is the LSN actually the same as > when the page was scanned/read? It seems very hard to make generalizations about that sort of thing. It doesn't help that we now have batching logic inside _bt_simpledel_pass() that will make up for the problem of not setting as many LP_DEAD bits as we could in many important cases. (I recall that that was one factor that allowed the bug that Andres fixed in commit 90c885cd to go undetected for months. I recall discussing the issue with Andres around that time.) > Do you think we would catch a > meaningful number of kill prior tuple opportunities if we used an LSN > tracking method like this? Something that let us drop the pin on the > page would obviously be better. Quite possibly, yes. But it's hard to say for sure without far more detailed analysis. Plus you have problems with things like unlogged indexes not having an LSN to use as a canary condition, which makes it a bit messy (it's already kind of weird that we treat unlogged indexes differently here IMV). -- Peter Geoghegan ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: index prefetching @ 2024-02-15 14:36 Tomas Vondra <[email protected]> parent: Peter Geoghegan <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Tomas Vondra @ 2024-02-15 14:36 UTC (permalink / raw) To: Peter Geoghegan <[email protected]>; Melanie Plageman <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]> On 2/15/24 00:06, Peter Geoghegan wrote: > On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman > <[email protected]> wrote: > >> ... > > 2. Are you sure that the leaf-page-at-a-time thing is such a huge > hindrance to effective prefetching? > > I suppose that it might be much more important than I imagine it is > right now, but it'd be nice to have something a bit more concrete to > go on. > This probably depends on which corner cases are considered important. The page-at-a-time approach essentially means index items at the beginning of the page won't get prefetched (or vice versa, prefetch distance drops to 0 when we get to end of index page). That may be acceptable, considering we can usually fit 200+ index items on a single page. Even then it limits what effective_io_concurrency values are sensible, but in my experience quickly diminish past ~32. > 3. Even if it is somewhat important, do you really need to get that > part working in v1? > > Tomas' original prototype worked with the leaf-page-at-a-time thing, > and that still seemed like a big improvement to me. While being less > invasive, in effect. If we can agree that something like that > represents a useful step in the right direction (not an evolutionary > dead end), then we can make good incremental progress within a single > release. > It certainly was a great improvement, no doubt about that. I dislike the restriction, but that's partially for aesthetic reasons - it just seems it'd be nice to not have this. That being said, I'd be OK with having this restriction if it makes v1 feasible. For me, the big question is whether it'd mean we're stuck with this restriction forever, or whether there's a viable way to improve this in v2. And I don't have answer to that :-( I got completely lost in the ongoing discussion about the locking implications (which I happily ignored while working on the PoC patch), layering tensions and questions which part should be "in control". regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: index prefetching @ 2024-02-15 16:42 Peter Geoghegan <[email protected]> parent: Tomas Vondra <[email protected]> 0 siblings, 0 replies; 4+ messages in thread From: Peter Geoghegan @ 2024-02-15 16:42 UTC (permalink / raw) To: Tomas Vondra <[email protected]>; +Cc: Melanie Plageman <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]> On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra <[email protected]> wrote: > On 2/15/24 00:06, Peter Geoghegan wrote: > > I suppose that it might be much more important than I imagine it is > > right now, but it'd be nice to have something a bit more concrete to > > go on. > > > > This probably depends on which corner cases are considered important. > > The page-at-a-time approach essentially means index items at the > beginning of the page won't get prefetched (or vice versa, prefetch > distance drops to 0 when we get to end of index page). I don't think that's true. At least not for nbtree scans. As I went into last year, you'd get the benefit of the work I've done on "boundary cases" (most recently in commit c9c0589f from just a couple of months back), which helps us get the most out of suffix truncation. This maximizes the chances of only having to scan a single index leaf page in many important cases. So I can see no reason why index items at the beginning of the page are at any particular disadvantage (compared to those from the middle or the end of the page). Where you might have a problem is cases where it's just inherently necessary to visit more than a single leaf page, despite the best efforts of the nbtsplitloc.c logic -- cases where the scan just inherently needs to return tuples that "straddle the boundary between two neighboring pages". That isn't a particularly natural restriction, but it's also not obvious that it's all that much of a disadvantage in practice. > It certainly was a great improvement, no doubt about that. I dislike the > restriction, but that's partially for aesthetic reasons - it just seems > it'd be nice to not have this. > > That being said, I'd be OK with having this restriction if it makes v1 > feasible. For me, the big question is whether it'd mean we're stuck with > this restriction forever, or whether there's a viable way to improve > this in v2. I think that there is no question that this will need to not completely disable kill_prior_tuple -- I'd be surprised if one single person disagreed with me on this point. There is also a more nuanced way of describing this same restriction, but we don't necessarily need to agree on what exactly that is right now. > And I don't have answer to that :-( I got completely lost in the ongoing > discussion about the locking implications (which I happily ignored while > working on the PoC patch), layering tensions and questions which part > should be "in control". Honestly, I always thought that it made sense to do things on the index AM side. When you went the other way I was surprised. Perhaps I should have said more about that, sooner, but I'd already said quite a bit at that point, so... Anyway, I think that it's pretty clear that "naive desynchronization" is just not acceptable, because that'll disable kill_prior_tuple altogether. So you're going to have to do this in a way that more or less preserves something like the current kill_prior_tuple behavior. It's going to have some downsides, but those can be managed. They can be managed from within the index AM itself, a bit like the _bt_killitems() no-pin stuff does things already. Obviously this interpretation suggests that doing things at the index AM level is indeed the right way to go, layering-wise. Does it make sense to you, though? -- Peter Geoghegan ^ permalink raw reply [nested|flat] 4+ messages in thread
end of thread, other threads:[~2024-02-15 16:42 UTC | newest] Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2024-02-14 21:45 Re: index prefetching Melanie Plageman <[email protected]> 2024-02-14 23:06 ` Peter Geoghegan <[email protected]> 2024-02-15 14:36 ` Tomas Vondra <[email protected]> 2024-02-15 16:42 ` Peter Geoghegan <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox