Re: index prefetching


25+ messages / 6 participants

* Re: index prefetching
@ 2023-12-20 19:09 Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Robert Haas @ 2023-12-20 19:09 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
<[email protected]> wrote:
> Whatever the right abstraction is, it probably needs to do these VM
> checks only once.

Makes sense.

> Yeah, after you pointed out the "leaky" abstraction, I also started to
> think about customizing the behavior using a callback. Not sure what
> exactly you mean by "fully transparent" but as I explained above I think
> we need to allow passing some information between the prefetcher and the
> executor - for example results of the visibility map checks in IOS.

Agreed.

> I have imagined something like this:
>
> nodeIndexscan / index_getnext_slot()
> -> no callback, all TIDs are prefetched
>
> nodeIndexonlyscan / index_getnext_tid()
> -> callback checks VM for the TID, prefetches if not all-visible
> -> the VM check result is stored in the queue with the VM (but in an
>    extensible way, so that other callback can store other stuff)
> -> index_getnext_tid() also returns this extra information
>
> So not that different from the WIP patch, but in a "generic" and
> extensible way. Instead of hard-coding the all-visible flag, there'd be
> some custom information. A bit like qsort_r() has a void* arg to
> pass custom context.
>
> Or if you envisioned something different, could you elaborate a bit?

I can't totally follow the sketch you give above, but I think we're
thinking along similar lines, at least.

> I think if the code stays in indexam.c, it's sensible to keep the index_
> prefix, but then also have a more appropriate rest of the name. For
> example it might be index_prefetch_heap_pages() or something like that.

Yeah, that's not a bad idea.

> > index_prefetch_is_sequential() makes me really nervous
> > because it seems to depend an awful lot on whether the OS is doing
> > prefetching, and how the OS is doing prefetching, and I think those
> > might not be consistent across all systems and kernel versions.
>
> If the OS does not have read-ahead, or it's not configured properly,
> then the patch does not perform worse than what we have now. I'm far
> more concerned about the opposite issue, i.e. causing regressions with
> OS-level read-ahead. And the check handles that well, I think.

I'm just not sure how much I believe that it's going to work well
everywhere. I mean, I have no evidence that it doesn't, it just kind
of looks like guesswork to me. For instance, the behavior of the
algorithm depends heavily on PREFETCH_QUEUE_HISTORY and
PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is
to say that on some system or workload you didn't test the required
values aren't entirely different, or that the whole algorithm doesn't
need rethinking? Maybe we can't really answer that question perfectly,
but the patch doesn't really explain the reasoning behind this choice
of algorithm.

> > Similarly with index_prefetch(). There's a lot of "magical"
> > assumptions here. Even index_prefetch_add_cache() has this problem --
> > the function assumes that it's OK if we sometimes fail to detect a
> > duplicate prefetch request, which makes sense, but under what
> > circumstances is it necessary to detect duplicates and in what cases
> > is it optional? The function comments are silent about that, which
> > makes it hard to assess whether the algorithm is good enough.
>
> I don't quite understand what problem with duplicates you envision here.
> Strictly speaking, we don't need to detect/prevent duplicates - it's
> just that if you do posix_fadvise() for a block that's already in
> memory, it's overhead / wasted time. The whole point is to not do that
> very often. In this sense it's entirely optional, but desirable.

Right ... but the patch sets up some data structure that will
eliminate duplicates in some circumstances and fail to eliminate them
in others. So it's making a judgement that the things it catches are
the cases that are important enough that we need to catch them, and
the things that it doesn't catch are cases that aren't particularly
important to catch. Here again, PREFETCH_LRU_SIZE and
PREFETCH_LRU_COUNT seem like they will have a big impact, but why
these values? The comments suggest that it's because we want to cover
~8MB of data, but it's not clear why that should be the right amount
of data to cover. My naive thought is that we'd want to avoid
prefetching a block again between the time we prefetched it and
when we later read it, but then the value that is here magically 8MB
should really be replaced by the operative prefetch distance.

> I really don't want to have multiple knobs. At this point we have three
> GUCs, each tuning prefetching for a fairly large part of the system:
>
>   effective_io_concurrency = regular queries
>   maintenance_io_concurrency = utility commands
>   recovery_prefetch = recovery / PITR
>
> This seems sensible, but I really don't want many more GUCs tuning
> prefetching for different executor nodes or something like that.
>
> If we have issues with how effective_io_concurrency works (and I'm not
> sure that's actually true), then perhaps we should fix that rather than
> inventing new GUCs.

Well, that would very possibly be a good idea, but I still think using
the same GUC for two different purposes is likely to cause trouble. I
think what effective_io_concurrency currently controls is basically
the heap prefetch distance for bitmap scans, and what you want to
control here is the heap prefetch distance for index scans. If those
are necessarily related in some understandable way (e.g. always the
same, one twice the other, one the square of the other) then it's fine
to use the same parameter for both, but it's not clear to me that this
is the case. I fear someone will find that if they crank up
effective_io_concurrency high enough to get the amount of prefetching
they want for bitmap scans, it will be too much for index scans, or
the other way around.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
@ 2023-12-21 12:30 ` Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Tomas Vondra @ 2023-12-21 12:30 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On 12/20/23 20:09, Robert Haas wrote:
> On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
> ...
>> I have imagined something like this:
>>
>> nodeIndexscan / index_getnext_slot()
>> -> no callback, all TIDs are prefetched
>>
>> nodeIndexonlyscan / index_getnext_tid()
>> -> callback checks VM for the TID, prefetches if not all-visible
>> -> the VM check result is stored in the queue with the VM (but in an
>>    extensible way, so that other callback can store other stuff)
>> -> index_getnext_tid() also returns this extra information
>>
>> So not that different from the WIP patch, but in a "generic" and
>> extensible way. Instead of hard-coding the all-visible flag, there'd be
>> some custom information. A bit like qsort_r() has a void* arg to
>> pass custom context.
>>
>> Or if you envisioned something different, could you elaborate a bit?
> 
> I can't totally follow the sketch you give above, but I think we're
> thinking along similar lines, at least.
> 

Yeah, it's hard to discuss vague descriptions of code that does not
exist yet. I'll try to do the actual patch, then we can discuss.

>>> index_prefetch_is_sequential() makes me really nervous
>>> because it seems to depend an awful lot on whether the OS is doing
>>> prefetching, and how the OS is doing prefetching, and I think those
>>> might not be consistent across all systems and kernel versions.
>>
>> If the OS does not have read-ahead, or it's not configured properly,
>> then the patch does not perform worse than what we have now. I'm far
>> more concerned about the opposite issue, i.e. causing regressions with
>> OS-level read-ahead. And the check handles that well, I think.
> 
> I'm just not sure how much I believe that it's going to work well
> everywhere. I mean, I have no evidence that it doesn't, it just kind
> of looks like guesswork to me. For instance, the behavior of the
> algorithm depends heavily on PREFETCH_QUEUE_HISTORY and
> PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is
> to say that on some system or workload you didn't test the required
> values aren't entirely different, or that the whole algorithm doesn't
> need rethinking? Maybe we can't really answer that question perfectly,
> but the patch doesn't really explain the reasoning behind this choice
> of algorithm.
> 

You're right that a lot of this is guesswork. I don't think we can do much
better, because it depends on stuff that's out of our control - each OS
may do things differently, or perhaps it's just configured differently.

But I don't think this is really a serious issue - all the read-ahead
implementations need to work about the same, because they are meant to
work in a transparent way.

So it's about deciding at which point we think this is a sequential
pattern. Yes, the OS may use a slightly different threshold, but the
exact value does not really matter - in the worst case we prefetch a
couple more/fewer blocks.

The OS read-ahead can't really prefetch anything except sequential
cases, so the whole question is "When does the access pattern get
sequential enough?". I don't think there's a perfect answer, and I don't
think we need a perfect one - we just need to be reasonably close.
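
To make the "sequential enough" question concrete, here is a minimal sketch of
such a check. It is not the code from the patch: the names BlockHistory,
history_add() and history_is_sequential() are made up for illustration, and
HISTORY_SIZE / SEQ_PATTERN_BLOCKS are placeholder values standing in for
PREFETCH_QUEUE_HISTORY and PREFETCH_SEQ_PATTERN_BLOCKS. The idea is simply to
remember the last few heap blocks and skip the explicit prefetch when the most
recent ones form a run of consecutive block numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values, NOT the ones used in the patch. */
#define HISTORY_SIZE        8   /* ring buffer of recent block numbers */
#define SEQ_PATTERN_BLOCKS  4   /* run length treated as "sequential" */

typedef struct BlockHistory
{
    uint32_t    blocks[HISTORY_SIZE];
    uint64_t    nadded;         /* total number of blocks recorded so far */
} BlockHistory;

/* Record a block number, overwriting the oldest entry. */
static void
history_add(BlockHistory *h, uint32_t block)
{
    h->blocks[h->nadded % HISTORY_SIZE] = block;
    h->nadded++;
}

/*
 * Return true if the last SEQ_PATTERN_BLOCKS recorded blocks form a run
 * b, b+1, b+2, ...  In that case a caller would skip the explicit prefetch
 * and leave the work to the OS read-ahead.
 */
static bool
history_is_sequential(const BlockHistory *h)
{
    assert(SEQ_PATTERN_BLOCKS <= HISTORY_SIZE);

    if (h->nadded < SEQ_PATTERN_BLOCKS)
        return false;           /* not enough history yet */

    for (int i = 1; i < SEQ_PATTERN_BLOCKS; i++)
    {
        uint32_t    cur = h->blocks[(h->nadded - i) % HISTORY_SIZE];
        uint32_t    prev = h->blocks[(h->nadded - i - 1) % HISTORY_SIZE];

        if (cur != prev + 1)
            return false;
    }
    return true;
}
```

With these placeholder values, blocks 10, 11, 12, 13 are reported as
sequential, and a single jump to an unrelated block resets the verdict until
a new run of four builds up.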

Also, while I don't want to lazily dismiss valid cases that might be
affected by this, I think that sequential access for index paths is not
that common (with the exception of clustered indexes).

FWIW bitmap index scans have exactly the same "problem" except that no
one cares about it because that's how it worked from the start, so it's
not considered a regression.

>>> Similarly with index_prefetch(). There's a lot of "magical"
>>> assumptions here. Even index_prefetch_add_cache() has this problem --
>>> the function assumes that it's OK if we sometimes fail to detect a
>>> duplicate prefetch request, which makes sense, but under what
>>> circumstances is it necessary to detect duplicates and in what cases
>>> is it optional? The function comments are silent about that, which
>>> makes it hard to assess whether the algorithm is good enough.
>>
>> I don't quite understand what problem with duplicates you envision here.
>> Strictly speaking, we don't need to detect/prevent duplicates - it's
>> just that if you do posix_fadvise() for a block that's already in
>> memory, it's overhead / wasted time. The whole point is to not do that
>> very often. In this sense it's entirely optional, but desirable.
> 
> Right ... but the patch sets up some data structure that will
> eliminate duplicates in some circumstances and fail to eliminate them
> in others. So it's making a judgement that the things it catches are
> the cases that are important enough that we need to catch them, and
> the things that it doesn't catch are cases that aren't particularly
> important to catch. Here again, PREFETCH_LRU_SIZE and
> PREFETCH_LRU_COUNT seem like they will have a big impact, but why
> these values? The comments suggest that it's because we want to cover
> ~8MB of data, but it's not clear why that should be the right amount
> of data to cover. My naive thought is that we'd want to avoid
> prefetching a block again between the time we prefetched it and
> when we later read it, but then the value that is here magically 8MB
> should really be replaced by the operative prefetch distance.
> 

True. Ideally we'd not issue prefetch requests for data that's already in
memory - either in shared buffers or page cache (or whatever). And we
already do that for shared buffers, but not for page cache. The preadv2
experiment was an attempt to do that, but it's too expensive to help.

So we have to approximate, and the only way I can think of is checking
if we recently prefetched that block. Which is the whole point of this
simple cache - remembering which blocks we prefetched, so that we don't
prefetch them over and over again.
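
A deliberately simplified sketch of that idea, assuming a plain FIFO ring
rather than the LRU structure the patch sizes with PREFETCH_LRU_SIZE and
PREFETCH_LRU_COUNT. RecentPrefetches, recent_should_prefetch() and the
RECENT_BLOCKS capacity are hypothetical names and values, chosen only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RECENT_BLOCKS 16        /* placeholder capacity, not the patch's value */

typedef struct RecentPrefetches
{
    uint32_t    blocks[RECENT_BLOCKS];
    int         nused;          /* valid entries, at most RECENT_BLOCKS */
    int         next;           /* slot to overwrite next (FIFO order) */
} RecentPrefetches;

/*
 * Return true if the caller should issue a prefetch for "block", and
 * remember the block.  Return false if the block was prefetched recently
 * enough that it is still remembered.
 */
static bool
recent_should_prefetch(RecentPrefetches *r, uint32_t block)
{
    assert(r->nused >= 0 && r->nused <= RECENT_BLOCKS);

    /* linear search is fine for a sketch; a real cache would hash */
    for (int i = 0; i < r->nused; i++)
    {
        if (r->blocks[i] == block)
            return false;
    }

    /* not seen recently: remember it, evicting the oldest entry if full */
    r->blocks[r->next] = block;
    r->next = (r->next + 1) % RECENT_BLOCKS;
    if (r->nused < RECENT_BLOCKS)
        r->nused++;

    return true;
}
```

The capacity is the knob under discussion: once RECENT_BLOCKS other blocks
have been remembered, an earlier block is forgotten and would be prefetched
again.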

I don't understand what you mean by "cases that are important enough".
In a way, all the blocks are equally important, with exactly the same
impact of making the wrong decision.

You're certainly right the 8MB is a pretty arbitrary value, though. It
seemed reasonable, so I used that, but I might just as well use 32MB or
some other sensible value. Ultimately, any hard-coded value is going to
be wrong, but the negative consequences are a bit asymmetrical. If the
cache is too small, we may end up doing prefetches for data that's
already in cache. If it's too large, we may not prefetch data that's not
in memory at that point.

Obviously, the latter case has much more severe impact, but it depends
on the exact workload / access pattern etc. The only "perfect" solution
would be to actually check the page cache, but well - that seems to be
fairly expensive.

What I was envisioning was something self-tuning, based on the I/O we
may do later. If the prefetcher decides to prefetch something, but finds
it's already in cache, we'd increase the distance, to remember more
blocks. Likewise, if a block is not prefetched but then requires I/O
later, decrease the distance. That'd make it adaptive, but I don't think
we actually have the info about I/O.
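
Purely as a sketch of the adjustment rule, and assuming feedback that (as
said above) we don't actually have: PrefetchFeedback, adjust_cache_size() and
the CACHE_MIN_BLOCKS / CACHE_MAX_BLOCKS bounds are all hypothetical. The
doubling/halving step is an arbitrary choice, too.

```c
#include <assert.h>

/* Placeholder bounds for how many recently prefetched blocks to remember. */
#define CACHE_MIN_BLOCKS    64
#define CACHE_MAX_BLOCKS    8192

/* Hypothetical feedback signals; nothing reports these today. */
typedef enum PrefetchFeedback
{
    FEEDBACK_PREFETCHED_BUT_CACHED,     /* prefetch issued, block was cached */
    FEEDBACK_SKIPPED_BUT_NEEDED_IO      /* prefetch skipped, read hit storage */
} PrefetchFeedback;

/* Return the new number of blocks to remember, given one feedback event. */
static int
adjust_cache_size(int nblocks, PrefetchFeedback fb)
{
    assert(nblocks >= CACHE_MIN_BLOCKS && nblocks <= CACHE_MAX_BLOCKS);

    if (fb == FEEDBACK_PREFETCHED_BUT_CACHED)
        nblocks *= 2;           /* wasted prefetch: remember more blocks */
    else
        nblocks /= 2;           /* missed prefetch: remember fewer blocks */

    /* clamp to the allowed range */
    if (nblocks < CACHE_MIN_BLOCKS)
        nblocks = CACHE_MIN_BLOCKS;
    if (nblocks > CACHE_MAX_BLOCKS)
        nblocks = CACHE_MAX_BLOCKS;

    return nblocks;
}
```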

A bigger "flaw" is that these caches are per-backend, so there's no way
to check if a block was recently prefetched by some other backend. I
actually wonder if maybe this cache should be in shared memory, but I
haven't tried.

Alternatively, I was thinking about moving the prefetches into a
separate worker process (or multiple workers), so we'd just queue the
request and all the overhead would be done by the worker. The main
problem is the overhead of calling posix_fadvise() for blocks that are
already in memory, and this would just move it to a separate backend. I
wonder if that might even make the custom cache unnecessary / optional.

AFAICS this seems similar to some of the AIO patch, I wonder what that
plans to do. I need to check.

>> I really don't want to have multiple knobs. At this point we have three
>> GUCs, each tuning prefetching for a fairly large part of the system:
>>
>>   effective_io_concurrency = regular queries
>>   maintenance_io_concurrency = utility commands
>>   recovery_prefetch = recovery / PITR
>>
>> This seems sensible, but I really don't want many more GUCs tuning
>> prefetching for different executor nodes or something like that.
>>
>> If we have issues with how effective_io_concurrency works (and I'm not
>> sure that's actually true), then perhaps we should fix that rather than
>> inventing new GUCs.
> 
> Well, that would very possibly be a good idea, but I still think using
> the same GUC for two different purposes is likely to cause trouble. I
> think what effective_io_concurrency currently controls is basically
> the heap prefetch distance for bitmap scans, and what you want to
> control here is the heap prefetch distance for index scans. If those
> are necessarily related in some understandable way (e.g. always the
> same, one twice the other, one the square of the other) then it's fine
> to use the same parameter for both, but it's not clear to me that this
> is the case. I fear someone will find that if they crank up
> effective_io_concurrency high enough to get the amount of prefetching
> they want for bitmap scans, it will be too much for index scans, or
> the other way around.
> 

I understand, but I think we should really try to keep the number of
knobs as low as possible, unless we actually have very good arguments
for having separate GUCs. And I don't think we have that.

This is very much about how many concurrent requests the storage can
handle (or rather, how many it needs in order to make full use of its
capabilities), and that's pretty orthogonal to which operation is
generating the requests.

I think this is pretty similar to what we do with work_mem - there's one
value for all possible parts of the query plan, no matter if it's sort,
group by, or something else. We do have separate limits for maintenance
commands, because that's a different matter, and we have the same for
the two I/O GUCs.

If we come to the realization that we really need two GUCs, fine with me.
But at this point I don't see a reason to do that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2023-12-21 13:43   ` Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Andres Freund @ 2023-12-21 13:43 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

Hi,

On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote:
> You're right that a lot of this is guesswork. I don't think we can do much
> better, because it depends on stuff that's out of our control - each OS
> may do things differently, or perhaps it's just configured differently.
> 
> But I don't think this is really a serious issue - all the read-ahead
> implementations need to work about the same, because they are meant to
> work in a transparent way.
> 
> So it's about deciding at which point we think this is a sequential
> pattern. Yes, the OS may use a slightly different threshold, but the
> exact value does not really matter - in the worst case we prefetch a
> couple more/fewer blocks.
> 
> The OS read-ahead can't really prefetch anything except sequential
> cases, so the whole question is "When does the access pattern get
> sequential enough?". I don't think there's a perfect answer, and I don't
> think we need a perfect one - we just need to be reasonably close.

For the streaming read interface (initially backed by fadvise, to then be
replaced by AIO) we found that it's clearly necessary to avoid fadvises in
cases of actual sequential IO - the overhead otherwise leads to easily
reproducible regressions.  So I don't think we have much choice.


> Also, while I don't want to lazily dismiss valid cases that might be
> affected by this, I think that sequential access for index paths is not
> that common (with the exception of clustered indexes).

I think sequential access is common in other cases as well. There are lots of
indexes where heap tids are almost perfectly correlated with index entries;
consider insert-only tables and serial PKs or inserted_at timestamp
columns.  Even leaving those aside, for indexes with many entries
for the same key, we sort by tid these days, which will also result in
"runs" of sequential access.


> Obviously, the latter case has much more severe impact, but it depends
> on the exact workload / access pattern etc. The only "perfect" solution
> would be to actually check the page cache, but well - that seems to be
> fairly expensive.

> What I was envisioning was something self-tuning, based on the I/O we
> may do later. If the prefetcher decides to prefetch something, but finds
> it's already in cache, we'd increase the distance, to remember more
> blocks. Likewise, if a block is not prefetched but then requires I/O
> later, decrease the distance. That'd make it adaptive, but I don't think
> we actually have the info about I/O.

How would the prefetcher know that the data wasn't in cache?


> Alternatively, I was thinking about moving the prefetches into a
> separate worker process (or multiple workers), so we'd just queue the
> request and all the overhead would be done by the worker. The main
> problem is the overhead of calling posix_fadvise() for blocks that are
> already in memory, and this would just move it to a separate backend. I
> wonder if that might even make the custom cache unnecessary / optional.

The AIO patchset provides this.


> AFAICS this seems similar to some of the AIO patch, I wonder what that
> plans to do. I need to check.

Yes, most of this exists there.  The difference is that with AIO you don't
need to prefetch, as you can just initiate the IO for real, and wait for it to
complete.

Greetings,

Andres Freund







* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
@ 2023-12-21 15:20     ` Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Tomas Vondra @ 2023-12-21 15:20 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Robert Haas <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>



On 12/21/23 14:43, Andres Freund wrote:
> Hi,
> 
> On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote:
>> You're right that a lot of this is guesswork. I don't think we can do much
>> better, because it depends on stuff that's out of our control - each OS
>> may do things differently, or perhaps it's just configured differently.
>>
>> But I don't think this is really a serious issue - all the read-ahead
>> implementations need to work about the same, because they are meant to
>> work in a transparent way.
>>
>> So it's about deciding at which point we think this is a sequential
>> pattern. Yes, the OS may use a slightly different threshold, but the
>> exact value does not really matter - in the worst case we prefetch a
>> couple more/fewer blocks.
>>
>> The OS read-ahead can't really prefetch anything except sequential
>> cases, so the whole question is "When does the access pattern get
>> sequential enough?". I don't think there's a perfect answer, and I don't
>> think we need a perfect one - we just need to be reasonably close.
> 
> For the streaming read interface (initially backed by fadvise, to then be
> replaced by AIO) we found that it's clearly necessary to avoid fadvises in
> cases of actual sequential IO - the overhead otherwise leads to easily
> reproducible regressions.  So I don't think we have much choice.
> 

Yeah, the regressions are pretty easy to demonstrate. In fact, I didn't
have such detection in the first patch, but after the first round of
benchmarks it became obvious it's needed.

> 
>> Also, while I don't want to lazily dismiss valid cases that might be
>> affected by this, I think that sequential access for index paths is not
>> that common (with the exception of clustered indexes).
> 
> I think sequential access is common in other cases as well. There are lots of
> indexes where heap tids are almost perfectly correlated with index entries;
> consider insert-only tables and serial PKs or inserted_at timestamp
> columns.  Even leaving those aside, for indexes with many entries
> for the same key, we sort by tid these days, which will also result in
> "runs" of sequential access.
> 

True. I should have thought about those cases.

> 
>> Obviously, the latter case has much more severe impact, but it depends
>> on the exact workload / access pattern etc. The only "perfect" solution
>> would be to actually check the page cache, but well - that seems to be
>> fairly expensive.
> 
>> What I was envisioning was something self-tuning, based on the I/O we
>> may do later. If the prefetcher decides to prefetch something, but finds
>> it's already in cache, we'd increase the distance, to remember more
>> blocks. Likewise, if a block is not prefetched but then requires I/O
>> later, decrease the distance. That'd make it adaptive, but I don't think
>> we actually have the info about I/O.
> 
> How would the prefetcher know that the data wasn't in cache?
> 

I don't think there's a good way to do that, unfortunately, or at least
I'm not aware of it. That's what I meant by "we don't have the info" at
the end. Which is why I haven't tried implementing it.

The only "solution" I could come up with was some sort of "timing" for
the I/O requests and deducing what was cached. Not great, of course.

> 
>> Alternatively, I was thinking about moving the prefetches into a
>> separate worker process (or multiple workers), so we'd just queue the
>> request and all the overhead would be done by the worker. The main
>> problem is the overhead of calling posix_fadvise() for blocks that are
>> already in memory, and this would just move it to a separate backend. I
>> wonder if that might even make the custom cache unnecessary / optional.
> 
> The AIO patchset provides this.
> 

OK, I guess it's time for me to take a look at the patch again.

> 
>> AFAICS this seems similar to some of the AIO patch, I wonder what that
>> plans to do. I need to check.
> 
> Yes, most of this exists there.  The difference is that with AIO you don't
> need to prefetch, as you can just initiate the IO for real, and wait for it to
> complete.
> 

Right, although the line where things stop being "prefetch" and become
"async" seems a bit unclear to me / perhaps more a point of view.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2023-12-21 15:43       ` Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Andres Freund @ 2023-12-21 15:43 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

Hi,

On 2023-12-21 16:20:45 +0100, Tomas Vondra wrote:
> On 12/21/23 14:43, Andres Freund wrote:
> >> AFAICS this seems similar to some of the AIO patch, I wonder what that
> >> plans to do. I need to check.
> > 
> > Yes, most of this exists there.  The difference is that with AIO you don't
> > need to prefetch, as you can just initiate the IO for real, and wait for it to
> > complete.
> > 
> 
> Right, although the line where things stop being "prefetch" and become
> "async" seems a bit unclear to me / perhaps more a point of view.

Agreed. What I meant by not needing prefetching was that you'd not use
fadvise(), because it's better to instead just asynchronously read data into
shared buffers. That way you don't have the doubling of syscalls and you
need to care less about the buffering rate in the kernel.
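
To illustrate the "doubling of syscalls": with the fadvise-based approach,
every block costs one posix_fadvise() call to start the read-ahead and later
one read call to copy the data out of the page cache. The sketch below shows
just those two calls. hint_block() and read_block() are names made up for
this example, BLOCK_SIZE assumes the default 8kB block size, and the real
code paths in PostgreSQL are more involved than this.

```c
#define _POSIX_C_SOURCE 200809L /* for posix_fadvise() and pread() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumed default PostgreSQL block size */

/*
 * Syscall 1: tell the kernel the block will be needed soon.  Returns 0 on
 * success, or an error number (posix_fadvise does not set errno).
 */
static int
hint_block(int fd, off_t blockno)
{
    return posix_fadvise(fd, blockno * BLOCK_SIZE, BLOCK_SIZE,
                         POSIX_FADV_WILLNEED);
}

/*
 * Syscall 2: copy the block from the kernel page cache into our buffer.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t
read_block(int fd, off_t blockno, char *buf)
{
    assert(buf != NULL);
    return pread(fd, buf, BLOCK_SIZE, blockno * BLOCK_SIZE);
}
```

With AIO the first call goes away: the read is started asynchronously and the
data lands directly in the target buffer.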

Greetings,

Andres Freund







* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
@ 2024-01-04 14:55         ` Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Tomas Vondra @ 2024-01-04 14:55 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Robert Haas <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

Hi,

Here's a somewhat reworked version of the patch. My initial goal was to
see if it could adopt the StreamingRead API proposed in [1], but that
turned out to be less straightforward than I hoped, for two reasons:

(1) The StreamingRead API seems to be designed for pages, but the index
code naturally works with TIDs/tuples. Yes, the callbacks can associate
the blocks with custom data (in this case that'd be the TID), but it
seemed a bit strange ...

(2) The place adding requests to the StreamingRead queue is pretty far
from the place actually reading the pages - for prefetching, the
requests would be generated in nodeIndexscan, but the page reading
happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple.
Sure, the TIDs would come from a callback, so it's a bit as if the
requests were generated in heapam_index_fetch_tuple - but it has no idea
StreamingRead exists, so where would it get one?

We might teach it about it, but what if there are multiple places
calling index_fetch_heap()? Not all of which may be using StreamingRead
(only indexscans would do that). Or if there are multiple index scans,
there'd need to be separate StreamingRead queues, right?

In any case, I felt a bit out of my depth here, and I chose not to do
all this work without discussing the direction here. (Also, see the
point about cursors and xs_heap_continue a bit later in this post.)


I did however like the general StreamingRead API - how it splits the
work between the API and the callback. The patch used to do everything,
which meant it hardcoded a lot of the IOS-specific logic etc. I did plan
to have some sort of "callback" for reading from the queue, but that
didn't quite solve this issue - a lot of the stuff remained hard-coded.
But the StreamingRead API made me realize that having a callback for the
first phase (that adds requests to the queue) would fix that.

So I did that - there's now one simple callback for index scans, and
a bit more complex callback for index-only scans. Thanks to this the
hard-coded stuff mostly disappears, which is good.
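
Roughly, the shape of it is something like the following sketch. This is not
the code from the patch: all the Sketch* names are invented here, SketchTid is
a simplified stand-in for a heap TID, and SketchVm is a toy replacement for
the visibility map (an array of per-block flags). The point is only how the
callback decides whether to prefetch and attaches its own data to the queue
entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a heap TID (hypothetical type). */
typedef struct SketchTid
{
    uint32_t    block;
    uint16_t    offset;
} SketchTid;

/* One queue entry: the TID plus opaque data owned by the callback. */
typedef struct SketchEntry
{
    SketchTid   tid;
    void       *data;
} SketchEntry;

/*
 * Decide whether the heap block for "tid" should be prefetched, optionally
 * storing per-entry data in *data.  "arg" is caller-supplied context, in the
 * spirit of the void* argument of qsort_r().
 */
typedef bool (*sketch_prefetch_cb) (SketchTid tid, void *arg, void **data);

/* Plain index scan: prefetch everything, no extra data. */
static bool
indexscan_cb(SketchTid tid, void *arg, void **data)
{
    (void) tid;
    (void) arg;
    *data = NULL;
    return true;
}

/* Toy visibility map: one all-visible flag per heap block. */
typedef struct SketchVm
{
    const bool *all_visible;
    uint32_t    nblocks;
} SketchVm;

/* Index-only scan: check the VM once, prefetch only if not all-visible. */
static bool
indexonlyscan_cb(SketchTid tid, void *arg, void **data)
{
    SketchVm   *vm = arg;

    assert(tid.block < vm->nblocks);

    /* keep the result so the consumer does not repeat the VM check */
    *data = (void *) &vm->all_visible[tid.block];
    return !vm->all_visible[tid.block];
}

/* Fill a queue entry; the result says whether to issue a prefetch. */
static bool
sketch_enqueue(SketchEntry *entry, SketchTid tid,
               sketch_prefetch_cb cb, void *arg)
{
    entry->tid = tid;
    return cb(tid, arg, &entry->data);
}
```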

Perhaps a bigger change is that I decided to move this into a separate
API on top of indexam.c. The original idea was to integrate this into
index_getnext_tid/index_getnext_slot, so that all callers benefit from
the prefetching automatically. Which would be nice, but it also meant
it'd need to happen in the indexam.c code, which seemed dirty.

This patch introduces an API similar to StreamingRead. It calls the
indexam.c stuff, but does all the prefetching on top of it, not in it.
If a place calling index_getnext_tid() wants to allow prefetching, it
needs to switch to IndexPrefetchNext(). (There's no function that would
replace index_getnext_slot, at the moment. Maybe there should be.)

Note 1: The IndexPrefetch name is a bit misleading, because it's used
even with prefetching disabled - all index reads from the index scan
happen through it. Maybe it should be called IndexReader or something
like that.

Note 2: I left the code in indexam.c for now, but in principle it could
(should) be moved to a different place.

I think this layering makes sense, and it's probably much closer to what
Andres meant when he said the prefetching should happen in the executor.
Even if the patch ends up using StreamingRead in the future, I guess
we'll want something like IndexPrefetch - it might use the StreamingRead
internally, but it would still need to do some custom stuff to detect
I/O patterns or something that does not quite fit into the StreamingRead.
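
As an example of the "custom stuff", this is roughly what the
sequential-pattern check looks like when pulled out into a standalone
sketch. It is loosely modelled on index_prefetch_is_sequential() from
the attached patch, but simplified; BlockHistory, HISTORY_SIZE and
SEQ_PATTERN_BLOCKS are illustrative names and values, not the patch's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEQ_PATTERN_BLOCKS	4	/* run length treated as sequential */
#define HISTORY_SIZE		8	/* ring size, >= SEQ_PATTERN_BLOCKS */

typedef struct BlockHistory
{
	uint32_t	items[HISTORY_SIZE];	/* recent block numbers (ring) */
	uint64_t	count;					/* blocks recorded so far */
} BlockHistory;

/*
 * Record the block and report whether the prefetch should be skipped:
 * either it repeats the previous block, or it completes a run of
 * SEQ_PATTERN_BLOCKS consecutive block numbers.
 */
static bool
history_skip_prefetch(BlockHistory *h, uint32_t block)
{
	/* Same block as the previous request? No need to record it again. */
	if (h->count > 0 && h->items[(h->count - 1) % HISTORY_SIZE] == block)
		return true;

	/* Record first, so that long runs keep being recognized. */
	h->items[h->count % HISTORY_SIZE] = block;
	h->count++;

	/* Not enough history to call anything sequential yet. */
	if (h->count < SEQ_PATTERN_BLOCKS)
		return false;

	/* The block "i" steps back has to be exactly "i" lower. */
	for (uint32_t i = 1; i < SEQ_PATTERN_BLOCKS; i++)
	{
		uint32_t	earlier = h->items[(h->count - 1 - i) % HISTORY_SIZE];

		if (earlier != block - i)
			return false;
	}

	return true;
}
```

Feeding it blocks 10, 11, 12, 13 reports the fourth one as sequential,
and a jump to some unrelated block resets that.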


Now, let's talk about two (mostly unrelated) problems I ran into.

Firstly, I realized there's a bit of a problem with cursors. The
prefetching works like this:

1) reading TIDs from the index
2) stashing them into a queue in IndexPrefetch
3) doing prefetches for the new TIDs added to the queue
4) returning the TIDs to the caller, one by one

And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff
like that. I'm not sure how difficult it'd be to make this work. I
suppose we could simply discard the prefetched entries and do the right
number of steps back for the index scan. But I haven't tried, and maybe
it's more complex than I'm imagining. Also, if the cursor changes the
direction a lot, it'd make the prefetching harmful.

The patch simply disables prefetching for such queries, using the same
logic that we do for parallelism. This may be over-zealous.
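
For reference, the four steps listed above boil down to something like
the following toy model (forward direction only). This is a sketch, not
the patch: the "index" is just an array of block numbers, the prefetch
is a counter, and ToyPrefetcher / toy_fill / toy_next are made-up names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE	8		/* hypothetical ring capacity */

typedef struct ToyPrefetcher
{
	const uint32_t *source;		/* stand-in for the index scan */
	int			nsource;
	int			next_source;	/* next item to read from the "index" */

	uint32_t	queue[QUEUE_SIZE];
	uint64_t	head;			/* next queue position to return */
	uint64_t	tail;			/* next queue position to fill */

	int			target;			/* current prefetch distance */
	int			max_target;		/* upper bound, must be <= QUEUE_SIZE */
	int			nprefetched;	/* prefetch requests "issued" so far */
} ToyPrefetcher;

/* Steps 1-3: read from the index, stash into the queue, prefetch. */
static void
toy_fill(ToyPrefetcher *p)
{
	/* ramp up first, so the first call loads at least one item */
	if (p->target < p->max_target)
		p->target++;

	while ((int) (p->tail - p->head) < p->target &&
		   p->next_source < p->nsource)
	{
		p->queue[p->tail % QUEUE_SIZE] = p->source[p->next_source++];
		p->tail++;
		p->nprefetched++;	/* real code would issue the prefetch here */
	}
}

/* Step 4: return queued items to the caller, one by one. */
static bool
toy_next(ToyPrefetcher *p, uint32_t *block)
{
	toy_fill(p);

	if (p->head == p->tail)
		return false;		/* index exhausted and queue drained */

	*block = p->queue[p->head % QUEUE_SIZE];
	p->head++;
	return true;
}
```

The queue only ever moves forward here, which is exactly why a change
of scan direction is a problem - the items already read ahead are in
the "wrong" direction relative to what the caller asks for next.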

FWIW this is one of the things that probably should remain outside of
StreamingRead API - it seems pretty index-specific, and I'm not sure
we'd even want to support these "backward" movements in the API.


The other issue I'm aware of is handling xs_heap_continue. I believe it
works fine for "false" but I need to take a look at non-MVCC snapshots
(i.e. when xs_heap_continue=true).


I haven't done any benchmarks with this reworked API - there's a couple
more allocations etc. but it did not change in a fundamental way. I
don't expect any major difference.

regards



[1]
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%...

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

  [text/x-patch] v20240103-0001-prefetch-2023-12-09.patch (45.2K, 2-v20240103-0001-prefetch-2023-12-09.patch)
  download | inline diff:
From 74bd0d6b70fa8ca3a1b26196de6b7a9cc670ac9b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240103 1/2] prefetch 2023-12-09

Patch version shared on 2023/12/09.
---
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/index/genam.c         |   4 +-
 src/backend/access/index/indexam.c       | 551 ++++++++++++++++++++++-
 src/backend/commands/explain.c           |  18 +
 src/backend/executor/execIndexing.c      |   6 +-
 src/backend/executor/execReplication.c   |   9 +-
 src/backend/executor/instrument.c        |   4 +
 src/backend/executor/nodeIndexonlyscan.c |  99 +++-
 src/backend/executor/nodeIndexscan.c     |  71 ++-
 src/backend/utils/adt/selfuncs.c         |   2 +-
 src/include/access/genam.h               | 115 ++++-
 src/include/executor/instrument.h        |   2 +
 src/include/nodes/execnodes.h            |   4 +
 13 files changed, 868 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c28dafb728..26d3ec20b63 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -792,7 +792,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
+			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot, NULL))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4ca12006843..72e7c9f206c 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -509,7 +509,7 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
+		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot, NULL))
 		{
 			bool		shouldFree;
 
@@ -713,7 +713,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
+	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot, NULL))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f23e0199f08..f96aeba1b39 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -49,16 +49,19 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/transam.h"
+#include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "catalog/pg_amproc.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
+#include "common/hashfn.h"
 #include "nodes/makefuncs.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
@@ -108,6 +111,13 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
 											  ParallelIndexScanDesc pscan, bool temp_snap);
 
+static void index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
+								IndexPrefetch *prefetch);
+static ItemPointer index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
+										  IndexPrefetch *prefetch, bool *all_visible);
+static void index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
+						   ItemPointer tid, bool skip_all_visible, bool *all_visible);
+
 
 /* ----------------------------------------------------------------
  *				   index_ interface functions
@@ -536,8 +546,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
  * or NULL if no more matching tuples exist.
  * ----------------
  */
-ItemPointer
-index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
+static ItemPointer
+index_getnext_tid_internal(IndexScanDesc scan, ScanDirection direction)
 {
 	bool		found;
 
@@ -636,16 +646,21 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
  * ----------------
  */
 bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
+index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot,
+				   IndexPrefetch *prefetch)
 {
 	for (;;)
 	{
+		/* Do prefetching (if requested/enabled). */
+		index_prefetch_tids(scan, direction, prefetch);
+
 		if (!scan->xs_heap_continue)
 		{
-			ItemPointer tid;
+			ItemPointer	tid;
+			bool		all_visible;
 
 			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid(scan, direction);
+			tid = index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
 
 			/* If we're out of index entries, we're done */
 			if (tid == NULL)
@@ -1003,3 +1018,529 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 
 	return build_local_reloptions(&relopts, attoptions, validate);
 }
+
+/*
+ * index_prefetch_is_sequential
+ *		Track the block number and check if the I/O pattern is sequential,
+ *		or if the same block was just prefetched.
+ *
+ * Prefetching is cheap, but for some access patterns the benefits are small
+ * compared to the extra overhead. In particular, for sequential access the
+ * read-ahead performed by the OS is very effective/efficient. Doing more
+ * prefetching is just increasing the costs.
+ *
+ * This tries to identify simple sequential patterns, so that we can skip
+ * the prefetching request. This is implemented by having a small queue
+ * of block numbers, and checking it before prefetching another block.
+ *
+ * We look at the preceding PREFETCH_SEQ_PATTERN_BLOCKS blocks, and see if
+ * they are sequential. We also check if the block is the same as the last
+ * request (which is not sequential).
+ *
+ * Note that the main prefetch queue is not really useful for this, as it
+ * stores TIDs while we care about block numbers. Consider a sorted table,
+ * with a perfectly sequential pattern when accessed through an index. Each
+ * heap page may have dozens of TIDs, but we need to check block numbers.
+ * We could keep enough TIDs to cover enough blocks, but then we also need
+ * to walk those when checking the pattern (in hot path).
+ *
+ * So instead, we maintain a small separate queue of block numbers, and we use
+ * this instead.
+ *
+ * Returns true if the block is in a sequential pattern (and so should not be
+ * prefetched), or false (not sequential, should be prefetched).
+ *
+ * XXX The name is a bit misleading, as it also adds the block number to the
+ * block queue and checks if the block is the same as the last one (which
+ * does not require a sequential pattern).
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch *prefetch, BlockNumber block)
+{
+	int			idx;
+
+	/*
+	 * If the block queue is empty, just store the block and we're done (it's
+	 * neither a sequential pattern, nor a recently prefetched block).
+	 */
+	if (prefetch->blockIndex == 0)
+	{
+		prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+		prefetch->blockIndex++;
+		return false;
+	}
+
+	/*
+	 * Check if it's the same as the immediately preceding block. We don't
+	 * want to prefetch the same block over and over (which would happen for
+	 * well correlated indexes).
+	 *
+	 * In principle we could rely on index_prefetch_add_cache doing this using
+	 * the full cache, but this check is much cheaper and we need to look at
+	 * the preceding block anyway, so we just do it.
+	 *
+	 * XXX Notice we haven't added the block to the block queue yet, and there
+	 * is a preceding block (i.e. blockIndex-1 is valid).
+	 */
+	if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+		return true;
+
+	/*
+	 * Add the block number to the queue.
+	 *
+	 * We do this before checking the pattern, because we want to know
+	 * about the block even if we end up skipping the prefetch. Otherwise we'd
+	 * not be able to detect longer sequential patterns - we'd skip one block
+	 * but then fail to skip the next couple blocks even in a perfect
+	 * sequential pattern. This oscillation might even prevent the OS
+	 * read-ahead from kicking in.
+	 */
+	prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+	prefetch->blockIndex++;
+
+	/*
+	 * Check if the last couple blocks are in a sequential pattern. We look
+	 * for a sequential pattern of PREFETCH_SEQ_PATTERN_BLOCKS (4 by default),
+	 * so we look for patterns of 5 pages (40kB) including the new block.
+	 *
+	 * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+	 *
+	 * XXX Could it be harmful that we read the queue backwards? Maybe memory
+	 * prefetching works better for the forward direction?
+	 */
+	for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+	{
+		/*
+		 * Are there enough requests to confirm a sequential pattern? We only
+		 * consider something to be sequential after finding a sequence of
+		 * PREFETCH_SEQ_PATTERN_BLOCKS blocks.
+		 *
+		 * FIXME Better to move this outside the loop.
+		 */
+		if (prefetch->blockIndex < i)
+			return false;
+
+		/*
+		 * Calculate index of the earlier block (we need to do -1 as we
+		 * already incremented the index when adding the new block to the
+		 * queue).
+		 */
+		idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+		/*
+		 * For a sequential pattern, the block "k" steps ago needs to have a
+		 * block number smaller by "k" compared to the current block.
+		 */
+		if (prefetch->blockItems[idx] != (block - i))
+			return false;
+	}
+
+	return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ *		Add a block to the cache, check if it was recently prefetched.
+ *
+ * We don't want to prefetch blocks that we already prefetched recently. It's
+ * cheap but not free, and the overhead may have measurable impact.
+ *
+ * This check needs to be very cheap, even with fairly large caches (hundreds
+ * of entries, see PREFETCH_CACHE_SIZE).
+ *
+ * A simple queue would allow expiring the requests, but checking if it
+ * contains a particular block prefetched would be expensive (linear search).
+ * Another option would be a simple hash table, which has fast lookup but
+ * does not allow expiring entries cheaply.
+ *
+ * The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also need
+ * to expire entries, so that only "recent" requests are remembered.
+ *
+ * We use a hybrid cache that is organized as many small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table). The LRU caches are tiny (e.g. 8 entries), and the expiration
+ * happens at the level of a single LRU (by tracking only the 8 most recent requests).
+ *
+ * This allows quick searches and expiration, but with false negatives (when a
+ * particular LRU has too many collisions, we may evict entries that are more
+ * recent than some other LRU).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch requests in total (these are the default parameters).
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has an entry for the same block
+ * and a request less than PREFETCH_CACHE_SIZE ago.
+ *
+ * At the same time, we either update the entry (for the queried block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ *
+ * Returns true if the block was recently prefetched (and thus we don't
+ * need to prefetch it again), or false (should do a prefetch).
+ *
+ * XXX It's a bit confusing these return values are inverse compared to
+ * what index_prefetch_is_sequential does.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch *prefetch, BlockNumber block)
+{
+	IndexPrefetchCacheEntry *entry;
+
+	/* map the block number to the LRU */
+	int			lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+	/* age/index of the oldest entry in the LRU, to maybe use */
+	uint64		oldestRequest = PG_UINT64_MAX;
+	int			oldestIndex = -1;
+
+	/*
+	 * First add the block to the (tiny) top-level LRU cache and see if it's
+	 * part of a sequential pattern. In this case we just ignore the block and
+	 * don't prefetch it - we expect read-ahead to do a better job.
+	 *
+	 * XXX Maybe we should still add the block to the hybrid cache, in case we
+	 * happen to access it later? That might help if we first scan a lot of
+	 * the table sequentially, and then randomly. Not sure that's very likely
+	 * with index access, though.
+	 */
+	if (index_prefetch_is_sequential(prefetch, block))
+	{
+		prefetch->countSkipSequential++;
+		return true;
+	}
+
+	/*
+	 * See if we recently prefetched this block - we simply scan the LRU
+	 * linearly. While doing that, we also track the oldest entry, so that we
+	 * know where to put the block if we don't find a matching entry.
+	 */
+	for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+	{
+		entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+		/* Is this the oldest prefetch request in this LRU? */
+		if (entry->request < oldestRequest)
+		{
+			oldestRequest = entry->request;
+			oldestIndex = i;
+		}
+
+		/*
+		 * If the entry is unused (identified by request being set to 0),
+		 * skip it. Notice the field is uint64, so an empty entry is
+		 * guaranteed to be the oldest one.
+		 */
+		if (entry->request == 0)
+			continue;
+
+		/* Is this entry for the same block as the current request? */
+		if (entry->block == block)
+		{
+			bool		prefetched;
+
+			/*
+			 * Is the old request sufficiently recent? If yes, we treat the
+			 * block as already prefetched.
+			 *
+			 * XXX We do add the cache size to the request in order not to
+			 * have issues with uint64 underflows.
+			 */
+			prefetched = ((entry->request + PREFETCH_CACHE_SIZE) >= prefetch->prefetchReqNumber);
+
+			/* Update the request number. */
+			entry->request = ++prefetch->prefetchReqNumber;
+
+			prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+			return prefetched;
+		}
+	}
+
+	/*
+	 * We didn't find the block in the LRU, so store it either in an empty
+	 * entry, or in the "oldest" prefetch request in this LRU.
+	 */
+	Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+	/* FIXME do a nice macro */
+	entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+	entry->block = block;
+	entry->request = ++prefetch->prefetchReqNumber;
+
+	/* not in the prefetch cache */
+	return false;
+}
+
+/*
+ * index_prefetch
+ *		Prefetch the TID, unless it's sequential or recently prefetched.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of rescans and number of
+ * items (TIDs) actually returned from the scan. Then we could calculate
+ * rows / rescan and use that to clamp prefetch target.
+ *
+ * That'd help with cases when a scan matches only very few rows, far less
+ * than the prefetchTarget, because the unnecessary prefetches are wasted
+ * I/O. Imagine a LIMIT on top of index scan, or something like that.
+ *
+ * Another option is to use the planner estimates - we know how many rows we're
+ * expecting to fetch (on average, assuming the estimates are reasonably
+ * accurate), so why not use that?
+ *
+ * Of course, we could/should combine these two approaches.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems somewhat wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ *
+ * XXX Could we tune the cache size based on execution statistics? We have
+ * a cache of limited size (PREFETCH_CACHE_SIZE = 1024 by default), but
+ * how do we know it's the right size? Ideally, we'd have a cache large
+ * enough to track actually cached blocks. If the OS caches 10240 pages,
+ * then we may do 90% of prefetch requests unnecessarily. Or maybe there's
+ * a lot of contention, blocks are evicted quickly, and 90% of the blocks
+ * in the cache are not actually cached anymore? But we do have a concept
+ * of sequential request ID (PrefetchCacheEntry->request), which gives us
+ * information about "age" of the last prefetch. Now it's used only when
+ * evicting entries (to keep the more recent one), but maybe we could also
+ * use it when deciding if the page is cached. Right now any block that's
+ * in the cache is considered cached and not prefetched, but maybe we could
+ * have "max age", and tune it based on feedback from reading the blocks
+ * later. For example, if we find the block in cache and decide not to
+ * prefetch it, but then later find we have to do I/O, it means our cache
+ * is too large. And we could "reduce" the maximum age (measured from the
+ * current prefetchReqNumber value), so that only more recent blocks would
+ * be considered cached. Not sure about the opposite direction, where we
+ * decide to prefetch a block - AFAIK we don't have a way to determine if
+ * I/O was needed or not in this case (so we can't increase the max age).
+ * But maybe we could do that somehow speculatively, i.e. increase the
+ * value once in a while, and see what happens.
+ */
+static void
+index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
+			   ItemPointer tid, bool skip_all_visible, bool *all_visible)
+{
+	BlockNumber block;
+
+	/* by default not all visible (or we didn't check) */
+	*all_visible = false;
+
+	/*
+	 * No heap relation means bitmap index scan, which does prefetching at the
+	 * bitmap heap scan, so no prefetch here (we can't do it anyway, without
+	 * the heap)
+	 *
+	 * XXX But in this case we should have prefetchMaxTarget=0, because in
+	 * index_beginscan_bitmap() we disable prefetching. So maybe we should
+	 * just check that.
+	 */
+	if (!prefetch)
+		return;
+
+	/*
+	 * If we got here, prefetching is enabled and it's a node that supports
+	 * prefetching (i.e. it can't be a bitmap index scan).
+	 */
+	Assert(scan->heapRelation);
+
+	block = ItemPointerGetBlockNumber(tid);
+
+	/*
+	 * When prefetching for IOS, we want to only prefetch pages that are not
+	 * marked as all-visible (because not fetching all-visible pages is the
+	 * point of IOS).
+	 *
+	 * XXX This is not great, because it releases the VM buffer for each TID
+	 * we consider to prefetch. We should reuse that somehow, similar to the
+	 * actual IOS code. Ideally, we should use the same ioss_VMBuffer (if
+	 * we can propagate it here). Or at least do it for a bulk of prefetches,
+	 * although that's not very useful - after the ramp-up we will prefetch
+	 * the pages one by one anyway.
+	 *
+	 * XXX Ideally we'd also propagate this to the executor, so that the
+	 * nodeIndexonlyscan.c doesn't need to repeat the same VM check (which
+	 * is measurable). But the index_getnext_tid() is not really well
+	 * suited for that, so the API needs a change.
+	 */
+	if (skip_all_visible)
+	{
+		*all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+									  block,
+									  &prefetch->vmBuffer);
+
+		if (*all_visible)
+			return;
+	}
+
+	/*
+	 * Do not prefetch the same block over and over again.
+	 *
+	 * This happens e.g. for clustered or naturally correlated indexes (fkey
+	 * to a sequence ID). It's not expensive (the block is in page cache
+	 * already, so no I/O), but it's not free either.
+	 */
+	if (!index_prefetch_add_cache(prefetch, block))
+	{
+		prefetch->countPrefetch++;
+
+		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+		pgBufferUsage.blks_prefetches++;
+	}
+
+	prefetch->countAll++;
+}
+
+/* ----------------
+ * index_getnext_tid - get the next TID from a scan
+ *
+ * The result is the next TID satisfying the scan keys,
+ * or NULL if no more matching tuples exist.
+ *
+ * FIXME not sure this handles xs_heapfetch correctly.
+ * ----------------
+ */
+ItemPointer
+index_getnext_tid(IndexScanDesc scan, ScanDirection direction,
+				  IndexPrefetch *prefetch)
+{
+	bool		all_visible;	/* ignored */
+
+	/* Do prefetching (if requested/enabled). */
+	index_prefetch_tids(scan, direction, prefetch);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
+}
+
+ItemPointer
+index_getnext_tid_vm(IndexScanDesc scan, ScanDirection direction,
+					 IndexPrefetch *prefetch, bool *all_visible)
+{
+	/* Do prefetching (if requested/enabled). */
+	index_prefetch_tids(scan, direction, prefetch);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return index_prefetch_get_tid(scan, direction, prefetch, all_visible);
+}
+
+static void
+index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
+					IndexPrefetch *prefetch)
+{
+	/*
+	 * If the prefetching is still active (i.e. enabled and we still
+	 * haven't finished reading TIDs from the scan), read enough TIDs into
+	 * the queue until we hit the current target.
+	 */
+	if (PREFETCH_ACTIVE(prefetch))
+	{
+		/*
+		 * Ramp up the prefetch distance incrementally.
+		 *
+		 * Intentionally done first, before reading the TIDs into the
+		 * queue, so that there's always at least one item. Otherwise we
+		 * might get into a situation where we start with target=0 and no
+		 * TIDs loaded.
+		 */
+		prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+									   prefetch->prefetchMaxTarget);
+
+		/*
+		 * Now read TIDs from the index until the queue is full (with
+		 * respect to the current prefetch target).
+		 */
+		while (!PREFETCH_FULL(prefetch))
+		{
+			ItemPointer tid;
+			bool		all_visible;
+
+			/* Time to fetch the next TID from the index */
+			tid = index_getnext_tid_internal(scan, direction);
+
+			/*
+			 * If we're out of index entries, we're done (and we mark
+			 * the prefetcher as inactive).
+			 */
+			if (tid == NULL)
+			{
+				prefetch->prefetchDone = true;
+				break;
+			}
+
+			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+			/*
+			 * Issue the actual prefetch requests for the new TID.
+			 *
+			 * XXX index_getnext_tid_prefetch is only called for IOS (for now),
+			 * so skip prefetching of all-visible pages.
+			 */
+			index_prefetch(scan, prefetch, tid, prefetch->indexonly, &all_visible);
+
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].tid = *tid;
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].all_visible = all_visible;
+			prefetch->queueEnd++;
+		}
+	}
+}
+
+static ItemPointer
+index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
+					   IndexPrefetch *prefetch, bool *all_visible)
+{
+	/*
+	 * With prefetching enabled (even if we already finished reading
+	 * all TIDs from the index scan), we need to return a TID from the
+	 * queue. Otherwise, we just get the next TID from the scan
+	 * directly.
+	 */
+	if (PREFETCH_ENABLED(prefetch))
+	{
+		/* Did we reach the end of the scan and the queue is empty? */
+		if (PREFETCH_DONE(prefetch))
+			return NULL;
+
+		scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
+		*all_visible = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].all_visible;
+		prefetch->queueIndex++;
+	}
+	else				/* not prefetching, just do the regular work  */
+	{
+		ItemPointer tid;
+
+		/* Time to fetch the next TID from the index */
+		tid = index_getnext_tid_internal(scan, direction);
+		*all_visible = false;
+
+		/* If we're out of index entries, we're done */
+		if (tid == NULL)
+			return NULL;
+
+		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+	}
+
+	/* Return the TID of the tuple we found. */
+	return &scan->xs_heaptid;
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f1d71bc54e8..6810996edfd 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3568,6 +3568,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 										!INSTR_TIME_IS_ZERO(usage->local_blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp ||
 												  has_shared_timing ||
@@ -3679,6 +3680,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+				appendStringInfo(es->str, " blocks=%lld",
+								 (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+				appendStringInfo(es->str, " rounds=%lld",
+								 (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 2fa2118f3c2..0a136db6712 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -777,7 +777,11 @@ retry:
 	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
+	/*
+	 * XXX Would be nice to also benefit from prefetching here. All we need to
+	 * do is instantiate the prefetcher, I guess.
+	 */
+	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot, NULL))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 81f27042bc4..9498b00fa64 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -212,8 +212,13 @@ retry:
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
-	/* Try to find the tuple */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
+	/*
+	 * Try to find the tuple
+	 *
+	 * XXX Would be nice to also benefit from prefetching here. All we need to
+	 * do is instantiate the prefetcher, I guess.
+	 */
+	while (index_getnext_slot(scan, ForwardScanDirection, outslot, NULL))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c383f34c066..0011d9f679c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->shared_blk_read_time, add->shared_blk_read_time);
 	INSTR_TIME_ADD(dst->shared_blk_write_time, add->shared_blk_write_time);
 	INSTR_TIME_ADD(dst->local_blk_read_time, add->local_blk_read_time);
@@ -259,6 +261,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
 						  add->shared_blk_read_time, sub->shared_blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index f1db35665c8..a7eadaf3db2 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -43,7 +43,7 @@
 #include "storage/predicate.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-
+#include "utils/spccache.h"
 
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
@@ -65,6 +65,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	IndexPrefetch  *prefetch;
+	bool			all_visible;
 
 	/*
 	 * extract necessary information from index scan node
@@ -78,6 +80,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->ioss_ScanDesc;
+	prefetch = node->ioss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -116,7 +119,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((tid = index_getnext_tid_vm(scandesc, direction, prefetch, &all_visible)) != NULL)
 	{
 		bool		tuple_from_heap = false;
 
@@ -155,8 +158,11 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 *
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
+		 *
+		 * XXX Skip if we already know the page is all visible from prefetcher.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+		if (!all_visible &&
+			!VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
 		{
@@ -353,6 +359,16 @@ ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 
+	/* also reset the prefetcher, so that we start from scratch */
+	if (node->ioss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->ioss_prefetch;
+
+		prefetch->queueIndex = 0;
+		prefetch->queueStart = 0;
+		prefetch->queueEnd = 0;
+	}
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -380,6 +396,26 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		node->ioss_VMBuffer = InvalidBuffer;
 	}
 
+	/* Release VM buffer pin from prefetcher, if any. */
+	if (node->ioss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->ioss_prefetch;
+
+		/* XXX some debug info */
+		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+			 prefetch->countAll,
+			 prefetch->countPrefetch,
+			 prefetch->countPrefetch * 100.0 / prefetch->countAll,
+			 prefetch->countSkipCached,
+			 prefetch->countSkipSequential);
+
+		if (prefetch->vmBuffer != InvalidBuffer)
+		{
+			ReleaseBuffer(prefetch->vmBuffer);
+			prefetch->vmBuffer = InvalidBuffer;
+		}
+	}
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -604,6 +640,63 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 		indexstate->ioss_RuntimeContext = NULL;
 	}
 
+	/*
+	 * Also initialize index prefetcher.
+	 *
+	 * XXX No prefetching for direct I/O.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
+	{
+		int			prefetch_max;
+		Relation    heapRel = indexstate->ss.ss_currentRelation;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index. This is
+		 * essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 *
+		 * XXX Maybe reduce the value with parallel workers?
+		 */
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   indexstate->ss.ps.plan->plan_rows);
+
+		/*
+		 * We reach here if the index only scan is not parallel, or if we're
+		 * serially executing an index only scan that was planned to be
+		 * parallel.
+		 *
+		 * XXX Maybe we should enable prefetching, but prefetch only pages that
+		 * are not all-visible (but checking that from the index code seems like
+		 * a violation of layering etc).
+		 *
+		 * XXX This might lead to IOS being slower than plain index scan, if the
+		 * table has a lot of pages that need recheck.
+		 *
+		 * Remember this is index-only scan, because of prefetching. Not the most
+		 * elegant way to pass this info.
+		 */
+		if (prefetch_max > 0)
+		{
+			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+			prefetch->queueIndex = 0;
+			prefetch->queueStart = 0;
+			prefetch->queueEnd = 0;
+
+			prefetch->prefetchTarget = 0;
+			prefetch->prefetchMaxTarget = prefetch_max;
+			prefetch->vmBuffer = InvalidBuffer;
+			prefetch->indexonly = true;
+
+			indexstate->ioss_prefetch = prefetch;
+		}
+	}
+
 	/*
 	 * all done.
 	 */
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 14b9c00217a..b3282ec5a75 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 /*
  * When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	IndexPrefetch  *prefetch;
 
 	/*
 	 * extract necessary information from index scan node
@@ -98,6 +100,7 @@ IndexNext(IndexScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->iss_ScanDesc;
+	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -128,7 +131,7 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (index_getnext_slot(scandesc, direction, slot, prefetch))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -177,6 +180,7 @@ IndexNextWithReorder(IndexScanState *node)
 	Datum	   *lastfetched_vals;
 	bool	   *lastfetched_nulls;
 	int			cmp;
+	IndexPrefetch *prefetch;
 
 	estate = node->ss.ps.state;
 
@@ -193,6 +197,7 @@ IndexNextWithReorder(IndexScanState *node)
 	Assert(ScanDirectionIsForward(estate->es_direction));
 
 	scandesc = node->iss_ScanDesc;
+	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -259,7 +264,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot, prefetch))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -588,6 +593,16 @@ ExecReScanIndexScan(IndexScanState *node)
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 	node->iss_ReachedEnd = false;
 
+	/* also reset the prefetcher, so that we start from scratch */
+	if (node->iss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->iss_prefetch;
+
+		prefetch->queueIndex = 0;
+		prefetch->queueStart = 0;
+		prefetch->queueEnd = 0;
+	}
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -794,6 +809,19 @@ ExecEndIndexScan(IndexScanState *node)
 	indexRelationDesc = node->iss_RelationDesc;
 	indexScanDesc = node->iss_ScanDesc;
 
+	/* XXX nothing to free, but print some debug info */
+	if (node->iss_prefetch)
+	{
+		IndexPrefetch *prefetch = node->iss_prefetch;
+
+		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+			 prefetch->countAll,
+			 prefetch->countPrefetch,
+			 prefetch->countPrefetch * 100.0 / prefetch->countAll,
+			 prefetch->countSkipCached,
+			 prefetch->countSkipSequential);
+	}
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -1066,6 +1094,45 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 		indexstate->iss_RuntimeContext = NULL;
 	}
 
+	/*
+	 * Also initialize index prefetcher.
+	 *
+	 * XXX No prefetching for direct I/O.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
+	{
+		int	prefetch_max;
+		Relation    heapRel = indexstate->ss.ss_currentRelation;
+
+		/*
+		 * Determine number of heap pages to prefetch for this index scan. This
+		 * is essentially just effective_io_concurrency for the table (or the
+		 * tablespace it's in).
+		 *
+		 * XXX Should this also look at plan.plan_rows and maybe cap the target
+		 * to that? Pointless to prefetch more than we expect to use. Or maybe
+		 * just reset to that value during prefetching, after reading the next
+		 * index page (or rather after rescan)?
+		 */
+		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+						   indexstate->ss.ps.plan->plan_rows);
+
+		if (prefetch_max > 0)
+		{
+			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+			prefetch->queueIndex = 0;
+			prefetch->queueStart = 0;
+			prefetch->queueEnd = 0;
+
+			prefetch->prefetchTarget = 0;
+			prefetch->prefetchMaxTarget = prefetch_max;
+			prefetch->vmBuffer = InvalidBuffer;
+
+			indexstate->iss_prefetch = prefetch;
+		}
+	}
+
 	/*
 	 * all done.
 	 */
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e11d022827a..b5c79359425 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6297,7 +6297,7 @@ get_actual_variable_endpoint(Relation heapRel,
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	while ((tid = index_getnext_tid(index_scan, indexscandir, NULL)) != NULL)
 	{
 		BlockNumber block = ItemPointerGetBlockNumber(tid);
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 80dc8d54066..c0c46d7a05f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
 #include "access/sdir.h"
 #include "access/skey.h"
 #include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
 #include "storage/lockdefs.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -128,6 +129,110 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
+
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct IndexPrefetchCacheEntry {
+	BlockNumber		block;
+	uint64			request;
+} IndexPrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define		PREFETCH_LRU_SIZE		8
+#define		PREFETCH_LRU_COUNT		128
+#define		PREFETCH_CACHE_SIZE		(PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define		PREFETCH_QUEUE_HISTORY			8
+#define		PREFETCH_SEQ_PATTERN_BLOCKS		4
+
+typedef struct IndexPrefetchEntry
+{
+	ItemPointerData		tid;
+	bool				all_visible;
+} IndexPrefetchEntry;
+
+typedef struct IndexPrefetch
+{
+	/*
+	 * XXX We need to disable this in some cases (e.g. when using index-only
+	 * scans, we don't want to prefetch pages). Or maybe we should prefetch
+	 * only pages that are not all-visible, that'd be even better.
+	 */
+	int			prefetchTarget;	/* how far we should be prefetching */
+	int			prefetchMaxTarget;	/* maximum prefetching distance */
+	int			prefetchReset;	/* reset to this distance on rescan */
+	bool		prefetchDone;	/* did we get all TIDs from the index? */
+
+	/* runtime statistics */
+	uint64		countAll;		/* all prefetch requests */
+	uint64		countPrefetch;	/* actual prefetches */
+	uint64		countSkipSequential;
+	uint64		countSkipCached;
+
+	/* used when prefetching index-only scans */
+	bool		indexonly;
+	Buffer		vmBuffer;
+
+	/*
+	 * Queue of TIDs to prefetch.
+	 *
+	 * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+	 * than dynamically adjusting for custom values.
+	 */
+	IndexPrefetchEntry	queueItems[MAX_IO_CONCURRENCY];
+	uint64			queueIndex;	/* next TID to prefetch */
+	uint64			queueStart;	/* first valid TID in queue */
+	uint64			queueEnd;	/* first invalid (empty) TID in queue */
+
+	/*
+	 * A couple of last prefetched blocks, used to check for certain access
+	 * patterns and skip prefetching (e.g. for sequential access).
+	 *
+	 * XXX Separate from the main queue, because we only want to compare the
+	 * block numbers, not the whole TID. In sequential access it's likely we
+	 * read many items from each page, and we don't want to check many items
+	 * (as that is much more expensive).
+	 */
+	BlockNumber		blockItems[PREFETCH_QUEUE_HISTORY];
+	uint64			blockIndex;	/* index in the block (points to the first
+								 * empty entry)*/
+
+	/*
+	 * Cache of recently prefetched blocks, organized as a hash table of
+	 * small LRU caches.
+	 */
+	uint64				prefetchReqNumber;
+	IndexPrefetchCacheEntry	prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetch;
+
+#define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
+
+
 /*
  * generalized index_ interface routines (in indexam.c)
  */
@@ -173,11 +278,17 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
-									 ScanDirection direction);
+									 ScanDirection direction,
+									 IndexPrefetch *prefetch);
+extern ItemPointer index_getnext_tid_vm(IndexScanDesc scan,
+										ScanDirection direction,
+										IndexPrefetch *prefetch,
+										bool *all_visible);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   struct TupleTableSlot *slot);
+							   struct TupleTableSlot *slot,
+							   IndexPrefetch *prefetch);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d5d69941c52..f53fb4a1e51 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
+	int64		blks_prefetch_rounds;	/* # of prefetch rounds */
+	int64		blks_prefetches;	/* # of buffers prefetched */
 	instr_time	shared_blk_read_time;	/* time spent reading shared blocks */
 	instr_time	shared_blk_write_time;	/* time spent writing shared blocks */
 	instr_time	local_blk_read_time;	/* time spent reading local blocks */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d7f17dee07..8745453a5b4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1529,6 +1529,7 @@ typedef struct
 	bool	   *elem_nulls;		/* array of num_elems is-null flags */
 } IndexArrayKeyInfo;
 
+
 /* ----------------
  *	 IndexScanState information
  *
@@ -1580,6 +1581,8 @@ typedef struct IndexScanState
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
+
+	IndexPrefetch *iss_prefetch;
 } IndexScanState;
 
 /* ----------------
@@ -1618,6 +1621,7 @@ typedef struct IndexOnlyScanState
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+	IndexPrefetch *ioss_prefetch;
 } IndexOnlyScanState;
 
 /* ----------------
-- 
2.43.0


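For readers following the queue handling in 0001: the queueIndex/queueEnd counters only ever grow, and the PREFETCH_QUEUE_INDEX-style modulo maps them onto the fixed-size array. Below is a minimal standalone sketch of that idea, not code from the patch. All Demo* / demo_* names and the queue size of 8 are made up for illustration, and a plain integer stands in for IndexPrefetchEntry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_IO_CONCURRENCY, small for illustration. */
#define DEMO_QUEUE_SIZE 8

typedef struct DemoQueue
{
	uint32_t	items[DEMO_QUEUE_SIZE];	/* stand-in for IndexPrefetchEntry */
	uint64_t	queueIndex;		/* next item handed to the consumer */
	uint64_t	queueEnd;		/* first empty position */
	int			target;			/* current prefetch distance */
} DemoQueue;

/* Counters grow monotonically; the modulo maps them onto the array. */
#define DEMO_QUEUE_INDEX(a)	((a) % DEMO_QUEUE_SIZE)
#define DEMO_QUEUE_EMPTY(q)	((q)->queueEnd == (q)->queueIndex)
#define DEMO_QUEUE_FULL(q) \
	((q)->queueEnd - (q)->queueIndex == (uint64_t) (q)->target)

/* Append one item; the caller must not exceed the prefetch target. */
static void
demo_queue_push(DemoQueue *q, uint32_t item)
{
	assert(q->target <= DEMO_QUEUE_SIZE);
	assert(!DEMO_QUEUE_FULL(q));

	q->items[DEMO_QUEUE_INDEX(q->queueEnd)] = item;
	q->queueEnd++;
}

/* Remove and return the oldest item. */
static uint32_t
demo_queue_pop(DemoQueue *q)
{
	uint32_t	item;

	assert(!DEMO_QUEUE_EMPTY(q));

	item = q->items[DEMO_QUEUE_INDEX(q->queueIndex)];
	q->queueIndex++;
	return item;
}
```

Because the counters are never reset to array positions, "empty" and "full" are simple comparisons of the two counters, and wraparound needs no special casing as long as the target stays at or below the array size.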

  [text/x-patch] v20240103-0002-switch-to-StreamingRead-like-API.patch (41.0K, 3-v20240103-0002-switch-to-StreamingRead-like-API.patch)
From b9021c498bb273055f8cf8809030c4abc7848737 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 1 Jan 2024 21:50:47 +0100
Subject: [PATCH v20240103 2/2] switch to StreamingRead-like API

---
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/backend/access/index/README.prefetch |   3 +
 src/backend/access/index/genam.c         |   4 +-
 src/backend/access/index/indexam.c       | 254 ++++++++++++-----------
 src/backend/executor/execIndexing.c      |   6 +-
 src/backend/executor/execReplication.c   |   9 +-
 src/backend/executor/nodeIndexonlyscan.c | 149 +++++++------
 src/backend/executor/nodeIndexscan.c     | 105 ++++++----
 src/backend/optimizer/plan/createplan.c  |  27 ++-
 src/backend/utils/adt/selfuncs.c         |   2 +-
 src/include/access/genam.h               |  68 ++++--
 src/include/nodes/execnodes.h            |   4 +-
 src/include/nodes/plannodes.h            |   2 +
 13 files changed, 382 insertions(+), 253 deletions(-)
 create mode 100644 src/backend/access/index/README.prefetch
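
A rough standalone illustration of the callback-driven design this patch switches to: the prefetcher asks a callback for the next entry, the callback decides whether the heap page should be prefetched and may attach private data (for IOS, the visibility-map result). This is only a sketch under assumptions, not the patch code. All Demo* / demo_* names are hypothetical, block numbers stand in for TIDs, "even blocks are all-visible" replaces the real VM lookup, and a counter replaces the actual prefetch request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QUEUE_SIZE 16		/* hypothetical queue capacity */

typedef struct DemoEntry
{
	uint32_t	block;			/* stand-in for the TID's heap block */
	bool		prefetch;		/* should the heap page be prefetched? */
	void	   *data;			/* callback-private payload (e.g. VM result) */
} DemoEntry;

struct DemoPrefetch;
typedef DemoEntry *(*DemoNextCB) (struct DemoPrefetch *p);

typedef struct DemoPrefetch
{
	DemoNextCB	next_cb;		/* produces the next entry, or NULL at end */
	void	   *data;			/* callback-private state */
	DemoEntry	queue[DEMO_QUEUE_SIZE];
	uint64_t	queueIndex;		/* next entry for the consumer */
	uint64_t	queueEnd;		/* first empty position */
	int			target;			/* prefetch distance */
	bool		done;			/* callback reported end of scan */
	int			nprefetched;	/* requests that would have been issued */
} DemoPrefetch;

/* State of a toy "index-only scan" callback. */
typedef struct DemoIOSState
{
	uint32_t	next_block;
	uint32_t	nblocks;
	DemoEntry	current;
} DemoIOSState;

static bool demo_all_visible = true;
static bool demo_not_all_visible = false;

/* Toy callback: pretend that even-numbered blocks are all-visible. */
static DemoEntry *
demo_ios_next(DemoPrefetch *p)
{
	DemoIOSState *state = (DemoIOSState *) p->data;
	bool		all_visible;

	if (state->next_block >= state->nblocks)
		return NULL;

	all_visible = (state->next_block % 2 == 0);

	state->current.block = state->next_block++;
	state->current.prefetch = !all_visible;
	state->current.data = all_visible ? &demo_all_visible : &demo_not_all_visible;

	return &state->current;
}

/* Fill the queue up to the target, "prefetching" where the callback asks. */
static void
demo_fill(DemoPrefetch *p)
{
	assert(p->target <= DEMO_QUEUE_SIZE);

	while (!p->done && p->queueEnd - p->queueIndex < (uint64_t) p->target)
	{
		DemoEntry  *entry = p->next_cb(p);

		if (entry == NULL)
		{
			p->done = true;
			return;
		}

		/* store a copy of the entry, including the callback's payload */
		p->queue[p->queueEnd++ % DEMO_QUEUE_SIZE] = *entry;

		/* this is where a real implementation would issue the request */
		if (entry->prefetch)
			p->nprefetched++;
	}
}
```

The point of carrying the payload in the queue is that the consumer can read the stored result later instead of repeating the check; in this sketch that is just a pointer to a bool.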

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 26d3ec20b63..7c28dafb728 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -792,7 +792,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		if (indexScan != NULL)
 		{
-			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot, NULL))
+			if (!index_getnext_slot(indexScan, ForwardScanDirection, slot))
 				break;
 
 			/* Since we used no scan keys, should never need to recheck */
diff --git a/src/backend/access/index/README.prefetch b/src/backend/access/index/README.prefetch
new file mode 100644
index 00000000000..2a6ac4a0eea
--- /dev/null
+++ b/src/backend/access/index/README.prefetch
@@ -0,0 +1,3 @@
+- index heap prefetch overview
+- 
+- callback - decision whether to prefetch, possibility to keep data
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 72e7c9f206c..4ca12006843 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -509,7 +509,7 @@ systable_getnext(SysScanDesc sysscan)
 
 	if (sysscan->irel)
 	{
-		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot, NULL))
+		if (index_getnext_slot(sysscan->iscan, ForwardScanDirection, sysscan->slot))
 		{
 			bool		shouldFree;
 
@@ -713,7 +713,7 @@ systable_getnext_ordered(SysScanDesc sysscan, ScanDirection direction)
 	HeapTuple	htup = NULL;
 
 	Assert(sysscan->irel);
-	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot, NULL))
+	if (index_getnext_slot(sysscan->iscan, direction, sysscan->slot))
 		htup = ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
 
 	/* See notes in systable_getnext */
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index f96aeba1b39..cdad3f4c6f9 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -49,7 +49,6 @@
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/transam.h"
-#include "access/visibilitymap.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "catalog/pg_amproc.h"
@@ -64,6 +63,7 @@
 #include "utils/lsyscache.h"
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
+#include "utils/spccache.h"
 #include "utils/syscache.h"
 
 
@@ -111,13 +111,16 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
 											  int nkeys, int norderbys, Snapshot snapshot,
 											  ParallelIndexScanDesc pscan, bool temp_snap);
 
-static void index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
-								IndexPrefetch *prefetch);
-static ItemPointer index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
-										  IndexPrefetch *prefetch, bool *all_visible);
-static void index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
-						   ItemPointer tid, bool skip_all_visible, bool *all_visible);
-
+/* index prefetching of heap pages */
+static void index_prefetch_tids(IndexScanDesc scan,
+								IndexPrefetch *prefetch,
+								ScanDirection direction);
+static IndexPrefetchEntry *index_prefetch_get_entry(IndexScanDesc scan,
+												    IndexPrefetch *prefetch,
+													ScanDirection direction);
+static void index_prefetch_heap_page(IndexScanDesc scan,
+									 IndexPrefetch *prefetch,
+									 IndexPrefetchEntry *entry);
 
 /* ----------------------------------------------------------------
  *				   index_ interface functions
@@ -546,8 +549,8 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
  * or NULL if no more matching tuples exist.
  * ----------------
  */
-static ItemPointer
-index_getnext_tid_internal(IndexScanDesc scan, ScanDirection direction)
+ItemPointer
+index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 {
 	bool		found;
 
@@ -643,24 +646,22 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
  * Note: caller must check scan->xs_recheck, and perform rechecking of the
  * scan keys if required.  We do not do that here because we don't have
  * enough information to do it efficiently in the general case.
+ *
+ * XXX This does not support prefetching of heap pages. When such prefetching is
+ * desirable, use index_getnext_tid().
  * ----------------
  */
 bool
-index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot,
-				   IndexPrefetch *prefetch)
+index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
 {
 	for (;;)
 	{
-		/* Do prefetching (if requested/enabled). */
-		index_prefetch_tids(scan, direction, prefetch);
-
 		if (!scan->xs_heap_continue)
 		{
-			ItemPointer	tid;
-			bool		all_visible;
+			ItemPointer tid;
 
 			/* Time to fetch the next TID from the index */
-			tid = index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
+			tid = index_getnext_tid(scan, direction);
 
 			/* If we're out of index entries, we're done */
 			if (tid == NULL)
@@ -1339,13 +1340,9 @@ index_prefetch_add_cache(IndexPrefetch *prefetch, BlockNumber block)
  * value once in a while, and see what happens.
  */
 static void
-index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
-			   ItemPointer tid, bool skip_all_visible, bool *all_visible)
+index_prefetch_heap_page(IndexScanDesc scan, IndexPrefetch *prefetch, IndexPrefetchEntry *entry)
 {
-	BlockNumber block;
-
-	/* by default not all visible (or we didn't check) */
-	*all_visible = false;
+	BlockNumber block = ItemPointerGetBlockNumber(&entry->tid);
 
 	/*
 	 * No heap relation means bitmap index scan, which does prefetching at the
@@ -1355,6 +1352,8 @@ index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
 	 * XXX But in this case we should have prefetchMaxTarget=0, because in
 	 * index_bebinscan_bitmap() we disable prefetching. So maybe we should
 	 * just check that.
+	 *
+	 * XXX Comment/check seems obsolete.
 	 */
 	if (!prefetch)
 		return;
@@ -1362,37 +1361,10 @@ index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
 	/*
 	 * If we got here, prefetching is enabled and it's a node that supports
 	 * prefetching (i.e. it can't be a bitmap index scan).
-	 */
-	Assert(scan->heapRelation);
-
-	block = ItemPointerGetBlockNumber(tid);
-
-	/*
-	 * When prefetching for IOS, we want to only prefetch pages that are not
-	 * marked as all-visible (because not fetching all-visible pages is the
-	 * point of IOS).
 	 *
-	 * XXX This is not great, because it releases the VM buffer for each TID
-	 * we consider to prefetch. We should reuse that somehow, similar to the
-	 * actual IOS code. Ideally, we should use the same ioss_VMBuffer (if
-	 * we can propagate it here). Or at least do it for a bulk of prefetches,
-	 * although that's not very useful - after the ramp-up we will prefetch
-	 * the pages one by one anyway.
-	 *
-	 * XXX Ideally we'd also propagate this to the executor, so that the
-	 * nodeIndexonlyscan.c doesn't need to repeat the same VM check (which
-	 * is measurable). But the index_getnext_tid() is not really well
-	 * suited for that, so the API needs a change.s
+	 * XXX Comment/check seems obsolete.
 	 */
-	if (skip_all_visible)
-	{
-		*all_visible = VM_ALL_VISIBLE(scan->heapRelation,
-									  block,
-									  &prefetch->vmBuffer);
-
-		if (*all_visible)
-			return;
-	}
+	Assert(scan->heapRelation);
 
 	/*
 	 * Do not prefetch the same block over and over again,
@@ -1412,42 +1384,12 @@ index_prefetch(IndexScanDesc scan, IndexPrefetch *prefetch,
 	prefetch->countAll++;
 }
 
-/* ----------------
- * index_getnext_tid - get the next TID from a scan
- *
- * The result is the next TID satisfying the scan keys,
- * or NULL if no more matching tuples exist.
- *
- * FIXME not sure this handles xs_heapfetch correctly.
- * ----------------
+/*
+ * index_prefetch_tids
+ *		Fill the prefetch queue and issue necessary prefetch requests.
  */
-ItemPointer
-index_getnext_tid(IndexScanDesc scan, ScanDirection direction,
-				  IndexPrefetch *prefetch)
-{
-	bool		all_visible;	/* ignored */
-
-	/* Do prefetching (if requested/enabled). */
-	index_prefetch_tids(scan, direction, prefetch);
-
-	/* Read the TID from the queue (or directly from the index). */
-	return index_prefetch_get_tid(scan, direction, prefetch, &all_visible);
-}
-
-ItemPointer
-index_getnext_tid_vm(IndexScanDesc scan, ScanDirection direction,
-					 IndexPrefetch *prefetch, bool *all_visible)
-{
-	/* Do prefetching (if requested/enabled). */
-	index_prefetch_tids(scan, direction, prefetch);
-
-	/* Read the TID from the queue (or directly from the index). */
-	return index_prefetch_get_tid(scan, direction, prefetch, all_visible);
-}
-
 static void
-index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
-					IndexPrefetch *prefetch)
+index_prefetch_tids(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
 {
 	/*
 	 * If the prefetching is still active (i.e. enabled and we still
@@ -1473,43 +1415,46 @@ index_prefetch_tids(IndexScanDesc scan, ScanDirection direction,
 		 */
 		while (!PREFETCH_FULL(prefetch))
 		{
-			ItemPointer tid;
-			bool		all_visible;
+			IndexPrefetchEntry *entry = prefetch->next_cb(scan, prefetch, direction);
 
-			/* Time to fetch the next TID from the index */
-			tid = index_getnext_tid_internal(scan, direction);
-
-			/*
-			 * If we're out of index entries, we're done (and we mark the
-			 * the prefetcher as inactive).
-			 */
-			if (tid == NULL)
+			/* no more entries in this index scan */
+			if (entry == NULL)
 			{
 				prefetch->prefetchDone = true;
-				break;
+				return;
 			}
 
-			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+			Assert(ItemPointerEquals(&entry->tid, &scan->xs_heaptid));
 
-			/*
-			 * Issue the actuall prefetch requests for the new TID.
-			 *
-			 * XXX index_getnext_tid_prefetch is only called for IOS (for now),
-			 * so skip prefetching of all-visible pages.
-			 */
-			index_prefetch(scan, prefetch, tid, prefetch->indexonly, &all_visible);
+			/* store the entry and then maybe issue the prefetch request */
+			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd++)] = *entry;
 
-			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].tid = *tid;
-			prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)].all_visible = all_visible;
-			prefetch->queueEnd++;
+			/* issue the prefetch request? */
+			if (entry->prefetch)
+				index_prefetch_heap_page(scan, prefetch, entry);
 		}
 	}
 }
 
-static ItemPointer
-index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
-					   IndexPrefetch *prefetch, bool *all_visible)
+/*
+ * index_prefetch_get_entry
+ *		Get the next entry from the prefetch queue (or from the index directly).
+ *
+ * If prefetching is enabled, get next entry from the prefetch queue (unless
+ * queue is empty). With prefetching disabled, read an entry directly from the
+ * index scan.
+ *
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this? Although, that
+ * should be in the indexscan next_cb callback, probably.
+ *
+ * XXX If xs_heap_continue=true, we need to return the last TID.
+ */
+static IndexPrefetchEntry *
+index_prefetch_get_entry(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
 {
+	IndexPrefetchEntry *entry = NULL;
+
 	/*
 	 * With prefetching enabled (even if we already finished reading
 	 * all TIDs from the index scan), we need to return a TID from the
@@ -1522,25 +1467,98 @@ index_prefetch_get_tid(IndexScanDesc scan, ScanDirection direction,
 		if (PREFETCH_DONE(prefetch))
 			return NULL;
 
-		scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
-		*all_visible = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].all_visible;
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
+		entry->data = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].data;
+
 		prefetch->queueIndex++;
+
+		scan->xs_heaptid = entry->tid;
 	}
 	else				/* not prefetching, just do the regular work  */
 	{
 		ItemPointer tid;
 
 		/* Time to fetch the next TID from the index */
-		tid = index_getnext_tid_internal(scan, direction);
-		*all_visible = false;
+		tid = index_getnext_tid(scan, direction);
 
 		/* If we're out of index entries, we're done */
 		if (tid == NULL)
 			return NULL;
 
 		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = scan->xs_heaptid;
+		entry->data = NULL;
 	}
 
-	/* Return the TID of the tuple we found. */
-	return &scan->xs_heaptid;
+	return entry;
+}
+
+int
+index_heap_prefetch_target(Relation heapRel, double plan_rows, bool allow_prefetch)
+{
+	/*
+	 * XXX No prefetching for direct I/O.
+	 *
+	 * XXX Shouldn't we do prefetching even for direct I/O? For now we would
+	 * only pretend to do it, because we'd not call posix_fadvise(), but once
+	 * the code starts loading into shared buffers, that'd work.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+		return 0;
+
+	/* disable prefetching for cursors etc. */
+	if (!allow_prefetch)
+		return 0;
+
+	/*
+	 * Determine number of heap pages to prefetch for this index. This is
+	 * essentially just effective_io_concurrency for the table (or the
+	 * tablespace it's in).
+	 *
+	 * XXX Should this also look at plan.plan_rows and maybe cap the target
+	 * to that? Pointless to prefetch more than we expect to use. Or maybe
+	 * just reset to that value during prefetching, after reading the next
+	 * index page (or rather after rescan)?
+	 *
+	 * XXX Maybe reduce the value with parallel workers?
+	 */
+	return Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+			   plan_rows);
+}
+
+IndexPrefetch *
+IndexPrefetchAlloc(IndexPrefetchNextCB next_cb, int prefetch_max, void *data)
+{
+	IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+	prefetch->queueIndex = 0;
+	prefetch->queueStart = 0;
+	prefetch->queueEnd = 0;
+
+	prefetch->prefetchTarget = 0;
+	prefetch->prefetchMaxTarget = prefetch_max;
+
+	/*
+	 * Install the callback and its private data, e.g. so that IOS can check
+	 * the visibility map once and keep the result instead of repeating it.
+	 */
+	prefetch->next_cb = next_cb;
+	prefetch->data = data;
+
+	return prefetch;
+}
+
+IndexPrefetchEntry *
+IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	/* Do prefetching (if requested/enabled). */
+	index_prefetch_tids(scan, prefetch, direction);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return index_prefetch_get_entry(scan, prefetch, direction);
 }
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0a136db6712..2fa2118f3c2 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -777,11 +777,7 @@ retry:
 	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
-	/*
-	 * XXX Would be nice to also benefit from prefetching here. All we need to
-	 * do is instantiate the prefetcher, I guess.
-	 */
-	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot, NULL))
+	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
 	{
 		TransactionId xwait;
 		XLTW_Oper	reason_wait;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9498b00fa64..81f27042bc4 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -212,13 +212,8 @@ retry:
 
 	index_rescan(scan, skey, skey_attoff, NULL, 0);
 
-	/*
-	 * Try to find the tuple
-	 *
-	 * XXX Would be nice to also benefit from prefetching here. All we need to
-	 * do is instantiate the prefetcher, I guess.
-	 */
-	while (index_getnext_slot(scan, ForwardScanDirection, outslot, NULL))
+	/* Try to find the tuple */
+	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
 	{
 		/*
 		 * Avoid expensive equality check if the index is primary key or
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index a7eadaf3db2..af7dd364f33 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -43,12 +43,13 @@
 #include "storage/predicate.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-#include "utils/spccache.h"
 
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
 							TupleDesc itupdesc);
-
+static IndexPrefetchEntry *IndexOnlyPrefetchNext(IndexScanDesc scan,
+												 IndexPrefetch *prefetch,
+												 ScanDirection direction);
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -66,7 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	TupleTableSlot *slot;
 	ItemPointer tid;
 	IndexPrefetch  *prefetch;
-	bool			all_visible;
+	IndexPrefetchEntry *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -76,6 +77,12 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * Determine which direction to scan the index in based on the plan's scan
 	 * direction and the current direction of execution.
+	 *
+	 * XXX Could this be an issue for the prefetching? What if we prefetch something
+	 * but the direction changes before we get to the read? If that could happen,
+	 * maybe we should discard the prefetched data and go back? But can we even
+	 * do that, if we already fetched some TIDs from the index? I don't think
+	 * indexorderdir can change, but es_direction maybe can?
 	 */
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir);
@@ -119,10 +126,15 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid_vm(scandesc, direction, prefetch, &all_visible)) != NULL)
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
+		bool	   *all_visible = NULL;
 		bool		tuple_from_heap = false;
 
+		/* unpack the entry */
+		tid = &entry->tid;
+		all_visible = (bool *) entry->data;	/* result of visibility check */
+
 		CHECK_FOR_INTERRUPTS();
 
 		/*
@@ -161,7 +173,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 *
 		 * XXX Skip if we already know the page is all visible from prefetcher.
 		 */
-		if (!all_visible &&
+		if (!(all_visible && *all_visible) &&
 			!VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
@@ -367,6 +379,9 @@ ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
 		prefetch->queueIndex = 0;
 		prefetch->queueStart = 0;
 		prefetch->queueEnd = 0;
+
+		prefetch->prefetchDone = false;
+		prefetch->prefetchTarget = 0;
 	}
 
 	ExecScanReScan(&node->ss);
@@ -401,6 +416,8 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 	{
 		IndexPrefetch *prefetch = node->ioss_prefetch;
 
+		Buffer *buffer = (Buffer *) prefetch->data;
+
 		/* XXX some debug info */
 		elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
 			 prefetch->countAll,
@@ -409,10 +426,10 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 			 prefetch->countSkipCached,
 			 prefetch->countSkipSequential);
 
-		if (prefetch->vmBuffer != InvalidBuffer)
+		if (*buffer != InvalidBuffer)
 		{
-			ReleaseBuffer(prefetch->vmBuffer);
-			prefetch->vmBuffer = InvalidBuffer;
+			ReleaseBuffer(*buffer);
+			*buffer = InvalidBuffer;
 		}
 	}
 
@@ -512,6 +529,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	Relation	currentRelation;
 	LOCKMODE	lockmode;
 	TupleDesc	tupDesc;
+	int			prefetch_max;
 
 	/*
 	 * create state structure
@@ -641,61 +659,33 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Also initialize index prefetcher.
+	 * Also initialize index prefetcher. We do this even when prefetching is
+	 * not done (see index_heap_prefetch_target), because the
+	 * prefetcher is used for all index reads.
+	 *
+	 * We reach here if the index only scan is not parallel, or if we're
+	 * serially executing an index only scan that was planned to be
+	 * parallel.
 	 *
-	 * XXX No prefetching for direct I/O.
+	 * XXX Maybe we should enable prefetching, but prefetch only pages that
+	 * are not all-visible (but checking that from the index code seems like
+	 * a violation of layering etc).
+	 *
+	 * XXX This might lead to IOS being slower than plain index scan, if the
+	 * table has a lot of pages that need recheck.
+	 *
+	 * Remember this is index-only scan, because of prefetching. Not the most
+	 * elegant way to pass this info.
+	 *
+	 * XXX Maybe rename the object to "index reader" or something?
 	 */
-	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
-	{
-		int			prefetch_max;
-		Relation    heapRel = indexstate->ss.ss_currentRelation;
-
-		/*
-		 * Determine number of heap pages to prefetch for this index. This is
-		 * essentially just effective_io_concurrency for the table (or the
-		 * tablespace it's in).
-		 *
-		 * XXX Should this also look at plan.plan_rows and maybe cap the target
-		 * to that? Pointless to prefetch more than we expect to use. Or maybe
-		 * just reset to that value during prefetching, after reading the next
-		 * index page (or rather after rescan)?
-		 *
-		 * XXX Maybe reduce the value with parallel workers?
-		 */
-		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
-						   indexstate->ss.ps.plan->plan_rows);
-
-		/*
-		 * We reach here if the index only scan is not parallel, or if we're
-		 * serially executing an index only scan that was planned to be
-		 * parallel.
-		 *
-		 * XXX Maybe we should enable prefetching, but prefetch only pages that
-		 * are not all-visible (but checking that from the index code seems like
-		 * a violation of layering etc).
-		 *
-		 * XXX This might lead to IOS being slower than plain index scan, if the
-		 * table has a lot of pages that need recheck.
-		 *
-		 * Remember this is index-only scan, because of prefetching. Not the most
-		 * elegant way to pass this info.
-		 */
-		if (prefetch_max > 0)
-		{
-			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+	prefetch_max = index_heap_prefetch_target(indexstate->ss.ss_currentRelation,
+											  indexstate->ss.ps.plan->plan_rows,
+											  node->allow_prefetch);
 
-			prefetch->queueIndex = 0;
-			prefetch->queueStart = 0;
-			prefetch->queueEnd = 0;
-
-			prefetch->prefetchTarget = 0;
-			prefetch->prefetchMaxTarget = prefetch_max;
-			prefetch->vmBuffer = InvalidBuffer;
-			prefetch->indexonly = true;
-
-			indexstate->ioss_prefetch = prefetch;
-		}
-	}
+	indexstate->ioss_prefetch = IndexPrefetchAlloc(IndexOnlyPrefetchNext,
+												   prefetch_max,
+												   palloc0(sizeof(Buffer)));
 
 	/*
 	 * all done.
@@ -808,3 +798,42 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 }
+
+/*
+ * When prefetching for IOS, we want to only prefetch pages that are not
+ * marked as all-visible (because not fetching all-visible pages is the
+ * point of IOS).
+ *
+ * The buffer used by the VM_ALL_VISIBLE() check is reused, similarly to
+ * ioss_VMBuffer (maybe we could/should use it here too?). We also keep
+ * the result of the all_visible flag, so that the main loop does not need
+ * to do it again.
+ */
+static IndexPrefetchEntry *
+IndexOnlyPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer			tid;
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		BlockNumber	blkno = ItemPointerGetBlockNumber(tid);
+
+		bool	all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+											 blkno,
+											 (Buffer *) prefetch->data);
+
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch only if not all visible */
+		entry->prefetch = !all_visible;
+
+		/* store the all_visible flag in the private part of the entry */
+		entry->data = palloc(sizeof(bool));
+		*(bool *) entry->data = all_visible;
+	}
+
+	return entry;
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index b3282ec5a75..bd65337270c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,7 +43,6 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
-#include "utils/spccache.h"
 
 /*
  * When an ordering operator is used, tuples fetched from the index that
@@ -70,6 +69,9 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
 							  Datum *orderbyvals, bool *orderbynulls);
 static HeapTuple reorderqueue_pop(IndexScanState *node);
 
+static IndexPrefetchEntry *IndexScanPrefetchNext(IndexScanDesc scan,
+												 IndexPrefetch *prefetch,
+												 ScanDirection direction);
 
 /* ----------------------------------------------------------------
  *		IndexNext
@@ -87,6 +89,7 @@ IndexNext(IndexScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	IndexPrefetch  *prefetch;
+	IndexPrefetchEntry  *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -131,10 +134,19 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot, prefetch))
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scandesc->xs_heaptid));
+		if (!index_fetch_heap(scandesc, slot))
+			continue;
+
 		/*
 		 * If the index was lossy, we have to recheck the index quals using
 		 * the fetched tuple.
@@ -180,7 +192,6 @@ IndexNextWithReorder(IndexScanState *node)
 	Datum	   *lastfetched_vals;
 	bool	   *lastfetched_nulls;
 	int			cmp;
-	IndexPrefetch *prefetch;
 
 	estate = node->ss.ps.state;
 
@@ -197,7 +208,6 @@ IndexNextWithReorder(IndexScanState *node)
 	Assert(ScanDirectionIsForward(estate->es_direction));
 
 	scandesc = node->iss_ScanDesc;
-	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
@@ -264,7 +274,7 @@ IndexNextWithReorder(IndexScanState *node)
 		 * Fetch next tuple from the index.
 		 */
 next_indextuple:
-		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot, prefetch))
+		if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
 		{
 			/*
 			 * No more tuples from the index.  But we still need to drain any
@@ -601,6 +611,9 @@ ExecReScanIndexScan(IndexScanState *node)
 		prefetch->queueIndex = 0;
 		prefetch->queueStart = 0;
 		prefetch->queueEnd = 0;
+
+		prefetch->prefetchDone = false;
+		prefetch->prefetchTarget = 0;
 	}
 
 	ExecScanReScan(&node->ss);
@@ -917,6 +930,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	IndexScanState *indexstate;
 	Relation	currentRelation;
 	LOCKMODE	lockmode;
+	int			prefetch_max;
 
 	/*
 	 * create state structure
@@ -1095,43 +1109,33 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Also initialize index prefetcher.
+	 * Also initialize index prefetcher. We do this even when prefetching is
+	 * not done (see index_heap_prefetch_target), because the
+	 * prefetcher is used for all index reads.
+	 *
+	 * We reach here if the index scan is not parallel, or if we're
+	 * serially executing an index scan that was planned to be
+	 * parallel.
+	 *
+	 * XXX Maybe we should enable prefetching, but prefetch only pages that
+	 * are not all-visible (but checking that from the index code seems like
+	 * a violation of layering etc).
 	 *
-	 * XXX No prefetching for direct I/O.
+	 * XXX This might lead to IOS being slower than plain index scan, if the
+	 * table has a lot of pages that need recheck.
+	 *
+	 * Remember this is index-only scan, because of prefetching. Not the most
+	 * elegant way to pass this info.
+	 *
+	 * XXX Maybe rename the object to "index reader" or something?
 	 */
-	if ((io_direct_flags & IO_DIRECT_DATA) == 0)
-	{
-		int	prefetch_max;
-		Relation    heapRel = indexstate->ss.ss_currentRelation;
-
-		/*
-		 * Determine number of heap pages to prefetch for this index scan. This
-		 * is essentially just effective_io_concurrency for the table (or the
-		 * tablespace it's in).
-		 *
-		 * XXX Should this also look at plan.plan_rows and maybe cap the target
-		 * to that? Pointless to prefetch more than we expect to use. Or maybe
-		 * just reset to that value during prefetching, after reading the next
-		 * index page (or rather after rescan)?
-		 */
-		prefetch_max = Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
-						   indexstate->ss.ps.plan->plan_rows);
-
-		if (prefetch_max > 0)
-		{
-			IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
-
-			prefetch->queueIndex = 0;
-			prefetch->queueStart = 0;
-			prefetch->queueEnd = 0;
+	prefetch_max = index_heap_prefetch_target(indexstate->ss.ss_currentRelation,
+											  indexstate->ss.ps.plan->plan_rows,
+											  node->allow_prefetch);
 
-			prefetch->prefetchTarget = 0;
-			prefetch->prefetchMaxTarget = prefetch_max;
-			prefetch->vmBuffer = InvalidBuffer;
-
-			indexstate->iss_prefetch = prefetch;
-		}
-	}
+	indexstate->iss_prefetch = IndexPrefetchAlloc(IndexScanPrefetchNext,
+												  prefetch_max,
+												  NULL);
 
 	/*
 	 * all done.
@@ -1795,3 +1799,26 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 					 node->iss_ScanKeys, node->iss_NumScanKeys,
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 }
+
+/*
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this?
+ */
+static IndexPrefetchEntry *
+IndexScanPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer			tid;
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch always */
+		entry->prefetch = true;
+	}
+
+	return entry;
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 34ca6d4ac21..0abbcd31ddd 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -184,13 +184,15 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 								 Oid indexid, List *indexqual, List *indexqualorig,
 								 List *indexorderby, List *indexorderbyorig,
 								 List *indexorderbyops,
-								 ScanDirection indexscandir);
+								 ScanDirection indexscandir,
+								 bool allow_prefetch);
 static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
 										 Index scanrelid, Oid indexid,
 										 List *indexqual, List *recheckqual,
 										 List *indexorderby,
 										 List *indextlist,
-										 ScanDirection indexscandir);
+										 ScanDirection indexscandir,
+										 bool allow_prefetch);
 static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
 											  List *indexqual,
 											  List *indexqualorig);
@@ -3161,6 +3163,13 @@ create_indexscan_plan(PlannerInfo *root,
 		}
 	}
 
+	/*
+	 * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
+	 * of a misuse of the flag, but we need to disable prefetching for cursors
+	 * (which might change direction), and parallelModeOK does that. But maybe
+	 * we might (or should) have a separate flag.
+	 */
+
 	/* Finally ready to build the plan node */
 	if (indexonly)
 		scan_plan = (Scan *) make_indexonlyscan(tlist,
@@ -3171,7 +3180,8 @@ create_indexscan_plan(PlannerInfo *root,
 												stripped_indexquals,
 												fixed_indexorderbys,
 												indexinfo->indextlist,
-												best_path->indexscandir);
+												best_path->indexscandir,
+												root->glob->parallelModeOK);
 	else
 		scan_plan = (Scan *) make_indexscan(tlist,
 											qpqual,
@@ -3182,7 +3192,8 @@ create_indexscan_plan(PlannerInfo *root,
 											fixed_indexorderbys,
 											indexorderbys,
 											indexorderbyops,
-											best_path->indexscandir);
+											best_path->indexscandir,
+											root->glob->parallelModeOK);
 
 	copy_generic_path_info(&scan_plan->plan, &best_path->path);
 
@@ -5522,7 +5533,8 @@ make_indexscan(List *qptlist,
 			   List *indexorderby,
 			   List *indexorderbyorig,
 			   List *indexorderbyops,
-			   ScanDirection indexscandir)
+			   ScanDirection indexscandir,
+			   bool allow_prefetch)
 {
 	IndexScan  *node = makeNode(IndexScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5539,6 +5551,7 @@ make_indexscan(List *qptlist,
 	node->indexorderbyorig = indexorderbyorig;
 	node->indexorderbyops = indexorderbyops;
 	node->indexorderdir = indexscandir;
+	node->allow_prefetch = allow_prefetch;
 
 	return node;
 }
@@ -5552,7 +5565,8 @@ make_indexonlyscan(List *qptlist,
 				   List *recheckqual,
 				   List *indexorderby,
 				   List *indextlist,
-				   ScanDirection indexscandir)
+				   ScanDirection indexscandir,
+				   bool allow_prefetch)
 {
 	IndexOnlyScan *node = makeNode(IndexOnlyScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5568,6 +5582,7 @@ make_indexonlyscan(List *qptlist,
 	node->indexorderby = indexorderby;
 	node->indextlist = indextlist;
 	node->indexorderdir = indexscandir;
+	node->allow_prefetch = allow_prefetch;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index b5c79359425..e11d022827a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6297,7 +6297,7 @@ get_actual_variable_endpoint(Relation heapRel,
 	index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 	/* Fetch first/next tuple in specified direction */
-	while ((tid = index_getnext_tid(index_scan, indexscandir, NULL)) != NULL)
+	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
 	{
 		BlockNumber block = ItemPointerGetBlockNumber(tid);
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c0c46d7a05f..f3452e8a799 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -129,7 +129,7 @@ typedef struct IndexOrderByDistance
 	bool		isnull;
 } IndexOrderByDistance;
 
-
+/* index prefetching - probably should be somewhere else, outside indexam */
 
 /*
  * Cache of recently prefetched blocks, organized as a hash table of
@@ -159,6 +159,8 @@ typedef struct IndexPrefetchCacheEntry {
 
 /*
  * Used to detect sequential patterns (and disable prefetching).
+ *
+ * XXX seems strange to have two separate values
  */
 #define		PREFETCH_QUEUE_HISTORY			8
 #define		PREFETCH_SEQ_PATTERN_BLOCKS		4
@@ -166,9 +168,38 @@ typedef struct IndexPrefetchCacheEntry {
 typedef struct IndexPrefetchEntry
 {
 	ItemPointerData		tid;
-	bool				all_visible;
+
+	/* should we prefetch heap page for this TID? */
+	bool				prefetch;
+
+	/*
+	 * If a callback is specified, it may store per-tid information. The
+	 * data has to be a single palloc-ed piece of data, so that it can
+	 * be easily pfreed.
+	 *
+	 * XXX We could relax this by providing another cleanup callback, but
+	 * that seems unnecessarily complex - we expect the information to be
+	 * very simple, like bool flags or something. Easy to do in a simple
+	 * struct, and perhaps even reuse without pfree/palloc.
+	 */
+	void			    *data;
 } IndexPrefetchEntry;
 
+/* needs to be before IndexPrefetchNextCB typedef */
+typedef struct IndexPrefetch IndexPrefetch;
+
+/*
+ * custom callback, allowing the user code to determine which TID to read
+ *
+ * If there is no TID to prefetch, the return value is expected to be NULL.
+ *
+ * Otherwise the "tid" field is expected to contain the TID to prefetch, and
+ * "data" may be set to custom information the callback needs to pass outside.
+ */
+typedef IndexPrefetchEntry *(*IndexPrefetchNextCB) (IndexScanDesc scan,
+													IndexPrefetch *state,
+													ScanDirection direction);
+
 typedef struct IndexPrefetch
 {
 	/*
@@ -187,9 +218,18 @@ typedef struct IndexPrefetch
 	uint64		countSkipSequential;
 	uint64		countSkipCached;
 
-	/* used when prefetching index-only scans */
-	bool		indexonly;
-	Buffer		vmBuffer;
+	/*
+	 * If a callback is specified, it may store global state (for all TIDs).
+	 * For example VM buffer may be kept during IOS. This is similar to the
+	 * data field in IndexPrefetchEntry, but that's per-TID.
+	 */
+	void	   *data;
+
+	/*
+	 * Callback to customize the prefetch (decide which blocks need to be
+	 * prefetched, etc.)
+	 */
+	IndexPrefetchNextCB	next_cb;
 
 	/*
 	 * Queue of TIDs to prefetch.
@@ -224,14 +264,22 @@ typedef struct IndexPrefetch
 
 } IndexPrefetch;
 
+IndexPrefetch *IndexPrefetchAlloc(IndexPrefetchNextCB next_cb,
+								  int prefetch_max, void *data);
+
+IndexPrefetchEntry *IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *state, ScanDirection direction);
+
 #define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
 #define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
 #define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
 #define PREFETCH_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
 #define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+
+/* XXX easy to confuse with PREFETCH_ENABLED */
 #define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
 #define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
 
+int index_heap_prefetch_target(Relation heapRel, double plan_rows, bool allow_prefetch);
 
 /*
  * generalized index_ interface routines (in indexam.c)
@@ -278,17 +326,11 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  Relation indexrel, int nkeys, int norderbys,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
-									 ScanDirection direction,
-									 IndexPrefetch *prefetch);
-extern ItemPointer index_getnext_tid_vm(IndexScanDesc scan,
-										ScanDirection direction,
-										IndexPrefetch *prefetch,
-										bool *all_visible);
+									 ScanDirection direction);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
-							   struct TupleTableSlot *slot,
-							   IndexPrefetch *prefetch);
+							   struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8745453a5b4..cc891d4fccf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1529,7 +1529,6 @@ typedef struct
 	bool	   *elem_nulls;		/* array of num_elems is-null flags */
 } IndexArrayKeyInfo;
 
-
 /* ----------------
  *	 IndexScanState information
  *
@@ -1582,6 +1581,7 @@ typedef struct IndexScanState
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
 
+	/* prefetching */
 	IndexPrefetch *iss_prefetch;
 } IndexScanState;
 
@@ -1621,6 +1621,8 @@ typedef struct IndexOnlyScanState
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+
+	/* prefetching */
 	IndexPrefetch *ioss_prefetch;
 } IndexOnlyScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d40af8e59fe..bc1029982cf 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -454,6 +454,7 @@ typedef struct IndexScan
 	List	   *indexorderbyorig;	/* the same in original form */
 	List	   *indexorderbyops;	/* OIDs of sort ops for ORDER BY exprs */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	bool		allow_prefetch;	/* allow prefetching of heap pages */
 } IndexScan;
 
 /* ----------------
@@ -496,6 +497,7 @@ typedef struct IndexOnlyScan
 	List	   *indexorderby;	/* list of index ORDER BY exprs */
 	List	   *indextlist;		/* TargetEntry list describing index's cols */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	bool		allow_prefetch;	/* allow prefetching of heap pages */
 } IndexOnlyScan;
 
 /* ----------------
-- 
2.43.0




* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2024-01-09 20:31           ` Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Robert Haas @ 2024-01-09 20:31 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
<[email protected]> wrote:
> Here's a somewhat reworked version of the patch. My initial goal was to
> see if it could adopt the StreamingRead API proposed in [1], but that
> turned out to be less straight-forward than I hoped, for two reasons:

I guess we need Thomas or Andres or maybe Melanie to comment on this.

> Perhaps a bigger change is that I decided to move this into a separate
> API on top of indexam.c. The original idea was to integrate this into
> index_getnext_tid/index_getnext_slot, so that all callers benefit from
> the prefetching automatically. Which would be nice, but it also meant
> it'd need to happen in the indexam.c code, which seemed dirty.

This patch is hard to review right now because there's a bunch of
comment updating that doesn't seem to have been done for the new
design. For instance:

+ * XXX This does not support prefetching of heap pages. When such prefetching is
+ * desirable, use index_getnext_tid().

But not any more.

+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a somewhat wrong). Also, maybe we should consider the filter selectivity

I'm not sure whether all the problems in this area are solved, but I
think you've solved enough of them that this at least needs rewording,
if not removing.

+     * XXX Comment/check seems obsolete.

This occurs in two places. I'm not sure if it's accurate or not.

+     * XXX Could this be an issue for the prefetching? What if we prefetch something
+     * but the direction changes before we get to the read? If that could happen,
+     * maybe we should discard the prefetched data and go back? But can we even
+     * do that, if we already fetched some TIDs from the index? I don't think
+     * indexorderdir can't change, but es_direction maybe can?

But your email claims that "The patch simply disables prefetching for
such queries, using the same logic that we do for parallelism." FWIW,
I think that's a fine way to handle that case.

+     * XXX Maybe we should enable prefetching, but prefetch only pages that
+     * are not all-visible (but checking that from the index code seems like
+     * a violation of layering etc).

Isn't this fixed now? Note this comment occurs twice.

+     * XXX We need to disable this in some cases (e.g. when using index-only
+     * scans, we don't want to prefetch pages). Or maybe we should prefetch
+     * only pages that are not all-visible, that'd be even better.

Here again.

And now for some comments on other parts of the patch, mostly other
XXX comments:

+ * XXX This does not support prefetching of heap pages. When such prefetching is
+ * desirable, use index_getnext_tid().

There's probably no reason to write XXX here. The comment is fine.

+     * XXX Notice we haven't added the block to the block queue yet, and there
+     * is a preceding block (i.e. blockIndex-1 is valid).

Same here, possibly? If this XXX indicates a defect in the code, I
don't know what the defect is, so I guess it needs to be more clear.
If it is just explaining the code, then there's no reason for the
comment to say XXX.

+     * XXX Could it be harmful that we read the queue backwards? Maybe memory
+     * prefetching works better for the forward direction?

It does. But I don't know whether that matters here or not.

+             * XXX We do add the cache size to the request in order not to
+             * have issues with uint64 underflows.

I don't know what this means.
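
One possible reading, offered only as a guess: the check asks whether a
block was requested within the last N requests, and writing that as
"request >= current - N" on unsigned values wraps around while fewer than
N requests have been issued. Adding N to the other side avoids the
subtraction. A minimal sketch of that reading in C, where DEMO_CACHE_SIZE
and both function names are made up for illustration and are not taken
from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window size, counted in prefetch requests. */
#define DEMO_CACHE_SIZE 1024

/*
 * Naive form: "current - DEMO_CACHE_SIZE" wraps to a huge value whenever
 * current < DEMO_CACHE_SIZE, so recent requests look stale.
 */
static bool
demo_recent_naive(uint64_t request, uint64_t current)
{
	return request >= current - DEMO_CACHE_SIZE;
}

/*
 * Rearranged form: the window size is added to the request number, so no
 * unsigned subtraction takes place.
 */
static bool
demo_recent_safe(uint64_t request, uint64_t current)
{
	assert(request <= current);
	return request + DEMO_CACHE_SIZE >= current;
}
```

If that is the intent, the comment could say so directly, e.g. by naming
the comparison that is being avoided.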

+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this? Although, that
+ * should be in the indexscan next_cb callback, probably.
+ *
+ * XXX If xs_heap_continue=true, we need to return the last TID.

You've got a bunch of comments about xs_heap_continue here -- and I
don't fully understand what the issues are here with respect to this
particular patch, but I think that the general purpose of
xs_heap_continue is to handle the case where we need to return more
than one tuple from the same HOT chain. With an MVCC snapshot that
doesn't happen, but with say SnapshotAny or SnapshotDirty, it could.
As far as possible, the prefetcher shouldn't be involved at all when
xs_heap_continue is set, I believe, because in that case we're just
returning a bunch of tuples from the same page, and the extra fetches
from that heap page shouldn't trigger or require any further
prefetching.

+     * XXX Should this also look at plan.plan_rows and maybe cap the target
+     * to that? Pointless to prefetch more than we expect to use. Or maybe
+     * just reset to that value during prefetching, after reading the next
+     * index page (or rather after rescan)?

It seems questionable to use plan_rows here because (1) I don't think
we have existing cases where we use the estimated row count in the
executor for anything, we just carry it through so EXPLAIN can print
it and (2) row count estimates can be really far off, especially if
we're on the inner side of a nested loop, we might like to figure that
out eventually instead of just DTWT forever. But on the other hand
this does feel like an important case where we have a clue that
prefetching might need to be done less aggressively or not at all, and
it doesn't seem right to ignore that signal either. I wonder if we
want this shaped in some other way, like a Boolean that says
are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
side of a semi-join or anti-join.

+     * We reach here if the index only scan is not parallel, or if we're
+     * serially executing an index only scan that was planned to be
+     * parallel.

Well, this seems sad.

+     * XXX This might lead to IOS being slower than plain index scan, if the
+     * table has a lot of pages that need recheck.

How?

+    /*
+     * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
+     * of a misuse of the flag, but we need to disable prefetching for cursors
+     * (which might change direction), and parallelModeOK does that. But maybe
+     * we might (or should) have a separate flag.
+     */

I think the correct flag to be using here is execute_once, which
captures whether the executor could potentially be invoked a second
time for the same portal. Changes in the fetch direction are possible
if and only if !execute_once.

> Note 1: The IndexPrefetch name is a bit misleading, because it's used
> even with prefetching disabled - all index reads from the index scan
> happen through it. Maybe it should be called IndexReader or something
> like that.

My biggest gripe here is the capitalization. This version adds, inter
alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
index_heap_prefetch_target, which seems like one or two too many
conventions. But maybe the PREFETCH_* macros don't even belong in a
public header.

I do like the index_heap_prefetch_* naming. Possibly that's too
verbose to use for everything, but calling this index-heap-prefetch
rather than index-prefetch seems clearer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
@ 2024-01-12 16:42             ` Tomas Vondra <[email protected]>
  2024-01-12 16:52               ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
  0 siblings, 2 replies; 25+ messages in thread

From: Tomas Vondra @ 2024-01-12 16:42 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.

I'll briefly go through the main changes in the patch, and then will
respond in-line to Robert's points.


1) I moved the code from indexam.c to (new) execPrefetch.c. All the
prototypes / typedefs now live in executor.h, with only minimal changes
in execnodes.h (adding it to scan descriptors).

I believe this finally moves the code to the right place - it feels much
nicer and cleaner than in indexam.c.  And it allowed me to hide a bunch
of internal structs and improve the general API, I think.

I'm sure there's stuff that could be named differently, but the layering
feels about right, I think.


2) A bunch of stuff got renamed to start with IndexPrefetch... to make
the naming consistent / clearer. I'm not entirely sure IndexPrefetch is
the right name, though - it's still a bit misleading, as it might seem
it's about prefetching index stuff, but really it's about heap pages
from indexes. Maybe IndexScanPrefetch() or something like that?


3) If there's a way to make this work with the streaming I/O API, I'm
not aware of it. But the overall design seems somewhat similar (based on
"next" callback etc.) so hopefully that'd make it easier to adopt it.


4) I initially relied on parallelModeOK to disable prefetching, which
kinda worked, but not really. Robert suggested using the execute_once
flag directly, and I think that's much better - not only is it cleaner,
it also seems more appropriate (the parallel flag considers other stuff
that is not quite relevant to prefetching).

Thinking about this, I think it should be possible to make prefetching
work even for plans with execute_once=false. In particular, when the
plan changes direction it should be possible to simply "walk back" the
prefetch queue, to get to the "correct" place in the scan. But I'm
not sure it's worth it, because plans that change direction often can't
really benefit from prefetches anyway - they'll often visit stuff they
accessed shortly before anyway. For plans that don't change direction
but may pause, we don't know if the plan pauses long enough for the
prefetched pages to get evicted or something. So I think it's OK that
execute_once=false means no prefetching.


5) I haven't done anything about the xs_heap_continue=true case yet.


6) I went through all the comments and reworked them considerably. The
main comment at the start of execPrefetch.c describes the overall design,
and then there are comments for each function, explaining that bit in
more detail. Or at least that's the goal - there's still work to do.

There are two trivial FIXMEs, but you can ignore those - it's not that
there's a bug, but that I'd like to rework something and just don't know
how yet.

There's also a couple of XXX comments. Some are a bit wild ideas for the
future, others are somewhat "open questions" to be discussed during a
review.

Anyway, there should be no outright obsolete comments - if there's
something I missed, let me know.


Now to Robert's message ...


On 1/9/24 21:31, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> <[email protected]> wrote:
>> Here's a somewhat reworked version of the patch. My initial goal was to
>> see if it could adopt the StreamingRead API proposed in [1], but that
>> turned out to be less straight-forward than I hoped, for two reasons:
> 
> I guess we need Thomas or Andres or maybe Melanie to comment on this.
> 

Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
streaming I/O stuff.

>> Perhaps a bigger change is that I decided to move this into a separate
>> API on top of indexam.c. The original idea was to integrate this into
>> index_getnext_tid/index_getnext_slot, so that all callers benefit from
>> the prefetching automatically. Which would be nice, but it also meant
>> it'd need to happen in the indexam.c code, which seemed dirty.
> 
> This patch is hard to review right now because there's a bunch of
> comment updating that doesn't seem to have been done for the new
> design. For instance:
> 
> + * XXX This does not support prefetching of heap pages. When such prefetching is
> + * desirable, use index_getnext_tid().
> 
> But not any more.
> 

True. And this is now even more obsolete, as the prefetching was moved
from the indexam.c layer to the executor.

> + * XXX The prefetching may interfere with the patch allowing us to evaluate
> + * conditions on the index tuple, in which case we may not need the heap
> + * tuple. Maybe if there's such filter, we should prefetch only pages that
> + * are not all-visible (and the same idea would also work for IOS), but
> + * it also makes the indexing a bit "aware" of the visibility stuff (which
> + * seems a somewhat wrong). Also, maybe we should consider the filter selectivity
> 
> I'm not sure whether all the problems in this area are solved, but I
> think you've solved enough of them that this at least needs rewording,
> if not removing.
> 
> +     * XXX Comment/check seems obsolete.
> 
> This occurs in two places. I'm not sure if it's accurate or not.
> 
> +     * XXX Could this be an issue for the prefetching? What if we prefetch something
> +     * but the direction changes before we get to the read? If that could happen,
> +     * maybe we should discard the prefetched data and go back? But can we even
> +     * do that, if we already fetched some TIDs from the index? I don't think
> +     * indexorderdir can't change, but es_direction maybe can?
> 
> But your email claims that "The patch simply disables prefetching for
> such queries, using the same logic that we do for parallelism." FWIW,
> I think that's a fine way to handle that case.
> 

True. I left behind this comment partly intentionally, to point out why
we disable the prefetching in these cases, but you're right the comment
now explains something that can't happen.

> +     * XXX Maybe we should enable prefetching, but prefetch only pages that
> +     * are not all-visible (but checking that from the index code seems like
> +     * a violation of layering etc).
> 
> Isn't this fixed now? Note this comment occurs twice.
> 
> +     * XXX We need to disable this in some cases (e.g. when using index-only
> +     * scans, we don't want to prefetch pages). Or maybe we should prefetch
> +     * only pages that are not all-visible, that'd be even better.
> 
> Here again.
> 

Sorry, you're right those comments (and a couple more nearby) were
stale. Removed / clarified.

> And now for some comments on other parts of the patch, mostly other
> XXX comments:
> 
> + * XXX This does not support prefetching of heap pages. When such prefetching is
> + * desirable, use index_getnext_tid().
> 
> There's probably no reason to write XXX here. The comment is fine.
> 
> +     * XXX Notice we haven't added the block to the block queue yet, and there
> +     * is a preceding block (i.e. blockIndex-1 is valid).
> 
> Same here, possibly? If this XXX indicates a defect in the code, I
> don't know what the defect is, so I guess it needs to be more clear.
> If it is just explaining the code, then there's no reason for the
> comment to say XXX.
> 

Yeah, removed the XXX / reworded a bit.

> +     * XXX Could it be harmful that we read the queue backwards? Maybe memory
> +     * prefetching works better for the forward direction?
> 
> It does. But I don't know whether that matters here or not.
> 
> +             * XXX We do add the cache size to the request in order not to
> +             * have issues with uint64 underflows.
> 
> I don't know what this means.
> 

There's a check that does this:

      (x + PREFETCH_CACHE_SIZE) >= y

It could also be written in the "mathematically equivalent" form

      x >= (y - PREFETCH_CACHE_SIZE)

but if "y" is a uint64 and its value is smaller than the constant, the
subtraction would underflow (wrap around to a huge value). The problem
would eventually disappear once "y" gets large enough, ofc.
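
To make the wrap-around concrete, here is a small standalone sketch. It
is not code from the patch: check_add(), check_sub() and the CACHE_SIZE
value of 1024 (standing in for PREFETCH_CACHE_SIZE) are names assumed
purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for PREFETCH_CACHE_SIZE (8 * 128 in the patch). */
#define CACHE_SIZE 1024

/* Form used by the check: the constant is added on the left-hand side. */
static bool
check_add(uint64_t x, uint64_t y)
{
	return (x + CACHE_SIZE) >= y;
}

/*
 * "Mathematically equivalent" form. With unsigned arithmetic the
 * subtraction wraps around whenever y < CACHE_SIZE, so the comparison
 * is made against a huge value and gives the wrong answer.
 */
static bool
check_sub(uint64_t x, uint64_t y)
{
	return x >= (y - CACHE_SIZE);
}
```

With x = 5 and y = 10 the first form is true, while the second one
compares 5 against a wrapped-around value close to UINT64_MAX and is
false. Once y exceeds the constant, both forms agree.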

> + * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
> + * maybe nodeIndexscan needs to do something more to handle this? Although, that
> + * should be in the indexscan next_cb callback, probably.
> + *
> + * XXX If xs_heap_continue=true, we need to return the last TID.
> 
> You've got a bunch of comments about xs_heap_continue here -- and I
> don't fully understand what the issues are here with respect to this
> particular patch, but I think that the general purpose of
> xs_heap_continue is to handle the case where we need to return more
> than one tuple from the same HOT chain. With an MVCC snapshot that
> doesn't happen, but with say SnapshotAny or SnapshotDirty, it could.
> As far as possible, the prefetcher shouldn't be involved at all when
> xs_heap_continue is set, I believe, because in that case we're just
> returning a bunch of tuples from the same page, and the extra fetches
> from that heap page shouldn't trigger or require any further
> prefetching.
> 

Yes, that's correct. The current code simply ignores that flag and just
proceeds to the next TID. Which is correct for xs_heap_continue=false,
and thus all MVCC snapshots work fine. But for the Any/Dirty case it
needs to work a bit differently.
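
One possible shape for that, as a toy model only: when the caller says
it is still working through the same HOT chain, hand back the last TID
again without advancing the queue or issuing any prefetch. The ToyTid
and ToyPrefetcher types and toy_next() below are hypothetical and do
not exist in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these types or names come from the patch. */
typedef struct ToyTid
{
	unsigned	block;
	unsigned	offset;
} ToyTid;

typedef struct ToyPrefetcher
{
	const ToyTid *tids;			/* TIDs as they would come from the index */
	size_t		ntids;
	size_t		next;			/* next queue position to hand out */
	ToyTid		last;			/* last TID returned to the caller */
	bool		have_last;
	unsigned	advances;		/* how many times the queue advanced */
} ToyPrefetcher;

static const ToyTid *
toy_next(ToyPrefetcher *p, bool heap_continue)
{
	/* Same HOT chain: repeat the last TID, leave the queue alone. */
	if (heap_continue && p->have_last)
		return &p->last;

	if (p->next >= p->ntids)
		return NULL;			/* index exhausted */

	p->last = p->tids[p->next++];
	p->have_last = true;
	p->advances++;				/* only here would prefetching happen */
	return &p->last;
}
```

The point of the sketch is only that the heap_continue path never
touches the queue, so it cannot trigger further prefetching.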

> +     * XXX Should this also look at plan.plan_rows and maybe cap the target
> +     * to that? Pointless to prefetch more than we expect to use. Or maybe
> +     * just reset to that value during prefetching, after reading the next
> +     * index page (or rather after rescan)?
> 
> It seems questionable to use plan_rows here because (1) I don't think
> we have existing cases where we use the estimated row count in the
> executor for anything, we just carry it through so EXPLAIN can print
> it and (2) row count estimates can be really far off, especially if
> we're on the inner side of a nested loop, we might like to figure that
> out eventually instead of just DTWT forever. But on the other hand
> this does feel like an important case where we have a clue that
> prefetching might need to be done less aggressively or not at all, and
> it doesn't seem right to ignore that signal either. I wonder if we
> want this shaped in some other way, like a Boolean that says
> are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
> side of a semi-join or anti-join.
> 

The current code actually does look at plan_rows when calculating the
prefetch target:

  prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
                                            node->ss.ps.plan->plan_rows,
                                            estate->es_use_prefetching);

but I agree maybe it should not, for the reasons you explain. I'm not
attached to this part.
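
For the sake of discussion, the capping idea amounts to something like
the sketch below. This is an assumption about the general shape, not
the body of IndexPrefetchComputeTarget(); toy_compute_target() and its
io_concurrency parameter are made up for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch: start from the I/O concurrency setting and cap
 * the prefetch distance at the planner's row estimate.
 */
static int
toy_compute_target(int io_concurrency, double plan_rows, bool use_prefetching)
{
	if (!use_prefetching || io_concurrency <= 0)
		return 0;				/* prefetching disabled */

	/* Pointless to prefetch further than the rows we expect to return. */
	if (plan_rows < (double) io_concurrency)
		return (plan_rows < 1.0) ? 1 : (int) plan_rows;

	return io_concurrency;
}
```

The weakness Robert describes is visible here: if plan_rows is badly
underestimated, the target stays small for the whole scan.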


> +     * We reach here if the index only scan is not parallel, or if we're
> +     * serially executing an index only scan that was planned to be
> +     * parallel.
> 
> Well, this seems sad.
> 

Stale comment, I believe. However, I didn't see much benefit with
parallel index scans during testing. Having I/O from multiple workers
generally had the same effect, I think.

> +     * XXX This might lead to IOS being slower than plain index scan, if the
> +     * table has a lot of pages that need recheck.
> 
> How?
> 

The comment is not particularly clear about what "this" means, but I
believe it was about index-only scans with many not-all-visible pages.
If IOS didn't do prefetching, a regular index scan with prefetching
might be way faster. But the code actually allows prefetching even for
IOS, by checking the VM in the "next" callback.
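
Roughly, the idea looks like the toy model below: the visibility check
happens once per TID while filling the queue, the result is stored
next to the TID, and only not-all-visible pages get prefetched.
ToyEntry, toy_fill_queue() and the bool array standing in for the
visibility map are illustrative assumptions, not the patch's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy queue entry: the heap block plus the remembered VM check result. */
typedef struct ToyEntry
{
	unsigned	block;
	bool		all_visible;
} ToyEntry;

/*
 * Fill the queue from a list of heap blocks. vm[] is a stand-in for the
 * visibility map, indexed by block number. Returns how many prefetches
 * would have been issued.
 */
static size_t
toy_fill_queue(const unsigned *blocks, size_t nblocks,
			   const bool *vm, ToyEntry *queue)
{
	size_t		nprefetch = 0;

	for (size_t i = 0; i < nblocks; i++)
	{
		queue[i].block = blocks[i];
		queue[i].all_visible = vm[blocks[i]];	/* checked exactly once */

		/* All-visible pages are never read by IOS, so skip them. */
		if (!queue[i].all_visible)
			nprefetch++;		/* real code would call PrefetchBuffer() */
	}

	return nprefetch;
}
```

The executor side can then reuse queue[i].all_visible instead of
checking the VM a second time.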

> +    /*
> +     * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
> +     * of a misuse of the flag, but we need to disable prefetching for cursors
> +     * (which might change direction), and parallelModeOK does that. But maybe
> +     * we might (or should) have a separate flag.
> +     */
> 
> I think the correct flag to be using here is execute_once, which
> captures whether the executor could potentially be invoked a second
> time for the same portal. Changes in the fetch direction are possible
> if and only if !execute_once.
> 

Right. The new patch version does that.

>> Note 1: The IndexPrefetch name is a bit misleading, because it's used
>> even with prefetching disabled - all index reads from the index scan
>> happen through it. Maybe it should be called IndexReader or something
>> like that.
> 
> My biggest gripe here is the capitalization. This version adds, inter
> alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
> index_heap_prefetch_target, which seems like one or two too many
> conventions. But maybe the PREFETCH_* macros don't even belong in a
> public header.
> 
> I do like the index_heap_prefetch_* naming. Possibly that's too
> verbose to use for everything, but calling this index-heap-prefetch
> rather than index-prefetch seems clearer.
> 

Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
to keep it consistent. And then the constants are all capital, ofc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

  [text/x-patch] v20240112-0001-Prefetch-heap-pages-during-index-scans.patch (54.3K, 2-v20240112-0001-Prefetch-heap-pages-during-index-scans.patch)
From a3f99cc0aaa64ef94b09fc0a58bee709cd29add9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240112] Prefetch heap pages during index scans

Index scans are a significant source of random I/O on the indexed heap,
but can't benefit from kernel read-ahead. For bitmap scans that is not
an issue, because they do prefetch explicitly, but for plain index scans
this is a major bottleneck - reading one page at a time does not allow
saturating modern storage systems.

This enhances index scans (including index-only scans) to prefetch heap
pages. The scan maintains a queue of future TIDs received from an index,
prefetches the associated heap pages, and then eventually passes each
TID to the caller.

To eliminate unnecessary prefetches, a small cache of recent prefetches
is maintained, and blocks found in it are not prefetched again.
Furthermore, sequential access patterns are detected and not prefetched,
on the assumption that the kernel read-ahead will do this more
efficiently.

These optimizations are best-effort heuristics - we don't know if the
kernel will actually prefetch the pages on its own, and we can't easily
check that. Moreover, different kernels (and kernel versions) may behave
differently.

Note: For shared buffers we can easily check if a page is cached, and
the PrefetchBuffer() function already takes care of that. These
optimizations are primarily about the page cache.

The prefetching is also disabled for plans that may be executed more than
once - these plans may change direction, interfering with the prefetch
queue. Consider scrollable cursors with backwards scans. This might get
improved to allow the prefetcher to handle direction changes, but it's
not clear if it's worth it.

Note: If a plan changes the scan direction, that inherently wastes the
issued prefetches. If the direction changes often, it likely means a lot
of the pages are still cached. Similarly, if a plan pauses for a long
time, the already prefetched pages may get evicted.
---
 src/backend/commands/explain.c           |  18 +
 src/backend/executor/Makefile            |   1 +
 src/backend/executor/execMain.c          |  12 +
 src/backend/executor/execPrefetch.c      | 884 +++++++++++++++++++++++
 src/backend/executor/instrument.c        |   4 +
 src/backend/executor/nodeIndexonlyscan.c | 113 ++-
 src/backend/executor/nodeIndexscan.c     |  68 +-
 src/include/executor/executor.h          |  52 ++
 src/include/executor/instrument.h        |   2 +
 src/include/nodes/execnodes.h            |  10 +
 src/tools/pgindent/typedefs.list         |   3 +
 11 files changed, 1162 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/executor/execPrefetch.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 3d590a6b9f5..9bbe270ab7d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3568,6 +3568,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 										!INSTR_TIME_IS_ZERO(usage->local_blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 									   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
 												  has_local || has_temp ||
 												  has_shared_timing ||
@@ -3679,6 +3680,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+				appendStringInfo(es->str, " blocks=%lld",
+								 (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+				appendStringInfo(es->str, " rounds=%lld",
+								 (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..840f5a6596a 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -24,6 +24,7 @@ OBJS = \
 	execMain.o \
 	execParallel.o \
 	execPartition.o \
+	execPrefetch.o \
 	execProcnode.o \
 	execReplication.o \
 	execSRF.o \
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 13a9b7da83b..e3e9131bd62 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1645,6 +1645,18 @@ ExecutePlan(EState *estate,
 	 */
 	estate->es_direction = direction;
 
+	/*
+	 * Enable prefetching only if the plan is executed exactly once. We need
+	 * to disable prefetching for cases when the scan direction may change
+	 * (e.g. for scrollable cursors).
+	 *
+	 * XXX It might be possible to improve the prefetching code to handle this
+	 * by "walking back" the TID queue, but it's not clear if it's worth it.
+	 * And if there are pauses in between the fetches, the prefetched pages may
+	 * get evicted, wasting the prefetch effort.
+	 */
+	estate->es_use_prefetching = execute_once;
+
 	/*
 	 * If the plan might potentially be executed multiple times, we must force
 	 * it to run without parallelism, because we might exit early.
diff --git a/src/backend/executor/execPrefetch.c b/src/backend/executor/execPrefetch.c
new file mode 100644
index 00000000000..000bb796d51
--- /dev/null
+++ b/src/backend/executor/execPrefetch.c
@@ -0,0 +1,884 @@
+/*-------------------------------------------------------------------------
+ *
+ * execPrefetch.c
+ *	  routines for prefetching heap pages for index scans.
+ *
+ * The IndexPrefetch node represents an "index prefetcher" which reads TIDs
+ * from an index scan, and prefetches the referenced heap pages. The basic
+ * API consists of these methods:
+ *
+ *	IndexPrefetchAlloc - allocate IndexPrefetch with custom callbacks
+ *	IndexPrefetchNext - read next TID from the index scan, do prefetches
+ *	IndexPrefetchReset - reset state of the prefetcher (for rescans)
+ *	IndexPrefetchEnd - release resources held by the prefetcher
+ *
+ * When allocating a prefetcher, the caller can supply two custom callbacks:
+ *
+ *	IndexPrefetchNextCB - reads the next TID from the index scan (required)
+ *	IndexPrefetchCleanupCB - release private prefetch data (optional)
+ *
+ * These callbacks allow customizing the behavior for different types of
+ * index scans - for example index-only scans may inspect the visibility map,
+ * and adjust prefetches based on that.
+ *
+ *
+ * TID queue
+ * ---------
+ * The prefetcher maintains a simple queue of TIDs fetched from the index.
+ * The length of the queue (number of TIDs) is determined by the prefetch
+ * target, i.e. effective_io_concurrency. Adding entries to the queue is
+ * the responsibility of IndexPrefetchFillQueue(), depending on the state
+ * of the scan etc. It also prefetches the pages, if appropriate.
+ *
+ * Note: This prefetching applies only to heap pages from the indexed
+ * relation, not the internal index pages.
+ *
+ *
+ * pattern detection
+ * -----------------
+ * For certain access patterns, prefetching is inefficient. In particular,
+ * this applies to sequential access (where kernel read-ahead works fine)
+ * and for pages that are already in memory (prefetched recently). The
+ * prefetcher attempts to identify these two cases - sequential patterns
+ * are detected by IndexPrefetchBlockIsSequential, using a tiny queue of
+ * recently prefetched blocks. Recently prefetched blocks are tracked in
+ * a "partitioned" LRU cache.
+ *
+ * Note: These are inherently best-effort heuristics. We don't know what
+ * the kernel algorithm/configuration is, or more precisely what already
+ * is in page cache.
+ *
+ *
+ * cache of recent prefetches
+ * --------------------------
+ * Cache of recently prefetched blocks, organized as a hash table of small
+ * LRU caches. Doesn't need to be perfectly accurate, but we aim to make
+ * false positives/negatives reasonably low. For more details see the
+ * comments at IndexPrefetchIsCached.
+ *
+ *
+ * prefetch request number
+ * -----------------------
+ * Prefetching works with the concept of "age" (e.g. "recently prefetched
+ * pages"). This relies on a simple prefetch counter, incremented every
+ * time a prefetch is issued. This is not exactly the same thing as time,
+ * as there may be arbitrary delays, but it's good enough for this purpose.
+ *
+ *
+ * auto-tuning / self-adjustment
+ * -----------------------------
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of rescans and number of
+ * items (TIDs) actually returned from the scan. Then we could calculate
+ * rows / rescan and adjust the prefetch target accordingly. That'd help
+ * with cases when a scan matches only very few rows, far less than the
+ * prefetchTarget, because the unnecessary prefetches are wasted I/O.
+ * Imagine a LIMIT on top of index scan, or something like that.
+ *
+ * XXX Could we tune the cache size based on execution statistics? We have
+ * a cache of limited size (PREFETCH_CACHE_SIZE = 1024 by default), but
+ * how do we know it's the right size? Ideally, we'd have a cache large
+ * enough to track actually cached blocks. If the OS caches 10240 pages,
+ * then we may do 90% of prefetch requests unnecessarily. Or maybe there's
+ * a lot of contention, blocks are evicted quickly, and 90% of the blocks
+ * in the cache are not actually cached anymore? But we do have a concept
+ * of sequential request ID (PrefetchCacheEntry->request), which gives us
+ * information about "age" of the last prefetch. Now it's used only when
+ * evicting entries (to keep the more recent one), but maybe we could also
+ * use it when deciding if the page is cached. Right now any block that's
+ * in the cache is considered cached and not prefetched, but maybe we could
+ * have "max age", and tune it based on feedback from reading the blocks
+ * later. For example, if we find the block in cache and decide not to
+ * prefetch it, but then later find we have to do I/O, it means our cache
+ * is too large. And we could "reduce" the maximum age (measured from the
+ * current prefetchRequest value), so that only more recent blocks would
+ * be considered cached. Not sure about the opposite direction, where we
+ * decide to prefetch a block - AFAIK we don't have a way to determine if
+ * I/O was needed or not in this case (so we can't increase the max age).
+ * But maybe we could do that somehow speculatively, i.e. increase the
+ * value once in a while, and see what happens.
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execPrefetch.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "access/xact.h"
+#include "catalog/index.h"
+#include "common/hashfn.h"
+#include "executor/executor.h"
+#include "nodes/nodeFuncs.h"
+#include "storage/bufmgr.h"
+#include "utils/spccache.h"
+
+
+/*
+ * An entry representing a recently prefetched block. For each block we know
+ * the request number, assigned sequentially, allowing us to decide how old
+ * the request is.
+ *
+ * XXX Is it enough to keep the request as uint32? This way we can prefetch
+ * 32TB of data, and this allows us to fit a whole LRU (8 entries) into 64B,
+ * i.e. one cacheline. Which seems like a good thing.
+ *
+ * XXX If we're extra careful / paranoid about uint32, we could reset the
+ * cache once the request wraps around.
+ */
+typedef struct IndexPrefetchCacheEntry
+{
+	BlockNumber block;
+	uint32		request;
+} IndexPrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too small or
+ * too large. 1024 entries seems about right, it covers ~8MB of data. This is
+ * rather arbitrary - there's no formula that'd tell us what the optimal size
+ * is, and we can't even tune it based on runtime (as it depends on what the
+ * other backends do too).
+ *
+ * A value too small would mean we may issue unnecessary prefetches for pages
+ * that have already been prefetched recently (and are still in page cache),
+ * incurring costs for unnecessary fadvise() calls.
+ *
+ * A value too large would mean we do not issue prefetches for pages that have
+ * already been evicted from memory (both shared buffers and page cache).
+ *
+ * Note however that PrefetchBuffer() checks shared buffers before doing the
+ * fadvise call, which somewhat limits the risk of a small cache - the page
+ * would have to get evicted from shared buffers but not yet from page cache.
+ * Also, the cost of not issuing a fadvise call (and doing synchronous I/O
+ * later) is much higher than the unnecessary fadvise call. For these reasons
+ * it's better to keep the cache fairly small.
+ *
+ * The cache is structured as an array of small LRU caches - you may also
+ * imagine it as a hash table of LRU caches. To remember a prefetched block,
+ * the block number is mapped to an LRU by hashing. And then in each LRU
+ * we organize the entries by age (per request number) - in particular, the
+ * age determines which entry gets evicted after the LRU gets full.
+ *
+ * The LRU needs to be small enough to be searched linearly. At the same
+ * time it needs to be sufficiently large to handle collisions when several
+ * hot blocks get mapped to the same LRU. For example, if the LRU was only
+ * a single entry, and there were two hot blocks mapped to it, that would
+ * often give an incorrect answer.
+ *
+ * The 8 entries per LRU seems about right - it's small enough for linear
+ * search to work well, but large enough to be adaptive. It's not very
+ * likely for 9+ busy blocks (out of 1000 recent requests) to map to the
+ * same LRU. Assuming reasonable hash function.
+ *
+ * XXX Maybe we could consider effective_cache_size when sizing the cache?
+ * Not to size the cache for that, ofc, but maybe as a guidance of how many
+ * heap pages it might keep. Maybe just a fraction of the value,
+ * say Max(8MB, effective_cache_size / max_connections) or something.
+ */
+#define		PREFETCH_LRU_SIZE		8	/* slots in one LRU */
+#define		PREFETCH_LRU_COUNT		128 /* number of LRUs */
+#define		PREFETCH_CACHE_SIZE		(PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Size of small sequential queue of most recently prefetched blocks, used
+ * to check if the block is exactly the same as the immediately preceding
+ * one (in which case prefetching is not needed), and if the blocks are a
+ * sequential pattern (in which case the kernel read-ahead is likely going
+ * to be more efficient, and we don't want to interfere with it).
+ */
+#define		PREFETCH_QUEUE_HISTORY	8
+
+/*
+ * An index prefetcher, which maintains a queue of TIDs from an index, and
+ * issues prefetches (if deemed beneficial and supported by the OS).
+ */
+typedef struct IndexPrefetch
+{
+	int			prefetchTarget; /* how far we should be prefetching */
+	int			prefetchMaxTarget;	/* maximum prefetching distance */
+	int			prefetchReset;	/* reset to this distance on rescan */
+	bool		prefetchDone;	/* did we get all TIDs from the index? */
+
+	/* runtime statistics, displayed in EXPLAIN etc. */
+	uint32		countAll;		/* all prefetch requests (including skipped) */
+	uint32		countPrefetch;	/* PrefetchBuffer calls */
+	uint32		countSkipSequential;	/* skipped as sequential pattern */
+	uint32		countSkipCached;	/* skipped as recently prefetched */
+
+	/*
+	 * Queue of TIDs to prefetch.
+	 *
+	 * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+	 * than dynamically adjusting for custom values. However, 1000 entries
+	 * means ~16kB, which means an oversized chunk, and thus always a malloc()
+	 * call. However, we already have the prefetchCache, which is also large
+	 * enough to cause this :-(
+	 *
+	 * XXX However what about the case without prefetching? In that case it
+	 * would be nice to lower the malloc overhead, maybe?
+	 */
+	IndexPrefetchEntry queueItems[MAX_IO_CONCURRENCY];
+	uint32		queueIndex;		/* next TID to prefetch */
+	uint32		queueStart;		/* first valid TID in queue */
+	uint32		queueEnd;		/* first invalid (empty) TID in queue */
+
+	/*
+	 * The last few prefetched blocks, used to detect certain access
+	 * patterns and skip prefetching (e.g. for sequential access).
+	 *
+	 * XXX Separate from the main queue, because we only want to compare the
+	 * block numbers, not the whole TID. In sequential access it's likely we
+	 * read many items from each page, and we don't want to check many items
+	 * (as that is much more expensive).
+	 */
+	BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
+	uint32		blockIndex;		/* index in the block (points to the first
+								 * empty entry) */
+
+	/*
+	 * Cache of recently prefetched blocks, organized as a hash table of small
+	 * LRU caches.
+	 */
+	uint32		prefetchRequest;
+	IndexPrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
+
+	/*
+	 * Callback to customize the prefetch (decide which blocks need to be
+	 * prefetched, etc.)
+	 */
+	IndexPrefetchNextCB next_cb;	/* read next TID */
+	IndexPrefetchCleanupCB cleanup_cb;	/* cleanup data */
+
+	/*
+	 * If a callback is specified, it may store global state (for all TIDs).
+	 * For example VM buffer may be kept during IOS. This is similar to the
+	 * data field in IndexPrefetchEntry, but that's per-TID.
+	 */
+	void	   *data;
+} IndexPrefetch;
+
+/* small sequential queue of recent blocks */
+#define PREFETCH_BLOCK_INDEX(v)	((v) % PREFETCH_QUEUE_HISTORY)
+
+/* access to the main hybrid cache (hash of LRUs) */
+#define PREFETCH_LRU_ENTRY(p, lru, idx)	\
+	&((p)->prefetchCache[(lru) * PREFETCH_LRU_SIZE + (idx)])
+
+/* access to queue of TIDs (up to MAX_IO_CONCURRENCY elements) */
+#define PREFETCH_QUEUE_INDEX(a)	((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p)	((p)->queueEnd == (p)->queueIndex)
+
+/*
+ * macros to deal with prefetcher state
+ *
+ * FIXME may need rethinking, easy to confuse PREFETCH_ENABLED/PREFETCH_ACTIVE
+ */
+#define PREFETCH_ENABLED(p)		((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_QUEUE_FULL(p)		((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p)		((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p)		(PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+
+
+/*
+ * IndexPrefetchBlockIsSequential
+ *		Track the block number and check if the I/O pattern is sequential,
+ *		or if the block is the same as the immediately preceding one.
+ *
+ * This also updates the small sequential cache of blocks.
+ *
+ * The prefetching overhead is fairly low, but for some access patterns the
+ * benefits are small compared to the extra overhead, or the prefetching may
+ * even be harmful. In particular, for sequential access the read-ahead
+ * performed by the OS is very effective/efficient and our prefetching may
+ * be pointless or (worse) even interfere with it.
+ *
+ * This identifies simple sequential patterns, using a tiny queue of recently
+ * prefetched block numbers (PREFETCH_QUEUE_HISTORY blocks). It also checks
+ * if the block is exactly the same as the immediately preceding one (the
+ * main cache has it too, but checking the tiny queue is likely cheaper).
+ *
+ * The main prefetch queue is not really useful for this, as it stores
+ * full TIDs, while we only care about block numbers. Consider a nicely
+ * clustered table, with a perfectly sequential pattern when accessed through
+ * an index. Each heap page may have dozens of TIDs, filling the prefetch
+ * queue. But we need to compare block numbers - those may either not be
+ * in the queue anymore, or we have to walk many TIDs (making it expensive,
+ * and we're in hot path).
+ *
+ * So a tiny queue of just block numbers seems like a better option.
+ *
+ * Returns true if the block is in a sequential pattern or was prefetched
+ * recently (and so should not be prefetched this time), or false (in which
+ * case it should be prefetched).
+ */
+static bool
+IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
+{
+	int			idx;
+
+	/*
+	 * If the block queue is empty, just store the block and we're done (it's
+	 * neither a sequential pattern, nor a recently prefetched block).
+	 */
+	if (prefetch->blockIndex == 0)
+	{
+		prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+		prefetch->blockIndex++;
+		return false;
+	}
+
+	/*
+	 * Check if it's the same as the immediately preceding block. We don't
+	 * want to prefetch the same block over and over (which would happen for
+	 * well correlated indexes).
+	 *
+	 * In principle we could rely on IndexPrefetchIsCached doing this using
+	 * the full cache, but this check is much cheaper and we need to look at
+	 * the preceding block anyway, so we just do it.
+	 *
+	 * Notice we haven't added the block to the block queue yet, and there
+	 * is a preceding block (i.e. blockIndex-1 is valid).
+	 */
+	if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+		return true;
+
+	/*
+	 * Add the block number to the small queue.
+	 *
+	 * Done before checking if the pattern is sequential, because we want to
+	 * know about the block later, even if we end up skipping the prefetch.
+	 * Otherwise we'd not be able to detect longer sequential patterns - we'd
+	 * skip one block and then fail to skip the next couple blocks even in a
+	 * perfectly sequential pattern. And this oscillation might even prevent
+	 * the OS read-ahead from kicking in.
+	 */
+	prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+	prefetch->blockIndex++;
+
+	/*
+	 * Are there enough requests to confirm a sequential pattern? We only
+	 * consider something to be sequential after finding a sequence of
+	 * PREFETCH_QUEUE_HISTORY blocks.
+	 */
+	if (prefetch->blockIndex < PREFETCH_QUEUE_HISTORY)
+		return false;
+
+	/*
+	 * Check if the last couple blocks are in a sequential pattern. We look
+	 * for a sequential pattern of PREFETCH_QUEUE_HISTORY (8 by default), so
+	 * we look for patterns of 8 pages (64kB) including the new block.
+	 *
+	 * XXX Could it be harmful that we read the queue backwards? Maybe memory
+	 * prefetching works better for the forward direction?
+	 */
+	for (int i = 1; i < PREFETCH_QUEUE_HISTORY; i++)
+	{
+		/*
+		 * Calculate index of the earlier block (we need to do -1 as we
+		 * already incremented the index after adding the new block to the
+		 * queue). So (blockIndex-1) is the new block.
+		 */
+		idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+		/*
+		 * For a sequential pattern, the block "k" steps ago needs to have a
+		 * block number smaller by "k" than the current block.
+		 */
+		if (prefetch->blockItems[idx] != (block - i))
+			return false;
+
+		/* XXX Dead check: here blockItems[idx] is (block - i), never block. */
+		if (prefetch->blockItems[idx] == block)
+			return false;
+	}
+
+	/* all preceding blocks match: sequential pattern, skip the prefetch */
+	return true;
+}
+
+/*
+ * IndexPrefetchIsCached
+ *		Check if the block was prefetched recently, and update the cache.
+ *
+ * We don't want to prefetch blocks that we already prefetched recently. It's
+ * cheap but not free, and the overhead may be quite significant.
+ *
+ * We want to remember which blocks were prefetched recently, so that we can
+ * skip repeated prefetches. We also need to eventually forget these blocks
+ * as they may get evicted from memory (particularly page cache, which is
+ * outside our control).
+ *
+ * A simple queue is not a viable option - it would allow expiring requests
+ * based on age, but it's very expensive to check (as it requires linear
+ * search, and we need fairly large number of entries). Hash table does not
+ * work because it does not allow expiring entries by age.
+ *
+ * The cache does not need to be perfect - false positives/negatives are
+ * both acceptable, as long as the rate is reasonably low.
+ *
+ * We use a hybrid cache that is organized as many small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table of LRUs). The LRU caches are tiny (e.g. 8 entries), and the
+ * expiration happens at the level of a single LRU (using age determined
+ * by sequential request number).
+ *
+ * This allows quick searches and expiration, with false negatives (when a
+ * particular LRU has too many collisions with hot blocks, we may end up
+ * evicting entries that are more recent than entries in some other LRU).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * requests in total (these are the default parameters), representing about
+ * 8MB of data.
+ *
+ * If we want to check if a block was recently prefetched, we calculate
+ * (hash(blkno) % 128) and search only LRU at this index, using a linear
+ * search. If we want to add the block to the cache, we find either an
+ * empty slot or the "oldest" entry in the LRU, and store the block in it.
+ * If the block is already in the LRU, we only update the request number.
+ *
+ * The request age is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint32, so it should
+ * not wrap (we'd have to prefetch 32TB).
+ *
+ * If the block was last requested at most PREFETCH_CACHE_SIZE requests ago,
+ * it's considered "recently prefetched". That is, the maximum age is the
+ * same as the total capacity of the cache.
+ *
+ * Returns true if the block was recently prefetched (and thus we don't
+ * need to prefetch it again), or false (should do a prefetch).
+ *
+ * XXX It's a bit confusing these return values are inverse compared to
+ * what IndexPrefetchBlockIsSequential does.
+ *
+ * XXX Should we increase the prefetch counter even if we determine the
+ * entry was recently prefetched? Then we might skip some request numbers
+ * (there'd be no entry with them).
+ */
+static bool
+IndexPrefetchIsCached(IndexPrefetch *prefetch, BlockNumber block)
+{
+	IndexPrefetchCacheEntry *entry;
+
+	/* map the block number to the LRU */
+	int			lru;
+
+	/* age/index of the oldest entry in the LRU, to maybe use */
+	uint64		oldestRequest = PG_UINT64_MAX;
+	int			oldestIndex = -1;
+
+	/*
+	 * First add the block to the (tiny) queue and see if it's part of a
+	 * sequential pattern. In this case we just ignore the block and don't
+	 * prefetch it - we expect OS read-ahead to do a better job.
+	 *
+	 * XXX Maybe we should still add the block to the main cache, in case we
+	 * happen to access it later. That might help if we happen to scan a lot
+	 * of the table sequentially, and then randomly. Not sure that's very
+	 * likely with index access, though.
+	 */
+	if (IndexPrefetchBlockIsSequential(prefetch, block))
+	{
+		prefetch->countSkipSequential++;
+		return true;
+	}
+
+	/* Which LRU does this block belong to? */
+	lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+	/*
+	 * Did we prefetch this block recently? Scan the LRU linearly, and while
+	 * doing that, track the oldest (or empty) entry, so that we know where to
+	 * put the block if we don't find a match.
+	 */
+	for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+	{
+		entry = PREFETCH_LRU_ENTRY(prefetch, lru, i);
+
+		/*
+		 * Is this the oldest prefetch request in this LRU?
+		 *
+		 * Notice that request is uint32, so an empty entry (with request=0)
+		 * is automatically oldest one.
+		 */
+		if (entry->request < oldestRequest)
+		{
+			oldestRequest = entry->request;
+			oldestIndex = i;
+		}
+
+		/* Skip unused entries. */
+		if (entry->request == 0)
+			continue;
+
+		/* Is this entry for the same block as the current request? */
+		if (entry->block == block)
+		{
+			bool		prefetched;
+
+			/*
+			 * Is the old request sufficiently recent? If yes, we treat the
+			 * block as already prefetched. We need to check before updating
+			 * the prefetch request.
+			 *
+			 * XXX We do add the cache size to the request in order not to
+			 * have issues with underflows.
+			 */
+			prefetched = ((entry->request + PREFETCH_CACHE_SIZE) >= prefetch->prefetchRequest);
+
+			prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+			/* Update the request number. */
+			entry->request = ++prefetch->prefetchRequest;
+
+			return prefetched;
+		}
+	}
+
+	/*
+	 * We didn't find the block in the LRU, so store it in place of the
+	 * "oldest" prefetch request in this LRU (which might be an empty entry).
+	 */
+	Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+	entry = PREFETCH_LRU_ENTRY(prefetch, lru, oldestIndex);
+
+	entry->block = block;
+	entry->request = ++prefetch->prefetchRequest;
+
+	/* not in the prefetch cache */
+	return false;
+}
+
+/*
+ * IndexPrefetchHeapPage
+ *		Prefetch a heap page for the TID, unless it's sequential or was
+ *		recently prefetched.
+ */
+static void
+IndexPrefetchHeapPage(IndexScanDesc scan, IndexPrefetch *prefetch, IndexPrefetchEntry *entry)
+{
+	BlockNumber block = ItemPointerGetBlockNumber(&entry->tid);
+
+	prefetch->countAll++;
+
+	/*
+	 * Do not prefetch the same block over and over again, if it's probably
+	 * still in memory (page cache).
+	 *
+	 * This happens e.g. for clustered or naturally correlated indexes (fkey
+	 * to a sequence ID). It's not expensive (the block is in page cache
+	 * already, so no I/O), but it's not free either.
+	 *
+	 * If we make a mistake and prefetch a buffer that's still in our shared
+	 * buffers, PrefetchBuffer will take care of that. If it's in page cache,
+	 * we'll issue an unnecessary prefetch. There's not much we can do about
+	 * that, unfortunately.
+	 *
+	 * XXX Maybe we could check PrefetchBufferResult and adjust countPrefetch
+	 * based on that?
+	 */
+	if (IndexPrefetchIsCached(prefetch, block))
+		return;
+
+	prefetch->countPrefetch++;
+
+	PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+	pgBufferUsage.blks_prefetches++;
+}
+
+/*
+ * IndexPrefetchFillQueue
+ *		Fill the prefetch queue and issue necessary prefetch requests.
+ *
+ * If the prefetching is still active (enabled, not reached end of scan), read
+ * TIDs into the queue until we hit the current target.
+ *
+ * This also ramps-up the prefetch target from 0 to prefetch_max, determined
+ * when allocating the prefetcher.
+ */
+static void
+IndexPrefetchFillQueue(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	/* When inactive (not enabled or end of scan reached), we're done. */
+	if (!PREFETCH_ACTIVE(prefetch))
+		return;
+
+	/*
+	 * Ramp up the prefetch distance incrementally.
+	 *
+	 * Intentionally done first, before reading the TIDs into the queue, so
+	 * that there's always at least one item. Otherwise we might get into a
+	 * situation where we start with target=0 and no TIDs loaded.
+	 */
+	prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+								   prefetch->prefetchMaxTarget);
+
+	/*
+	 * Read TIDs from the index until the queue is full (with respect to the
+	 * current prefetch target).
+	 */
+	while (!PREFETCH_QUEUE_FULL(prefetch))
+	{
+		IndexPrefetchEntry *entry
+		= prefetch->next_cb(scan, direction, prefetch->data);
+
+		/* no more entries in this index scan */
+		if (entry == NULL)
+		{
+			prefetch->prefetchDone = true;
+			return;
+		}
+
+		Assert(ItemPointerEquals(&entry->tid, &scan->xs_heaptid));
+
+		/* store the entry and then maybe issue the prefetch request */
+		prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd++)] = *entry;
+
+		/* issue the prefetch request? */
+		if (entry->prefetch)
+			IndexPrefetchHeapPage(scan, prefetch, entry);
+	}
+}
+
+/*
+ * IndexPrefetchNextEntry
+ *		Get the next entry from the prefetch queue (or from the index directly).
+ *
+ * If prefetching is enabled, get next entry from the prefetch queue (unless
+ * queue is empty). With prefetching disabled, read an entry directly from the
+ * index scan.
+ *
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this? Although, that
+ * should be in the indexscan next_cb callback, probably.
+ *
+ * XXX If xs_heap_continue=true, we need to return the last TID.
+ */
+static IndexPrefetchEntry *
+IndexPrefetchNextEntry(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	IndexPrefetchEntry *entry = NULL;
+
+	/*
+	 * With prefetching enabled (even if we already finished reading all TIDs
+	 * from the index scan), we need to return a TID from the queue.
+	 * Otherwise, we just get the next TID from the scan directly.
+	 */
+	if (PREFETCH_ENABLED(prefetch))
+	{
+		/* Did we reach the end of the scan and the queue is empty? */
+		if (PREFETCH_DONE(prefetch))
+			return NULL;
+
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].tid;
+		entry->data = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)].data;
+
+		prefetch->queueIndex++;
+
+		scan->xs_heaptid = entry->tid;
+	}
+	else						/* not prefetching, just do the regular work  */
+	{
+		ItemPointer tid;
+
+		/* Time to fetch the next TID from the index */
+		tid = index_getnext_tid(scan, direction);
+
+		/* If we're out of index entries, we're done */
+		if (tid == NULL)
+			return NULL;
+
+		Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+		entry = palloc(sizeof(IndexPrefetchEntry));
+
+		entry->tid = scan->xs_heaptid;
+		entry->data = NULL;
+	}
+
+	return entry;
+}
+
+/*
+ * IndexPrefetchComputeTarget
+ *		Calculate prefetch distance for the given heap relation.
+ *
+ * We disable prefetching when using direct I/O (when there's no page cache
+ * to prefetch into), and scans where the prefetch distance may change (e.g.
+ * for scrollable cursors).
+ *
+ * In regular cases we look at effective_io_concurrency for the tablespace
+ * (of the heap, not the index), and cap it with plan_rows.
+ *
+ * XXX We cap the target to plan_rows, because it's pointless to prefetch
+ * more than we expect to use.
+ *
+ * XXX Maybe we should reduce the value with parallel workers?
+ */
+int
+IndexPrefetchComputeTarget(Relation heapRel, double plan_rows, bool prefetch)
+{
+	/*
+	 * No prefetching for direct I/O.
+	 *
+	 * XXX Shouldn't we do prefetching even for direct I/O? We would only
+	 * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
+	 * once the code starts loading into shared buffers, that'd work.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+		return 0;
+
+	/* disable prefetching (for cursors etc.) */
+	if (!prefetch)
+		return 0;
+
+	/* regular case, look at tablespace effective_io_concurrency */
+	return Min(get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace),
+			   plan_rows);
+}
+
+/*
+ * IndexPrefetchAlloc
+ *		Allocate the index prefetcher.
+ *
+ * The behavior is customized by two callbacks - next_cb, which generates TID
+ * values to put into the prefetch queue, and (optional) cleanup_cb which
+ * releases resources at the end.
+ *
+ * prefetch_max specifies the maximum prefetch distance, i.e. how many TIDs
+ * ahead to keep in the prefetch queue. prefetch_max=0 means prefetching is
+ * disabled.
+ *
+ * data may point to custom data associated with the prefetcher.
+ */
+IndexPrefetch *
+IndexPrefetchAlloc(IndexPrefetchNextCB next_cb, IndexPrefetchCleanupCB cleanup_cb,
+				   int prefetch_max, void *data)
+{
+	IndexPrefetch *prefetch = palloc0(sizeof(IndexPrefetch));
+
+	/* the next_cb callback is required */
+	Assert(next_cb);
+
+	/* valid prefetch distance */
+	Assert((prefetch_max >= 0) && (prefetch_max <= MAX_IO_CONCURRENCY));
+
+	prefetch->queueIndex = 0;
+	prefetch->queueStart = 0;
+	prefetch->queueEnd = 0;
+
+	prefetch->prefetchTarget = 0;
+	prefetch->prefetchMaxTarget = prefetch_max;
+
+	/*
+	 * Install the callbacks and private data. For example, IOS uses these to
+	 * check the visibility map once and keep the result for the executor.
+	 */
+	prefetch->next_cb = next_cb;
+	prefetch->cleanup_cb = cleanup_cb;
+	prefetch->data = data;
+
+	return prefetch;
+}
+
+/*
+ * IndexPrefetchNext
+ *		Read the next entry from the prefetch queue.
+ *
+ * Returns the next TID in the prefetch queue (which might have been prefetched
+ * sometime in the past). If needed, it adds more entries to the queue and does
+ * the prefetching for them.
+ *
+ * Returns IndexPrefetchEntry with the TID and optional data associated with
+ * the TID in the next_cb callback.
+ */
+IndexPrefetchEntry *
+IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *prefetch, ScanDirection direction)
+{
+	/* Do prefetching (if requested/enabled). */
+	IndexPrefetchFillQueue(scan, prefetch, direction);
+
+	/* Read the TID from the queue (or directly from the index). */
+	return IndexPrefetchNextEntry(scan, prefetch, direction);
+}
+
+/*
+ * IndexPrefetchReset
+ *		Reset the prefetch TID, restart the prefetching.
+ *
+ * Useful during rescans etc. This also resets the prefetch target, so that
+ * each rescan does the initial prefetch ramp-up from target=0 to maximum
+ * prefetch distance.
+ */
+void
+IndexPrefetchReset(IndexScanDesc scan, IndexPrefetch *state)
+{
+	if (!state)
+		return;
+
+	state->queueIndex = 0;
+	state->queueStart = 0;
+	state->queueEnd = 0;
+
+	state->prefetchDone = false;
+	state->prefetchTarget = 0;
+}
+
+/*
+ * IndexPrefetchStats
+ *		Log basic runtime debug stats of the prefetcher.
+ *
+ * FIXME Should be only in debug builds, or something like that.
+ */
+void
+IndexPrefetchStats(IndexScanDesc scan, IndexPrefetch *state)
+{
+	if (!state)
+		return;
+
+	elog(LOG, "index prefetch stats: requests %u prefetches %u (%f) skip cached %u sequential %u",
+		 state->countAll,
+		 state->countPrefetch,
+		 state->countPrefetch * 100.0 / state->countAll,
+		 state->countSkipCached,
+		 state->countSkipSequential);
+}
+
+/*
+ * IndexPrefetchEnd
+ *		Release resources associated with the prefetcher.
+ *
+ * This is primarily about the private data the caller might have allocated
+ * in the next_cb, and stored in the data field. We don't know what the
+ * data might contain (e.g. buffers etc.), requiring additional cleanup, so
+ * we call another custom callback.
+ *
+ * Needs to be called at the end of the executor node.
+ *
+ * XXX Maybe if there's no callback, we should just pfree the data? Does
+ * not seem very useful, though.
+ */
+void
+IndexPrefetchEnd(IndexScanDesc scan, IndexPrefetch *state)
+{
+	if (!state)
+		return;
+
+	if (!state->cleanup_cb)
+		return;
+
+	state->cleanup_cb(scan, state->data);
+}
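
For readers who want to experiment with the "hash table of small LRUs" idea described in the IndexPrefetchIsCached comment above, here is a minimal standalone sketch in plain C. It is not part of the patch and does not use any PostgreSQL headers: RecentCache, CacheEntry, recent_cache_init, recent_cache_check and toy_hash are hypothetical names invented for this illustration, and toy_hash is only a stand-in for hash_uint32(). The sizes (8 slots per LRU, 128 LRUs) mirror the defaults quoted in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LRU_SIZE   8                    /* slots in one LRU, as in the patch */
#define LRU_COUNT  128                  /* number of LRUs, as in the patch */
#define CACHE_SIZE (LRU_SIZE * LRU_COUNT)

/* Hypothetical entry type: a block number plus the request that touched it. */
typedef struct
{
    uint32_t    block;
    uint64_t    request;                /* 0 means the slot is empty */
} CacheEntry;

/* Hypothetical cache type: one flat array, viewed as LRU_COUNT small LRUs. */
typedef struct
{
    uint64_t    counter;                /* incremented on every recorded request */
    CacheEntry  slots[CACHE_SIZE];
} RecentCache;

/* Assumed stand-in for hash_uint32(): a simple multiplicative hash. */
static uint32_t
toy_hash(uint32_t v)
{
    return (v * 2654435761u) >> 16;
}

static void
recent_cache_init(RecentCache *c)
{
    memset(c, 0, sizeof(*c));
}

/*
 * Returns true if the block was recorded within the last CACHE_SIZE
 * requests. The block is recorded (or its age refreshed) in either case.
 */
static bool
recent_cache_check(RecentCache *c, uint32_t block)
{
    CacheEntry *lru;
    int         oldest = 0;

    assert(c != NULL);

    /* pick the small LRU this block hashes to */
    lru = &c->slots[(toy_hash(block) % LRU_COUNT) * LRU_SIZE];

    for (int i = 0; i < LRU_SIZE; i++)
    {
        /* track the oldest slot; empty slots (request 0) win automatically */
        if (lru[i].request < lru[oldest].request)
            oldest = i;

        if (lru[i].request != 0 && lru[i].block == block)
        {
            /* check the age before refreshing the request number */
            bool        recent = (lru[i].request + CACHE_SIZE >= c->counter);

            lru[i].request = ++c->counter;
            return recent;
        }
    }

    /* not found: overwrite the oldest (or empty) slot in this LRU */
    lru[oldest].block = block;
    lru[oldest].request = ++c->counter;
    return false;
}
```

A caller would initialize the cache once with recent_cache_init() and skip the prefetch whenever recent_cache_check() returns true. Because expiration is decided per LRU, a block can be forgotten earlier than CACHE_SIZE requests ago if its LRU is crowded, which is the false-negative case the comment in the patch accepts.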
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 268ae8a945f..8fda8694350 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->shared_blk_read_time, add->shared_blk_read_time);
 	INSTR_TIME_ADD(dst->shared_blk_write_time, add->shared_blk_write_time);
 	INSTR_TIME_ADD(dst->local_blk_read_time, add->local_blk_read_time);
@@ -259,6 +261,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
 						  add->shared_blk_read_time, sub->shared_blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2c2c9c10b57..fce10ea6518 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -36,6 +36,7 @@
 #include "access/tupdesc.h"
 #include "access/visibilitymap.h"
 #include "executor/execdebug.h"
+#include "executor/executor.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
@@ -44,11 +45,14 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-
 static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
 							TupleDesc itupdesc);
-
+static IndexPrefetchEntry *IndexOnlyPrefetchNext(IndexScanDesc scan,
+												 ScanDirection direction,
+												 void *data);
+static void IndexOnlyPrefetchCleanup(IndexScanDesc scan,
+									 void *data);
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -65,6 +69,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	IndexPrefetch *prefetch;
+	IndexPrefetchEntry *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -78,11 +84,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->ioss_ScanDesc;
+	prefetch = node->ioss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
 	if (scandesc == NULL)
 	{
+		int	prefetch_max;
+
 		/*
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
@@ -111,15 +120,39 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_NumScanKeys,
 						 node->ioss_OrderByKeys,
 						 node->ioss_NumOrderByKeys);
+
+		/*
+		 * Also initialize index prefetcher. We do this even when prefetching is
+		 * not done (see IndexPrefetchComputeTarget), because the prefetcher is
+		 * used for all index reads.
+		 *
+		 * XXX Maybe we should reduce the target in case this is a parallel index
+		 * scan. We don't want to issue a multiple of effective_io_concurrency.
+		 *
+		 * XXX Maybe rename the object to "index reader" or something?
+		 */
+		prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
+												  node->ss.ps.plan->plan_rows,
+												  estate->es_use_prefetching);
+
+		node->ioss_prefetch = IndexPrefetchAlloc(IndexOnlyPrefetchNext,
+												 IndexOnlyPrefetchCleanup,
+												 prefetch_max,
+												 palloc0(sizeof(Buffer)));
 	}
 
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
+		bool	   *all_visible = NULL;
 		bool		tuple_from_heap = false;
 
+		/* unpack the entry */
+		tid = &entry->tid;
+		all_visible = (bool *) entry->data; /* result of visibility check */
+
 		CHECK_FOR_INTERRUPTS();
 
 		/*
@@ -155,8 +188,12 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 *
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
+		 *
+		 * XXX Skip if we already know the page is all visible from
+		 * prefetcher.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+		if (!(all_visible && *all_visible) &&
+			!VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
 		{
@@ -353,6 +390,9 @@ ExecReScanIndexOnlyScan(IndexOnlyScanState *node)
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 
+	/* also reset the prefetcher, so that we start from scratch */
+	IndexPrefetchReset(node->ioss_ScanDesc, node->ioss_prefetch);
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -380,6 +420,12 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
 		node->ioss_VMBuffer = InvalidBuffer;
 	}
 
+	/* XXX Print some debug stats. Should be removed. */
+	IndexPrefetchStats(indexScanDesc, node->ioss_prefetch);
+
+	/* Release VM buffer pin from prefetcher, if any. */
+	IndexPrefetchEnd(indexScanDesc, node->ioss_prefetch);
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -715,3 +761,62 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
 					 node->ioss_ScanKeys, node->ioss_NumScanKeys,
 					 node->ioss_OrderByKeys, node->ioss_NumOrderByKeys);
 }
+
+/*
+ * When prefetching for IOS, we want to only prefetch pages that are not
+ * marked as all-visible (because not fetching all-visible pages is the
+ * point of IOS).
+ *
+ * The buffer used by the VM_ALL_VISIBLE() check is reused, similarly to
+ * ioss_VMBuffer (maybe we could/should use it here too?). We also keep
+ * the result of the all_visible flag, so that the main loop does not need to
+ * do it again.
+ */
+static IndexPrefetchEntry *
+IndexOnlyPrefetchNext(IndexScanDesc scan, ScanDirection direction, void *data)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer tid;
+
+	Assert(data);
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+		bool		all_visible = VM_ALL_VISIBLE(scan->heapRelation,
+												 blkno,
+												 (Buffer *) data);
+
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch only if not all visible */
+		entry->prefetch = !all_visible;
+
+		/* store the all_visible flag in the private part of the entry */
+		entry->data = palloc(sizeof(bool));
+		*(bool *) entry->data = all_visible;
+	}
+
+	return entry;
+}
+
+/*
+ * For IOS, we may have a VM buffer in the private data, so make sure to
+ * release it properly.
+ */
+static void
+IndexOnlyPrefetchCleanup(IndexScanDesc scan, void *data)
+{
+	Buffer	   *buffer = (Buffer *) data;
+
+	Assert(data);
+
+	if (*buffer != InvalidBuffer)
+	{
+		ReleaseBuffer(*buffer);
+		*buffer = InvalidBuffer;
+	}
+}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a946..0548403dc50 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -34,6 +34,7 @@
 #include "access/tableam.h"
 #include "catalog/pg_am.h"
 #include "executor/execdebug.h"
+#include "executor/executor.h"
 #include "executor/nodeIndexscan.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
@@ -69,6 +70,9 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
 							  Datum *orderbyvals, bool *orderbynulls);
 static HeapTuple reorderqueue_pop(IndexScanState *node);
 
+static IndexPrefetchEntry *IndexScanPrefetchNext(IndexScanDesc scan,
+												 ScanDirection direction,
+												 void *data);
 
 /* ----------------------------------------------------------------
  *		IndexNext
@@ -85,6 +89,8 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	IndexPrefetch *prefetch;
+	IndexPrefetchEntry *entry;
 
 	/*
 	 * extract necessary information from index scan node
@@ -98,11 +104,14 @@ IndexNext(IndexScanState *node)
 	direction = ScanDirectionCombine(estate->es_direction,
 									 ((IndexScan *) node->ss.ps.plan)->indexorderdir);
 	scandesc = node->iss_ScanDesc;
+	prefetch = node->iss_prefetch;
 	econtext = node->ss.ps.ps_ExprContext;
 	slot = node->ss.ss_ScanTupleSlot;
 
 	if (scandesc == NULL)
 	{
+		int prefetch_max;
+
 		/*
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
@@ -123,15 +132,43 @@ IndexNext(IndexScanState *node)
 			index_rescan(scandesc,
 						 node->iss_ScanKeys, node->iss_NumScanKeys,
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		/*
+		 * Also initialize index prefetcher. We do this even when prefetching is
+		 * not done (see IndexPrefetchComputeTarget), because the prefetcher is
+		 * used for all index reads.
+		 *
+		 * XXX Maybe we should reduce the target in case this is a parallel index
+		 * scan. We don't want to issue a multiple of effective_io_concurrency.
+		 *
+		 * XXX Maybe rename the object to "index reader" or something?
+		 */
+		prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
+												  node->ss.ps.plan->plan_rows,
+												  estate->es_use_prefetching);
+
+		node->iss_prefetch = IndexPrefetchAlloc(IndexScanPrefetchNext,
+												NULL, /* no extra cleanup */
+												prefetch_max,
+												NULL);
 	}
 
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while ((entry = IndexPrefetchNext(scandesc, prefetch, direction)) != NULL)
 	{
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scandesc->xs_heaptid));
+		if (!index_fetch_heap(scandesc, slot))
+			continue;
+
 		/*
 		 * If the index was lossy, we have to recheck the index quals using
 		 * the fetched tuple.
@@ -588,6 +625,9 @@ ExecReScanIndexScan(IndexScanState *node)
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 	node->iss_ReachedEnd = false;
 
+	/* also reset the prefetcher, so that we start from scratch */
+	IndexPrefetchReset(node->iss_ScanDesc, node->iss_prefetch);
+
 	ExecScanReScan(&node->ss);
 }
 
@@ -794,6 +834,9 @@ ExecEndIndexScan(IndexScanState *node)
 	indexRelationDesc = node->iss_RelationDesc;
 	indexScanDesc = node->iss_ScanDesc;
 
+	/* XXX Print some debug stats. Should be removed. */
+	IndexPrefetchStats(indexScanDesc, node->iss_prefetch);
+
 	/*
 	 * close the index relation (no-op if we didn't open it)
 	 */
@@ -1728,3 +1771,26 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 					 node->iss_ScanKeys, node->iss_NumScanKeys,
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 }
+
+/*
+ * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot,
+ * maybe nodeIndexscan needs to do something more to handle this?
+ */
+static IndexPrefetchEntry *
+IndexScanPrefetchNext(IndexScanDesc scan, ScanDirection direction, void *data)
+{
+	IndexPrefetchEntry *entry = NULL;
+	ItemPointer tid;
+
+	if ((tid = index_getnext_tid(scan, direction)) != NULL)
+	{
+		entry = palloc0(sizeof(IndexPrefetchEntry));
+
+		entry->tid = *tid;
+
+		/* prefetch always */
+		entry->prefetch = true;
+	}
+
+	return entry;
+}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5e8c335a737..e792c3fc8d8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -677,4 +677,56 @@ extern ResultRelInfo *ExecLookupResultRelByOid(ModifyTableState *node,
 											   bool missing_ok,
 											   bool update_cache);
 
+/*
+ * prototypes for functions in execPrefetch.c
+ */
+
+typedef struct IndexPrefetchEntry
+{
+	ItemPointerData tid;
+
+	/* should we prefetch heap page for this TID? */
+	bool		prefetch;
+
+	/*
+	 * If a callback is specified, it may store per-tid information. The data
+	 * has to be a single palloc-ed piece of data, so that it can be easily
+	 * pfreed.
+	 *
+	 * XXX We could relax this by providing another cleanup callback, but that
+	 * seems unnecessarily complex - we expect the information to be very
+	 * simple, like bool flags or something. Easy to do in a simple struct,
+	 * and perhaps even reuse without pfree/palloc.
+	 */
+	void	   *data;
+} IndexPrefetchEntry;
+
+/*
+ * custom callback, allowing the user code to determine which TID to read
+ *
+ * If there is no TID to prefetch, the return value is expected to be NULL.
+ *
+ * Otherwise the "tid" field is expected to contain the TID to prefetch, and
+ * "data" may be set to custom information the callback needs to pass outside.
+ */
+typedef IndexPrefetchEntry *(*IndexPrefetchNextCB) (IndexScanDesc scan,
+													ScanDirection direction,
+													void *data);
+
+typedef void (*IndexPrefetchCleanupCB) (IndexScanDesc scan,
+										void *data);
+
+IndexPrefetch *IndexPrefetchAlloc(IndexPrefetchNextCB next_cb,
+								  IndexPrefetchCleanupCB cleanup_cb,
+								  int prefetch_max, void *data);
+
+IndexPrefetchEntry *IndexPrefetchNext(IndexScanDesc scan, IndexPrefetch *state,
+									  ScanDirection direction);
+
+extern void IndexPrefetchReset(IndexScanDesc scan, IndexPrefetch *state);
+extern void IndexPrefetchStats(IndexScanDesc scan, IndexPrefetch *state);
+extern void IndexPrefetchEnd(IndexScanDesc scan, IndexPrefetch *state);
+
+extern int	IndexPrefetchComputeTarget(Relation heapRel, double plan_rows, bool prefetch);
+
 #endif							/* EXECUTOR_H  */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bfd7b6d8445..fadeb389495 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
 	int64		local_blks_written; /* # of local disk blocks written */
 	int64		temp_blks_read; /* # of temp blocks read */
 	int64		temp_blks_written;	/* # of temp blocks written */
+	int64		blks_prefetch_rounds;	/* # of prefetch rounds */
+	int64		blks_prefetches;	/* # of buffers prefetched */
 	instr_time	shared_blk_read_time;	/* time spent reading shared blocks */
 	instr_time	shared_blk_write_time;	/* time spent writing shared blocks */
 	instr_time	local_blk_read_time;	/* time spent reading local blocks */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 561fdd98f1b..141db5d4ae2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -690,6 +690,7 @@ typedef struct EState
 	struct EPQState *es_epq_active;
 
 	bool		es_use_parallel_mode;	/* can we use parallel workers? */
+	bool		es_use_prefetching; /* can we use prefetching? */
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
@@ -1529,6 +1530,9 @@ typedef struct
 	bool	   *elem_nulls;		/* array of num_elems is-null flags */
 } IndexArrayKeyInfo;
 
+/* needs to be before IndexPrefetchCallback typedef */
+typedef struct IndexPrefetch IndexPrefetch;
+
 /* ----------------
  *	 IndexScanState information
  *
@@ -1580,6 +1584,9 @@ typedef struct IndexScanState
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
 	Size		iss_PscanLen;
+
+	/* prefetching */
+	IndexPrefetch *iss_prefetch;
 } IndexScanState;
 
 /* ----------------
@@ -1618,6 +1625,9 @@ typedef struct IndexOnlyScanState
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
 	Size		ioss_PscanLen;
+
+	/* prefetching */
+	IndexPrefetch *ioss_prefetch;
 } IndexOnlyScanState;
 
 /* ----------------
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f582eb59e7d..9d194ec2715 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1183,6 +1183,9 @@ IndexOnlyScanState
 IndexOptInfo
 IndexOrderByDistance
 IndexPath
+IndexPrefetch
+IndexPrefetchCacheEntry
+IndexPrefetchEntry
 IndexRuntimeKeyInfo
 IndexScan
 IndexScanDesc
-- 
2.43.0




* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2024-01-12 16:52               ` Robert Haas <[email protected]>
  1 sibling, 0 replies; 25+ messages in thread

From: Robert Haas @ 2024-01-12 16:52 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

Not a full response, but just to address a few points:

On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
<[email protected]> wrote:
> Thinking about this, I think it should be possible to make prefetching
> work even for plans with execute_once=false. In particular, when the
> plan changes direction it should be possible to simply "walk back" the
> prefetch queue, to get to the "correct" place in the scan. But I'm
> not sure it's worth it, because plans that change direction often can't
> really benefit from prefetches anyway - they'll often visit stuff they
> accessed shortly before anyway. For plans that don't change direction
> but may pause, we don't know if the plan pauses long enough for the
> prefetched pages to get evicted or something. So I think it's OK that
> execute_once=false means no prefetching.

+1.

> > +             * XXX We do add the cache size to the request in order not to
> > +             * have issues with uint64 underflows.
> >
> > I don't know what this means.
> >
>
> There's a check that does this:
>
>       (x + PREFETCH_CACHE_SIZE) >= y
>
> it might also be done as "mathematically equivalent"
>
>       x >= (y - PREFETCH_CACHE_SIZE)
>
> but if the "y" is an uint64, and the value is smaller than the constant,
> this would underflow. It'd eventually disappear, once the "y" gets large
> enough, ofc.

The problem is, I think, that there's no particular reason that
someone reading the existing code should imagine that it might have
been done in that "mathematically equivalent" fashion. I imagined that
you were trying to make a point about adding the cache size to the
request vs. adding nothing, whereas in reality you were trying to make
a point about adding from one side vs. subtracting from the other.
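
To make the distinction concrete, here is a minimal standalone sketch
(not code from the patch) of the two forms of the check. The function
names and the value chosen for PREFETCH_CACHE_SIZE are assumptions made
only for illustration; "x" and "y" are the same placeholders used in the
quoted text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative value only; the constant's real value is not shown here. */
#define PREFETCH_CACHE_SIZE 1024

/* The form used in the patch: add the cache size on the left-hand side. */
bool
check_by_addition(uint64_t x, uint64_t y)
{
	return (x + PREFETCH_CACHE_SIZE) >= y;
}

/*
 * The "mathematically equivalent" form. With unsigned arithmetic the
 * subtraction wraps around whenever y < PREFETCH_CACHE_SIZE, so the
 * right-hand side becomes a huge number and the check wrongly fails.
 */
bool
check_by_subtraction(uint64_t x, uint64_t y)
{
	return x >= (y - PREFETCH_CACHE_SIZE);
}
```

With x = 5 and y = 10 the first form is true and the second is false;
once y is at least PREFETCH_CACHE_SIZE the two agree. A comment that says
"add on the left rather than subtract on the right" would get that across.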

> > +     * We reach here if the index only scan is not parallel, or if we're
> > +     * serially executing an index only scan that was planned to be
> > +     * parallel.
> >
> > Well, this seems sad.
>
> Stale comment, I believe. However, I didn't see much benefit with
> parallel index scan during testing. Having I/O from multiple workers
> generally had the same effect, I think.

Fair point, likely worth mentioning explicitly in the comment.

> Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
> to keep it consistent. And then the constants are all capital, ofc.

It'd still be nice to get table or heap in there, IMHO, but maybe we
can't, and consistency is certainly a good thing regardless of the
details, so thanks for that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com







* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2024-01-19 21:43               ` Melanie Plageman <[email protected]>
  2024-01-22 04:53                 ` Re: index prefetching Peter Smith <[email protected]>
  2024-01-23 17:43                 ` Re: index prefetching Tomas Vondra <[email protected]>
  1 sibling, 2 replies; 25+ messages in thread

From: Melanie Plageman @ 2024-01-19 21:43 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
<[email protected]> wrote:
>
> On 1/9/24 21:31, Robert Haas wrote:
> > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> > <[email protected]> wrote:
> >> Here's a somewhat reworked version of the patch. My initial goal was to
> >> see if it could adopt the StreamingRead API proposed in [1], but that
> >> turned out to be less straight-forward than I hoped, for two reasons:
> >
> > I guess we need Thomas or Andres or maybe Melanie to comment on this.
> >
>
> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
> streaming I/O stuff.

I've been studying your patch with the intent of finding a way to
change it and/or the streaming read API to work together. I've
attached a very rough sketch of how I think it could work.

We fill a queue with blocks from TIDs that we fetched from the index.
The queue is saved in a scan descriptor that is made available to the
streaming read callback. Once the queue is full, we invoke the table
AM specific index_fetch_tuple() function which calls
pg_streaming_read_buffer_get_next(). When the streaming read API
invokes the callback we registered, it simply dequeues a block number
for prefetching. The only change to the streaming read API is that
now, even if the callback returns InvalidBlockNumber, we may not be
finished, so make it resumable.
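
As a rough standalone model of that control flow (this is not the code
in the attached patch, and it uses no PostgreSQL headers), the sketch
below shows a callback that dequeues from a one-slot queue and a stream
that only marks itself finished when it is not resumable. All type and
function names here (Stream, BlockQueue, stream_next, and so on) are
hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNo;

/* Stand-in for InvalidBlockNumber. */
#define INVALID_BLOCK UINT32_MAX

typedef BlockNo (*NextBlockCB) (void *private_data);

/* Minimal model of the stream state relevant to the "resumable" change. */
typedef struct Stream
{
	NextBlockCB callback;
	void	   *private_data;
	bool		resumable;
	bool		finished;
} Stream;

/* One-slot "queue", mirroring the single saved block number. */
typedef struct BlockQueue
{
	BlockNo		slot;
} BlockQueue;

/* Callback: hand out the queued block, leaving the queue empty. */
BlockNo
queue_dequeue_cb(void *private_data)
{
	BlockQueue *q = private_data;
	BlockNo		result = q->slot;

	q->slot = INVALID_BLOCK;
	return result;
}

/*
 * Ask the callback for the next block. An empty queue ends a
 * non-resumable stream for good; a resumable stream can be asked again
 * after the caller refills the queue.
 */
BlockNo
stream_next(Stream *s)
{
	BlockNo		blk;

	if (s->finished)
		return INVALID_BLOCK;

	blk = s->callback(s->private_data);
	if (blk == INVALID_BLOCK && !s->resumable)
		s->finished = true;

	return blk;
}
```

The point of the resumable flag in this model is only that "nothing
queued right now" and "end of scan" become distinguishable; the caller
still has to track the real end of the index scan itself.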

Structurally, this changes the timing of when the heap blocks are
prefetched. Your code would get a tid from the index and then prefetch
the heap block -- doing this until it filled a queue that had the
actual tids saved in it. With my approach and the streaming read API,
you fetch tids from the index until you've filled up a queue of block
numbers. Then the streaming read API will prefetch those heap blocks.

I didn't actually implement the block queue -- I just saved a single
block number and pretended it was a block queue. I was imagining we
replace this with something like your IndexPrefetch->blockItems --
which has light deduplication. We'd probably have to flesh it out more
than that.
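
For discussion, a real block queue might look something like the
standalone sketch below: a small ring buffer that skips a block number
equal to the one most recently enqueued. This is only my guess at what
"light deduplication" could mean here; the names, the queue size, and
the dedup rule are all assumptions and are not taken from either patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative capacity and sentinel; both are assumptions. */
#define BLOCK_QUEUE_SIZE 8
#define INVALID_BLOCK UINT32_MAX

typedef struct BlockQueue
{
	uint32_t	items[BLOCK_QUEUE_SIZE];
	int			head;			/* index of the oldest entry */
	int			count;			/* number of entries in use */
	uint32_t	last_enqueued;	/* most recently stored block */
} BlockQueue;

void
block_queue_init(BlockQueue *q)
{
	q->head = 0;
	q->count = 0;
	q->last_enqueued = INVALID_BLOCK;
}

bool
block_queue_full(const BlockQueue *q)
{
	return q->count == BLOCK_QUEUE_SIZE;
}

/*
 * Store blkno unless it repeats the previous block or the queue is
 * full. Returns true only if the block was actually stored.
 */
bool
block_queue_enqueue(BlockQueue *q, uint32_t blkno)
{
	if (blkno == q->last_enqueued || block_queue_full(q))
		return false;

	q->items[(q->head + q->count) % BLOCK_QUEUE_SIZE] = blkno;
	q->count++;
	q->last_enqueued = blkno;
	return true;
}

/* Remove and return the oldest block, or INVALID_BLOCK if empty. */
uint32_t
block_queue_dequeue(BlockQueue *q)
{
	uint32_t	result;

	if (q->count == 0)
		return INVALID_BLOCK;

	result = q->items[q->head];
	q->head = (q->head + 1) % BLOCK_QUEUE_SIZE;
	q->count--;
	return result;
}
```

If blocks are deduplicated like this, the consumer presumably has to
know that consecutive TIDs on the same block share one buffer, which is
part of what would need fleshing out.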

There are also table AM layering violations in my sketch which would
have to be worked out (not to mention some resource leakage I didn't
bother investigating [which causes it to fail tests]).

0001 is all of Thomas' streaming read API code that isn't yet in
master and 0002 is my rough sketch of index prefetching using the
streaming read API.

There are also numerous optimizations that your index prefetching
patch set does that would need to be added in some way. I haven't
thought much about it yet. I wanted to see what you thought of this
approach first. Basically, is it workable?

- Melanie

From 31a0b829b3aca31542dc3236b408f1e86133aea7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 19 Jan 2024 16:10:30 -0500
Subject: [PATCH v1 2/2] use streaming reads in index scan

---
 src/backend/access/heap/heapam_handler.c | 14 +++-
 src/backend/access/index/indexam.c       |  2 +
 src/backend/executor/nodeIndexscan.c     | 83 ++++++++++++++++++++----
 src/backend/storage/aio/streaming_read.c | 10 ++-
 src/include/access/relscan.h             |  6 ++
 src/include/storage/streaming_read.h     |  2 +
 6 files changed, 101 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..0ef5f824546 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -127,9 +127,17 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr)
+		{
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (!BufferIsValid(hscan->xs_cbuf))
+				return false;
+		}
+		else
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+													hscan->xs_base.rel,
+													ItemPointerGetBlockNumber(tid));
+
 
 		/*
 		 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 63dff101e29..c118cc3861f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -237,6 +237,8 @@ index_beginscan(Relation heapRelation,
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+	scan->index_done = false;
+	scan->xs_heapfetch->blk_queue = InvalidBlockNumber;
 
 	return scan;
 }
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a946..41437faff06 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -77,6 +77,33 @@ static HeapTuple reorderqueue_pop(IndexScanState *node);
  *		using the index specified in the IndexScanState information.
  * ----------------------------------------------------------------
  */
+
+#define QUEUE_FULL(q) ((q) != InvalidBlockNumber)
+
+static void
+blk_enqueue(BlockNumber blkno, BlockNumber *blk_queue)
+{
+	Assert(*blk_queue == InvalidBlockNumber);
+	*blk_queue = blkno;
+}
+
+static BlockNumber
+blk_dequeue(BlockNumber *blk_queue)
+{
+	BlockNumber result = *blk_queue;
+	*blk_queue = InvalidBlockNumber;
+	return result;
+}
+
+
+static BlockNumber
+index_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
+{
+	IndexFetchTableData *scan = (IndexFetchTableData *) pgsr_private;
+	return blk_dequeue(&scan->blk_queue);
+}
+
+
 static TupleTableSlot *
 IndexNext(IndexScanState *node)
 {
@@ -123,31 +150,63 @@ IndexNext(IndexScanState *node)
 			index_rescan(scandesc,
 						 node->iss_ScanKeys, node->iss_NumScanKeys,
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		/* TODO: can't put this here because it is not table-AM agnostic */
+		scandesc->xs_heapfetch->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+													scandesc->xs_heapfetch,
+													0,
+													NULL,
+													BMR_REL(scandesc->heapRelation),
+													MAIN_FORKNUM,
+													index_pgsr_next_single);
+
+		pg_streaming_read_set_resumable(scandesc->xs_heapfetch->pgsr);
 	}
 
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+
+	while (true)
 	{
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * If the index was lossy, we have to recheck the index quals using
-		 * the fetched tuple.
-		 */
-		if (scandesc->xs_recheck)
+		if (index_fetch_heap(scandesc, slot))
 		{
-			econtext->ecxt_scantuple = slot;
-			if (!ExecQualAndReset(node->indexqualorig, econtext))
+			/*
+			* If the index was lossy, we have to recheck the index quals using
+			* the fetched tuple.
+			*/
+			if (scandesc->xs_recheck)
 			{
-				/* Fails recheck, so drop it and loop back for another */
-				InstrCountFiltered2(node, 1);
-				continue;
+				econtext->ecxt_scantuple = slot;
+				if (!ExecQualAndReset(node->indexqualorig, econtext))
+				{
+					/* Fails recheck, so drop it and loop back for another */
+					InstrCountFiltered2(node, 1);
+					continue;
+				}
 			}
+
+			return slot;
 		}
 
-		return slot;
+		if (scandesc->index_done)
+			break;
+
+		Assert(!QUEUE_FULL(scandesc->xs_heapfetch->blk_queue));
+		do
+		{
+			ItemPointer tid = index_getnext_tid(scandesc, direction);
+
+			if (!tid)
+			{
+				scandesc->index_done = true;
+				break;
+			}
+
+			blk_enqueue(ItemPointerGetBlockNumber(tid), &scandesc->xs_heapfetch->blk_queue);
+		} while (!QUEUE_FULL(scandesc->xs_heapfetch->blk_queue));
 	}
 
 	/*
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fea..6465963f837 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -34,6 +34,7 @@ struct PgStreamingRead
 	int			pinned_buffers_trigger;
 	int			next_tail_buffer;
 	bool		finished;
+	bool		resumable;
 	void	   *pgsr_private;
 	PgStreamingReadBufferCB callback;
 	BufferAccessStrategy strategy;
@@ -292,7 +293,8 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
 		if (blocknum == InvalidBlockNumber)
 		{
-			pgsr->finished = true;
+			if (!pgsr->resumable)
+				pgsr->finished = true;
 			break;
 		}
 		bmr = pgsr->bmr;
@@ -433,3 +435,9 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 		pfree(pgsr->per_buffer_data);
 	pfree(pgsr);
 }
+
+void
+pg_streaming_read_set_resumable(PgStreamingRead *pgsr)
+{
+	pgsr->resumable = true;
+}
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..d476cb206d5 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -18,6 +18,7 @@
 #include "access/itup.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/streaming_read.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
 
@@ -104,6 +105,8 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	PgStreamingRead *pgsr;
+	BlockNumber blk_queue;
 } IndexFetchTableData;
 
 /*
@@ -162,6 +165,9 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	bool	index_done;
+
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c541..2288b7b5eb0 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -42,4 +42,6 @@ extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
 
+extern void pg_streaming_read_set_resumable(PgStreamingRead *pgsr);
+
 #endif
-- 
2.37.2


From f6cb591ba520351ab7f0e7cbf9d6df3dacda6b44 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  An
+	 * assumed I/O is advice to the kernel that we will soon read a range of
+	 * blocks.  This number also affects how far we look ahead for
+	 * opportunities to start more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a
+	 * full sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0.  We want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that
+ * we'll soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue advice
+	 * to the kernel to get the read started.  We suppress it if the access
+	 * pattern appears to be completely sequential, though, because on some
+	 * systems that interferes with the kernel's own sequential read ahead
+	 * heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to
+			 * form another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers
+ * must cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29fd1cae641..018ebbcbaae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2
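
The MAX_BUFFERS_PER_TRANSFER definition in the bufmgr.h hunk above is
Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ). A minimal standalone sketch of that
arithmetic follows. The helper name max_buffers_per_transfer() is
hypothetical and not part of the patch, and the vector limit is passed in as
a plain parameter rather than read from any PostgreSQL header.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch) mirroring
 * Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ): the number of buffers per
 * transfer is capped both by a 128kB transfer size and by the maximum
 * number of I/O vectors.
 */
static int
max_buffers_per_transfer(int pg_iov_max, int blcksz)
{
	int			by_size;

	assert(pg_iov_max > 0 && blcksz > 0);

	/* How many blocks of this size fit into 128kB? */
	by_size = (128 * 1024) / blcksz;

	/* The vector limit may impose a smaller maximum. */
	return by_size < pg_iov_max ? by_size : pg_iov_max;
}
```

With 8kB blocks and an assumed limit of 32 vectors, the size cap wins and the
result is 16. With 1kB blocks the size cap would be 128, so the vector limit
wins instead.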



Attachments:

  [text/plain] 0002-use-streaming-reads-in-index-scan.txt (7.3K, 2-0002-use-streaming-reads-in-index-scan.txt)
From 31a0b829b3aca31542dc3236b408f1e86133aea7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 19 Jan 2024 16:10:30 -0500
Subject: [PATCH v1 2/2] use streaming reads in index scan

---
 src/backend/access/heap/heapam_handler.c | 14 +++-
 src/backend/access/index/indexam.c       |  2 +
 src/backend/executor/nodeIndexscan.c     | 83 ++++++++++++++++++++----
 src/backend/storage/aio/streaming_read.c | 10 ++-
 src/include/access/relscan.h             |  6 ++
 src/include/storage/streaming_read.h     |  2 +
 6 files changed, 101 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..0ef5f824546 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -127,9 +127,17 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr)
+		{
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (!BufferIsValid(hscan->xs_cbuf))
+				return false;
+		}
+		else
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+													hscan->xs_base.rel,
+													ItemPointerGetBlockNumber(tid));
+
 
 		/*
 		 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 63dff101e29..c118cc3861f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -237,6 +237,8 @@ index_beginscan(Relation heapRelation,
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+	scan->index_done = false;
+	scan->xs_heapfetch->blk_queue = InvalidBlockNumber;
 
 	return scan;
 }
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a946..41437faff06 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -77,6 +77,33 @@ static HeapTuple reorderqueue_pop(IndexScanState *node);
  *		using the index specified in the IndexScanState information.
  * ----------------------------------------------------------------
  */
+
+#define QUEUE_FULL(q) ((q) != InvalidBlockNumber)
+
+static void
+blk_enqueue(BlockNumber blkno, BlockNumber *blk_queue)
+{
+	Assert(*blk_queue == InvalidBlockNumber);
+	*blk_queue = blkno;
+}
+
+static BlockNumber
+blk_dequeue(BlockNumber *blk_queue)
+{
+	BlockNumber result = *blk_queue;
+	*blk_queue = InvalidBlockNumber;
+	return result;
+}
+
+
+static BlockNumber
+index_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
+{
+	IndexFetchTableData *scan = (IndexFetchTableData *) pgsr_private;
+	return blk_dequeue(&scan->blk_queue);
+}
+
+
 static TupleTableSlot *
 IndexNext(IndexScanState *node)
 {
@@ -123,31 +150,63 @@ IndexNext(IndexScanState *node)
 			index_rescan(scandesc,
 						 node->iss_ScanKeys, node->iss_NumScanKeys,
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		/* TODO: this can't stay here because it is not AM-agnostic */
+		scandesc->xs_heapfetch->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+													scandesc->xs_heapfetch,
+													0,
+													NULL,
+													BMR_REL(scandesc->heapRelation),
+													MAIN_FORKNUM,
+													index_pgsr_next_single);
+
+		pg_streaming_read_set_resumable(scandesc->xs_heapfetch->pgsr);
 	}
 
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+
+	while (true)
 	{
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * If the index was lossy, we have to recheck the index quals using
-		 * the fetched tuple.
-		 */
-		if (scandesc->xs_recheck)
+		if (index_fetch_heap(scandesc, slot))
 		{
-			econtext->ecxt_scantuple = slot;
-			if (!ExecQualAndReset(node->indexqualorig, econtext))
+			/*
+			* If the index was lossy, we have to recheck the index quals using
+			* the fetched tuple.
+			*/
+			if (scandesc->xs_recheck)
 			{
-				/* Fails recheck, so drop it and loop back for another */
-				InstrCountFiltered2(node, 1);
-				continue;
+				econtext->ecxt_scantuple = slot;
+				if (!ExecQualAndReset(node->indexqualorig, econtext))
+				{
+					/* Fails recheck, so drop it and loop back for another */
+					InstrCountFiltered2(node, 1);
+					continue;
+				}
 			}
+
+			return slot;
 		}
 
-		return slot;
+		if (scandesc->index_done)
+			break;
+
+		Assert(!QUEUE_FULL(scandesc->xs_heapfetch->blk_queue));
+		do
+		{
+			ItemPointer tid = index_getnext_tid(scandesc, direction);
+
+			if (!tid)
+			{
+				scandesc->index_done = true;
+				break;
+			}
+
+			blk_enqueue(ItemPointerGetBlockNumber(tid), &scandesc->xs_heapfetch->blk_queue);
+		} while (!QUEUE_FULL(scandesc->xs_heapfetch->blk_queue));
 	}
 
 	/*
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fea..6465963f837 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -34,6 +34,7 @@ struct PgStreamingRead
 	int			pinned_buffers_trigger;
 	int			next_tail_buffer;
 	bool		finished;
+	bool		resumable;
 	void	   *pgsr_private;
 	PgStreamingReadBufferCB callback;
 	BufferAccessStrategy strategy;
@@ -292,7 +293,8 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
 		if (blocknum == InvalidBlockNumber)
 		{
-			pgsr->finished = true;
+			if (!pgsr->resumable)
+				pgsr->finished = true;
 			break;
 		}
 		bmr = pgsr->bmr;
@@ -433,3 +435,9 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 		pfree(pgsr->per_buffer_data);
 	pfree(pgsr);
 }
+
+void
+pg_streaming_read_set_resumable(PgStreamingRead *pgsr)
+{
+	pgsr->resumable = true;
+}
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..d476cb206d5 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -18,6 +18,7 @@
 #include "access/itup.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/streaming_read.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
 
@@ -104,6 +105,8 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	PgStreamingRead *pgsr;
+	BlockNumber blk_queue;
 } IndexFetchTableData;
 
 /*
@@ -162,6 +165,9 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	bool	index_done;
+
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c541..2288b7b5eb0 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -42,4 +42,6 @@ extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
 
+extern void pg_streaming_read_set_resumable(PgStreamingRead *pgsr);
+
 #endif
-- 
2.37.2



  [text/plain] 0001-Streaming-Read-API.txt (56.0K, 3-0001-Streaming-Read-API.txt)
From f6cb591ba520351ab7f0e7cbf9d6df3dacda6b44 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, how many pieces of advice telling the kernel that we will soon read
+	 * may be outstanding at once.  This number also affects how far we look
+	 * ahead for opportunities to start
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0.  We want to allow at least
+	 * one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue
+	 * advice to the kernel to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interferes with the kernel's own sequential read ahead
+	 * heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29fd1cae641..018ebbcbaae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2
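
For reference, the arithmetic behind the MAX_BUFFERS_PER_TRANSFER definition in the bufmgr.h hunk above can be tried out in isolation. This is only a standalone sketch: max_buffers_per_transfer() is a hypothetical helper, and its iov_max and blcksz parameters stand in for PG_IOV_MAX and BLCKSZ, whose real values depend on the platform and build.

```c
#include <assert.h>

/*
 * Standalone model of the macro from the patch:
 *     Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 * iov_max and blcksz are plain parameters here (assumptions, not the real
 * build-time constants) so different combinations can be compared.
 */
static int
max_buffers_per_transfer(int iov_max, int blcksz)
{
    int         by_size;

    assert(iov_max > 0 && blcksz > 0);

    /* Number of blocks that fit into one 128kB transfer. */
    by_size = (128 * 1024) / blcksz;

    /* The iovec limit wins whenever it is the smaller of the two. */
    return (iov_max < by_size) ? iov_max : by_size;
}
```

With 8kB blocks and an assumed iovec limit of 32, the size bound applies and the result is 16 buffers; with 1kB blocks the iovec limit becomes the bound instead.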




* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
@ 2024-01-22 04:53                 ` Peter Smith <[email protected]>
  1 sibling, 0 replies; 25+ messages in thread

From: Peter Smith @ 2024-01-22 04:53 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

2024-01 Commitfest.

Hi, this patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures the last time it was run [2]. Please
have a look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4351/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4351

Kind Regards,
Peter Smith.






* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
@ 2024-01-23 17:43                 ` Tomas Vondra <[email protected]>
  2024-01-24 00:51                   ` Re: index prefetching Melanie Plageman <[email protected]>
  1 sibling, 1 reply; 25+ messages in thread

From: Tomas Vondra @ 2024-01-23 17:43 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On 1/19/24 22:43, Melanie Plageman wrote:
> On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
> <[email protected]> wrote:
>>
>> On 1/9/24 21:31, Robert Haas wrote:
>>> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
>>> <[email protected]> wrote:
>>>> Here's a somewhat reworked version of the patch. My initial goal was to
>>>> see if it could adopt the StreamingRead API proposed in [1], but that
>>>> turned out to be less straight-forward than I hoped, for two reasons:
>>>
>>> I guess we need Thomas or Andres or maybe Melanie to comment on this.
>>>
>>
>> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
>> streaming I/O stuff.
> 
> I've been studying your patch with the intent of finding a way to
> change it and/or the streaming read API to work together. I've
> attached a very rough sketch of how I think it could work.
> 

Thanks.

> We fill a queue with blocks from TIDs that we fetched from the index.
> The queue is saved in a scan descriptor that is made available to the
> streaming read callback. Once the queue is full, we invoke the table
> AM specific index_fetch_tuple() function which calls
> pg_streaming_read_buffer_get_next(). When the streaming read API
> invokes the callback we registered, it simply dequeues a block number
> for prefetching.

So in a way there are two queues in IndexFetchTableData: one (blk_queue)
that is filled from IndexNext, and then the queue in StreamingRead.

> The only change to the streaming read API is that now, even if the
> callback returns InvalidBlockNumber, we may not be finished, so make
> it resumable.
> 

Hmm, I'm not sure when the callback can return InvalidBlockNumber before
reaching the end. Perhaps for the first index_fetch_heap call? Any
reason not to fill the blk_queue before calling index_fetch_heap?


> Structurally, this changes the timing of when the heap blocks are
> prefetched. Your code would get a tid from the index and then prefetch
> the heap block -- doing this until it filled a queue that had the
> actual tids saved in it. With my approach and the streaming read API,
> you fetch tids from the index until you've filled up a queue of block
> numbers. Then the streaming read API will prefetch those heap blocks.
> 

And is that a good/desirable change? I'm not saying it's not, but maybe
we should not be filling either queue in one go - we don't want to
overload the prefetching.

> I didn't actually implement the block queue -- I just saved a single
> block number and pretended it was a block queue. I was imagining we
> replace this with something like your IndexPrefetch->blockItems --
> which has light deduplication. We'd probably have to flesh it out more
> than that.
> 

I don't understand how this passes the TID to index_fetch_heap.
Isn't it working only by accident, due to blk_queue only having a single
entry? Shouldn't the first queue (blk_queue) store TIDs instead?

> There are also table AM layering violations in my sketch which would
> have to be worked out (not to mention some resource leakage I didn't
> bother investigating [which causes it to fail tests]).
> 
> 0001 is all of Thomas' streaming read API code that isn't yet in
> master and 0002 is my rough sketch of index prefetching using the
> streaming read API
> 
> There are also numerous optimizations that your index prefetching
> patch set does that would need to be added in some way. I haven't
> thought much about it yet. I wanted to see what you thought of this
> approach first. Basically, is it workable?
> 

It seems workable, yes. I'm not sure it's much simpler than my patch
(considering a lot of the code is in the optimizations, which are
missing from this patch).

I think the question is where the optimizations should happen. I suppose
some of them might/should happen in the StreamingRead API itself - like
the detection of sequential patterns, recently prefetched blocks, ...
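
To make the first of these concrete: detecting a sequential pattern can be as small as remembering the last requested block. The following is a minimal standalone sketch, not the StreamingRead code. SeqDetector and needs_prefetch_advice() are hypothetical names, and the BlockNumber typedef is repeated only so the fragment compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical per-stream state: just the previous request. */
typedef struct SeqDetector
{
    BlockNumber last_block;     /* last block requested, or Invalid */
} SeqDetector;

static void
seq_detector_init(SeqDetector *sd)
{
    sd->last_block = InvalidBlockNumber;
}

/*
 * Returns true if explicit prefetch advice looks worthwhile for blkno,
 * i.e. the request does not simply continue a run of consecutive blocks
 * (for which kernel readahead is assumed to do a better job).
 */
static bool
needs_prefetch_advice(SeqDetector *sd, BlockNumber blkno)
{
    bool        sequential;

    assert(blkno != InvalidBlockNumber);

    sequential = (sd->last_block != InvalidBlockNumber &&
                  blkno == sd->last_block + 1);
    sd->last_block = blkno;

    return !sequential;
}
```

A real implementation would presumably want some hysteresis, so that a single jump does not flip the decision, but the state needed stays this small.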

But I'm not sure what to do about optimizations that are more specific
to the access path. Consider for example the index-only scans. We don't
want to prefetch all the pages, we need to inspect the VM and prefetch
just the not-all-visible ones. And then pass the info to the index scan,
so that it does not need to check the VM again. It's not clear to me how
to do this with this approach.


The main


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-23 17:43                 ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2024-01-24 00:51                   ` Melanie Plageman <[email protected]>
  2024-01-24 09:19                     ` Re: index prefetching Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Melanie Plageman @ 2024-01-24 00:51 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
<[email protected]> wrote:
>
> On 1/19/24 22:43, Melanie Plageman wrote:
>
> > We fill a queue with blocks from TIDs that we fetched from the index.
> > The queue is saved in a scan descriptor that is made available to the
> > streaming read callback. Once the queue is full, we invoke the table
> > AM specific index_fetch_tuple() function which calls
> > pg_streaming_read_buffer_get_next(). When the streaming read API
> > invokes the callback we registered, it simply dequeues a block number
> > for prefetching.
>
> So in a way there are two queues in IndexFetchTableData: one (blk_queue)
> that is filled from IndexNext, and then the queue in StreamingRead.

I've changed the name from blk_queue to tid_queue to fix the issue you
mention in your later remarks. I suppose there are two queues. The
tid_queue is just to pass the block requests to the streaming read API.
The prefetch distance will be the smaller of the two sizes.

> > The only change to the streaming read API is that now, even if the
> > callback returns InvalidBlockNumber, we may not be finished, so make
> > it resumable.
>
> Hmm, I'm not sure when the callback can return InvalidBlockNumber before
> reaching the end. Perhaps for the first index_fetch_heap call? Any
> reason not to fill the blk_queue before calling index_fetch_heap?

The callback will return InvalidBlockNumber whenever the queue is
empty. Let's say your queue size is 5 and your effective prefetch
distance is 10 (some combination of the PgStreamingReadRange sizes and
PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
the callback returns InvalidBlockNumber. Then the tid_queue is filled
with 5 tids. Then index_fetch_heap() is called.
pg_streaming_read_look_ahead() will prefetch all 5 of these TIDs'
blocks, emptying the queue. Once all 5 have been dequeued, the
callback will return InvalidBlockNumber.
pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
a buffer and save the associated TID in the per_buffer_data. Before
index_fetch_heap() is called again, we will see that the queue is not
full and fill it up again with 5 TIDs. So, the callback will return
InvalidBlockNumber 3 times in this scenario.
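
The scenario above can be replayed with a toy model. This is a sketch under stated assumptions, not the patch: ToyScan and the toy_* functions are hypothetical, the callback signature is simplified compared to PgStreamingReadBufferCB, and the look-ahead loop (keep calling the callback until the distance is reached or the queue is empty) is an assumed reading of how pg_streaming_read_look_ahead() behaves.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

#define QUEUE_SIZE 5            /* capacity of the toy tid_queue */
#define MAX_DISTANCE 10         /* toy "effective prefetch distance" */

typedef struct ToyScan
{
    BlockNumber queue[QUEUE_SIZE];      /* blocks waiting for the callback */
    int         nqueued;
    BlockNumber inflight[MAX_DISTANCE]; /* blocks handed to the stream */
    int         ninflight;
    int         invalid_returns;        /* times the callback had nothing */
} ToyScan;

/* Callback: dequeue the oldest block, or report that the queue is empty. */
static BlockNumber
toy_next_block(ToyScan *scan)
{
    BlockNumber blkno;

    if (scan->nqueued == 0)
    {
        scan->invalid_returns++;
        return InvalidBlockNumber;
    }
    blkno = scan->queue[0];
    for (int i = 1; i < scan->nqueued; i++)
        scan->queue[i - 1] = scan->queue[i];
    scan->nqueued--;
    return blkno;
}

/* Executor side: top the queue up with consecutive block numbers. */
static void
toy_fill_queue(ToyScan *scan, BlockNumber *next_blkno)
{
    while (scan->nqueued < QUEUE_SIZE)
        scan->queue[scan->nqueued++] = (*next_blkno)++;
}

/* Stream side: pull blocks until the distance is reached or the queue is dry. */
static void
toy_look_ahead(ToyScan *scan)
{
    while (scan->ninflight < MAX_DISTANCE)
    {
        BlockNumber blkno = toy_next_block(scan);

        if (blkno == InvalidBlockNumber)
            break;              /* not the end, just nothing queued yet */
        scan->inflight[scan->ninflight++] = blkno;
    }
}

/* Return the oldest in-flight block, or InvalidBlockNumber if there is none. */
static BlockNumber
toy_get_next(ToyScan *scan)
{
    BlockNumber blkno;

    toy_look_ahead(scan);
    if (scan->ninflight == 0)
        return InvalidBlockNumber;
    blkno = scan->inflight[0];
    for (int i = 1; i < scan->ninflight; i++)
        scan->inflight[i - 1] = scan->inflight[i];
    scan->ninflight--;
    return blkno;
}
```

Calling toy_get_next() once on an empty scan, then alternating toy_fill_queue() and toy_get_next() twice, leaves invalid_returns at 3 in this model: once for the empty first call, and once after each batch of 5 has been drained.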

> > Structurally, this changes the timing of when the heap blocks are
> > prefetched. Your code would get a tid from the index and then prefetch
> > the heap block -- doing this until it filled a queue that had the
> > actual tids saved in it. With my approach and the streaming read API,
> > you fetch tids from the index until you've filled up a queue of block
> > numbers. Then the streaming read API will prefetch those heap blocks.
>
> And is that a good/desirable change? I'm not saying it's not, but maybe
> we should not be filling either queue in one go - we don't want to
> overload the prefetching.

We can focus on the prefetch distance algorithm maintained in the
streaming read API and then make sure that the tid_queue is larger
than the desired prefetch distance maintained by the streaming read
API.

> > I didn't actually implement the block queue -- I just saved a single
> > block number and pretended it was a block queue. I was imagining we
> > replace this with something like your IndexPrefetch->blockItems --
> > which has light deduplication. We'd probably have to flesh it out more
> > than that.
>
> I don't understand how this passes the TID to index_fetch_heap.
> Isn't it working only by accident, due to blk_queue only having a single
> entry? Shouldn't the first queue (blk_queue) store TIDs instead?

Oh dear! Fixed in the attached v2. I've replaced the single
BlockNumber with a single ItemPointerData. I will work on implementing
an actual queue next week.
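
As a rough picture of what an actual queue could look like: a ring buffer of TIDs, with the callback copying each TID into the per-buffer data as it hands out the block number. This is a standalone sketch and not the v2 code. ToyTid is a simplified stand-in for ItemPointerData, ToyTidQueue and the toy_tid_* functions are hypothetical, and the callback omits the PgStreamingRead pointer that the real PgStreamingReadBufferCB takes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Simplified stand-in for ItemPointerData (block number plus line pointer). */
typedef struct ToyTid
{
    BlockNumber block;
    uint16_t    offset;
} ToyTid;

#define TID_QUEUE_SIZE 8

/* Hypothetical ring buffer of TIDs kept in the scan descriptor. */
typedef struct ToyTidQueue
{
    ToyTid      items[TID_QUEUE_SIZE];
    int         head;           /* index of the oldest entry */
    int         count;          /* number of entries in use */
} ToyTidQueue;

/* Append a TID; returns 0 if the queue is full, 1 otherwise. */
static int
toy_tid_enqueue(ToyTidQueue *q, ToyTid tid)
{
    if (q->count == TID_QUEUE_SIZE)
        return 0;               /* full, caller should consume first */
    q->items[(q->head + q->count) % TID_QUEUE_SIZE] = tid;
    q->count++;
    return 1;
}

/*
 * Callback: pgsr_private is the queue, per_buffer_private receives the TID
 * so that whoever later gets the buffer also learns which tuple it was for.
 */
static BlockNumber
toy_tid_next_block(void *pgsr_private, void *per_buffer_private)
{
    ToyTidQueue *q = (ToyTidQueue *) pgsr_private;
    ToyTid      tid;

    if (q->count == 0)
        return InvalidBlockNumber;

    tid = q->items[q->head];
    q->head = (q->head + 1) % TID_QUEUE_SIZE;
    q->count--;

    memcpy(per_buffer_private, &tid, sizeof(ToyTid));
    return tid.block;
}
```

Two TIDs on the same heap block come back as two separate requests here; any deduplication would have to be layered on top.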

> > There are also table AM layering violations in my sketch which would
> > have to be worked out (not to mention some resource leakage I didn't
> > bother investigating [which causes it to fail tests]).
> >
> > 0001 is all of Thomas' streaming read API code that isn't yet in
> > master and 0002 is my rough sketch of index prefetching using the
> > streaming read API
> >
> > There are also numerous optimizations that your index prefetching
> > patch set does that would need to be added in some way. I haven't
> > thought much about it yet. I wanted to see what you thought of this
> > approach first. Basically, is it workable?
>
> It seems workable, yes. I'm not sure it's much simpler than my patch
> (considering a lot of the code is in the optimizations, which are
> missing from this patch).
>
> I think the question is where the optimizations should happen. I suppose
> some of them might/should happen in the StreamingRead API itself - like
> the detection of sequential patterns, recently prefetched blocks, ...

So, the streaming read API detects sequential patterns and avoids
prefetching blocks that are already in shared buffers. AFAIK it doesn't
yet avoid prefetching recently prefetched blocks. But I daresay this
would be relevant for other streaming read users and could certainly be
implemented there.

> But I'm not sure what to do about optimizations that are more specific
> to the access path. Consider for example the index-only scans. We don't
> want to prefetch all the pages, we need to inspect the VM and prefetch
> just the not-all-visible ones. And then pass the info to the index scan,
> so that it does not need to check the VM again. It's not clear to me how
> to do this with this approach.

Yea, this is an issue I'll need to think about. To really spell out
the problem: the callback dequeues a TID from the tid_queue and looks
up its block in the VM. It's all visible. So, it shouldn't return that
block to the streaming read API to fetch from the heap because it
doesn't need to be read. But, where does the callback put the TID so
that the caller can get it? I'm going to think more about this.

As for passing around the all visible status so as to not reread the
VM block -- that feels solvable but I haven't looked into it.

- Melanie

Attachments:

  [application/octet-stream] v2-0002-use-streaming-reads-in-index-scan.nocfbot (7.7K, 2-v2-0002-use-streaming-reads-in-index-scan.nocfbot)

  [application/octet-stream] v2-0001-Streaming-Read-API.nocfbot (55.9K, 3-v2-0001-Streaming-Read-API.nocfbot)

From: Tomas Vondra @ 2024-01-24 09:19 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On 1/24/24 01:51, Melanie Plageman wrote:
> On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
> <[email protected]> wrote:
>>
>> On 1/19/24 22:43, Melanie Plageman wrote:
>>
>>> We fill a queue with blocks from TIDs that we fetched from the index.
>>> The queue is saved in a scan descriptor that is made available to the
>>> streaming read callback. Once the queue is full, we invoke the table
>>> AM specific index_fetch_tuple() function which calls
>>> pg_streaming_read_buffer_get_next(). When the streaming read API
>>> invokes the callback we registered, it simply dequeues a block number
>>> for prefetching.
>>
>> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
>> is being filled from IndexNext, and then the queue in StreamingRead.
> 
> I've changed the name from blk_queue to tid_queue to fix the issue you
> mention in your later remarks.
> I suppose there are two queues. The tid_queue is just to pass the
> block requests to the streaming read API. The prefetch distance will
> be the smaller of the two sizes.
> 

FWIW I think the two queues are a nice / elegant approach. In hindsight
my problems with trying to utilize the StreamingRead were due to trying
to use the block-oriented API directly from places that work with TIDs,
and this just makes that go away.

I wonder what the overhead of shuffling stuff between queues will be,
but hopefully not too high (that's my assumption).

>>> The only change to the streaming read API is that now, even if the
>>> callback returns InvalidBlockNumber, we may not be finished, so make
>>> it resumable.
>>
>> Hmm, not sure when can the callback return InvalidBlockNumber before
>> reaching the end. Perhaps for the first index_fetch_heap call? Any
>> reason not to fill the blk_queue before calling index_fetch_heap?
> 
> The callback will return InvalidBlockNumber whenever the queue is
> empty. Let's say your queue size is 5 and your effective prefetch
> distance is 10 (some combination of the PgStreamingReadRange sizes and
> PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
> the callback returns InvalidBlockNumber. Then the tid_queue is filled
> with 5 tids. Then index_fetch_heap() is called.
> pg_streaming_read_look_ahead() will prefetch all 5 of these TID's
> blocks, emptying the queue. Once all 5 have been dequeued, the
> callback will return InvalidBlockNumber.
> pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
> a buffer and save the associated TID in the per_buffer_data. Before
> index_fetch_heap() is called again, we will see that the queue is not
> full and fill it up again with 5 TIDs. So, the callback will return
> InvalidBlockNumber 3 times in this scenario.
> 

Thanks for the explanation. Yes, I didn't realize that the queues may be
of different lengths, in which case it makes sense to return
InvalidBlockNumber to signal that the TID queue is empty.
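
For illustration, the behaviour described above can be sketched in
standalone C. This is not code from the patch: SketchTid, SketchTidQueue,
sketch_tid_queue_push() and sketch_next_block() are hypothetical stand-ins,
the queue size of 5 is taken from the example, and the real callback would
have the PgStreamingReadBufferCB signature rather than the simplified one
used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL types, for illustration only. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct SketchTid
{
	BlockNumber block;
	uint16_t	offset;
} SketchTid;

#define TID_QUEUE_SIZE 5

/* Hypothetical fixed-size ring of TIDs filled by the index scan. */
typedef struct SketchTidQueue
{
	SketchTid	items[TID_QUEUE_SIZE];
	int			head;			/* next slot to dequeue */
	int			count;			/* number of queued TIDs */
} SketchTidQueue;

/* Append one TID; returns false if the queue is already full. */
static bool
sketch_tid_queue_push(SketchTidQueue *q, SketchTid tid)
{
	if (q->count == TID_QUEUE_SIZE)
		return false;
	q->items[(q->head + q->count) % TID_QUEUE_SIZE] = tid;
	q->count++;
	return true;
}

/*
 * Callback in the style of the streaming read API: dequeue one TID, stash
 * it in the per-buffer data, and hand back its block number.  An empty
 * queue yields InvalidBlockNumber, which here means "nothing queued right
 * now", not necessarily "end of scan".
 */
static BlockNumber
sketch_next_block(SketchTidQueue *q, SketchTid *per_buffer_data)
{
	SketchTid	tid;

	assert(q->count >= 0 && q->count <= TID_QUEUE_SIZE);
	if (q->count == 0)
		return InvalidBlockNumber;

	tid = q->items[q->head];
	q->head = (q->head + 1) % TID_QUEUE_SIZE;
	q->count--;
	*per_buffer_data = tid;
	return tid.block;
}
```

With this shape the caller refills the queue between calls, so an
InvalidBlockNumber result only ends the scan once the index itself has no
more TIDs to offer.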

>>> Structurally, this changes the timing of when the heap blocks are
>>> prefetched. Your code would get a tid from the index and then prefetch
>>> the heap block -- doing this until it filled a queue that had the
>>> actual tids saved in it. With my approach and the streaming read API,
>>> you fetch tids from the index until you've filled up a queue of block
>>> numbers. Then the streaming read API will prefetch those heap blocks.
>>
>> And is that a good/desirable change? I'm not saying it's not, but maybe
>> we should not be filling either queue in one go - we don't want to
>> overload the prefetching.
> 
> We can focus on the prefetch distance algorithm maintained in the
> streaming read API and then make sure that the tid_queue is larger
> than the desired prefetch distance maintained by the streaming read
> API.
> 

Agreed. I think I wasn't quite right when I was concerned about
"overloading" the prefetch, because that depends entirely on the
StreamingRead API queue. A large TID queue can't overload anything.

What could happen is that the TID queue is too small, so the prefetch
can't reach the target distance. But that can happen already, e.g. with
indexes that are correlated and/or index-only scans with all-visible pages.
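
A rough way to see this, as a standalone sketch: the distance actually
reached is bounded by the smaller of the two queue sizes, and a correlated
index makes it smaller still because many queued TIDs share a heap block.
Both function names are hypothetical, and collapsing adjacent duplicates is
only an assumption about how "light deduplication" might count blocks.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* simplified stand-in */

/* The reachable distance is limited by whichever queue is smaller. */
static int
sketch_effective_distance(int tid_queue_size, int streaming_read_distance)
{
	assert(tid_queue_size >= 0 && streaming_read_distance >= 0);
	return tid_queue_size < streaming_read_distance ?
		tid_queue_size : streaming_read_distance;
}

/*
 * Count how many distinct block requests a queue of TIDs produces when
 * adjacent TIDs on the same block are collapsed into one request.
 */
static int
sketch_count_block_runs(const BlockNumber *blocks, int n)
{
	int			runs = 0;

	for (int i = 0; i < n; i++)
	{
		if (i == 0 || blocks[i] != blocks[i - 1])
			runs++;
	}
	return runs;
}
```

Five TIDs from a well-correlated index may cover only one or two heap
blocks, while five TIDs from an uncorrelated index typically cover five.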

>>> There are also table AM layering violations in my sketch which would
>>> have to be worked out (not to mention some resource leakage I didn't
>>> bother investigating [which causes it to fail tests]).
>>>
>>> 0001 is all of Thomas' streaming read API code that isn't yet in
>>> master and 0002 is my rough sketch of index prefetching using the
>>> streaming read API
>>>
>>> There are also numerous optimizations that your index prefetching
>>> patch set does that would need to be added in some way. I haven't
>>> thought much about it yet. I wanted to see what you thought of this
>>> approach first. Basically, is it workable?
>>
>> It seems workable, yes. I'm not sure it's much simpler than my patch
>> (considering a lot of the code is in the optimizations, which are
>> missing from this patch).
>>
>> I think the question is where should the optimizations happen. I suppose
>> some of them might/should happen in the StreamingRead API itself - like
>> the detection of sequential patterns, recently prefetched blocks, ...
> 
> So, the streaming read API does detection of sequential patterns and
> not prefetching things that are in shared buffers. It doesn't handle
> avoiding prefetching recently prefetched blocks yet AFAIK. But I
> daresay this would be relevant for other streaming read users and
> could certainly be implemented there.
> 

Yes, the "recently prefetched stuff" cache seems like a fairly natural
complement to the pattern detection and shared-buffers check.

FWIW I wonder if we should make some of this customizable, so that
systems with customized storage (e.g. neon or with direct I/O) can e.g.
disable some of these checks. Or replace them with their version.

>> But I'm not sure what to do about optimizations that are more specific
>> to the access path. Consider for example the index-only scans. We don't
>> want to prefetch all the pages, we need to inspect the VM and prefetch
>> just the not-all-visible ones. And then pass the info to the index scan,
>> so that it does not need to check the VM again. It's not clear to me how
>> to do this with this approach.
> 
> Yea, this is an issue I'll need to think about. To really spell out
> the problem: the callback dequeues a TID from the tid_queue and looks
> up its block in the VM. It's all visible. So, it shouldn't return that
> block to the streaming read API to fetch from the heap because it
> doesn't need to be read. But, where does the callback put the TID so
> that the caller can get it? I'm going to think more about this.
> 

Yes, that's the problem for index-only scans. I'd generalize it so that
it's about the callback being able to (a) decide if it needs to read the
heap page, and (b) store some custom info for the TID.
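
One possible shape for (a) and (b), sketched in standalone C under heavy
assumptions: the callback does not dequeue TIDs but advances its own cursor
over a shared array, records the VM result in each entry, and returns only
blocks that need a heap read. SketchItem, SketchScan and
sketch_ios_next_block() are hypothetical names, the visibility map is a toy
bool array, and none of this is taken from either patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL types, for illustration only. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct SketchItem
{
	BlockNumber block;
	bool		all_visible;	/* custom per-TID info set by the callback */
} SketchItem;

typedef struct SketchScan
{
	SketchItem *items;			/* TIDs in index order */
	int			nitems;
	int			next_prefetch;	/* callback's cursor into items[] */
	const bool *vm;				/* toy visibility map, one flag per block */
	BlockNumber vm_nblocks;
} SketchScan;

/*
 * (a) Decide per TID whether the heap page must be read, and
 * (b) remember the VM result in the item so the consumer does not have to
 * check the VM a second time.  All-visible items are skipped over and
 * never handed to the streaming read.
 */
static BlockNumber
sketch_ios_next_block(SketchScan *scan)
{
	while (scan->next_prefetch < scan->nitems)
	{
		SketchItem *item = &scan->items[scan->next_prefetch++];

		assert(item->block < scan->vm_nblocks);
		item->all_visible = scan->vm[item->block];
		if (!item->all_visible)
			return item->block;
	}
	return InvalidBlockNumber;
}
```

The consumer would then walk items[] in index order and fetch a buffer only
for entries whose all_visible flag is false.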


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From: Melanie Plageman @ 2024-01-24 20:20 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>

On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
<[email protected]> wrote:
>
> On 1/24/24 01:51, Melanie Plageman wrote:
>
> >>> There are also table AM layering violations in my sketch which would
> >>> have to be worked out (not to mention some resource leakage I didn't
> >>> bother investigating [which causes it to fail tests]).
> >>>
> >>> 0001 is all of Thomas' streaming read API code that isn't yet in
> >>> master and 0002 is my rough sketch of index prefetching using the
> >>> streaming read API
> >>>
> >>> There are also numerous optimizations that your index prefetching
> >>> patch set does that would need to be added in some way. I haven't
> >>> thought much about it yet. I wanted to see what you thought of this
> >>> approach first. Basically, is it workable?
> >>
> >> It seems workable, yes. I'm not sure it's much simpler than my patch
> >> (considering a lot of the code is in the optimizations, which are
> >> missing from this patch).
> >>
> >> I think the question is where should the optimizations happen. I suppose
> >> some of them might/should happen in the StreamingRead API itself - like
> >> the detection of sequential patterns, recently prefetched blocks, ...
> >
> > So, the streaming read API does detection of sequential patterns and
> > not prefetching things that are in shared buffers. It doesn't handle
> > avoiding prefetching recently prefetched blocks yet AFAIK. But I
> > daresay this would be relevant for other streaming read users and
> > could certainly be implemented there.
> >
>
> Yes, the "recently prefetched stuff" cache seems like a fairly natural
> complement to the pattern detection and shared-buffers check.
>
> FWIW I wonder if we should make some of this customizable, so that
> systems with customized storage (e.g. neon or with direct I/O) can e.g.
> disable some of these checks. Or replace them with their version.

That's a promising idea.

> >> But I'm not sure what to do about optimizations that are more specific
> >> to the access path. Consider for example the index-only scans. We don't
> >> want to prefetch all the pages, we need to inspect the VM and prefetch
> >> just the not-all-visible ones. And then pass the info to the index scan,
> >> so that it does not need to check the VM again. It's not clear to me how
> >> to do this with this approach.
> >
> > Yea, this is an issue I'll need to think about. To really spell out
> > the problem: the callback dequeues a TID from the tid_queue and looks
> > up its block in the VM. It's all visible. So, it shouldn't return that
> > block to the streaming read API to fetch from the heap because it
> > doesn't need to be read. But, where does the callback put the TID so
> > that the caller can get it? I'm going to think more about this.
> >
>
> Yes, that's the problem for index-only scans. I'd generalize it so that
> it's about the callback being able to (a) decide if it needs to read the
> heap page, and (b) store some custom info for the TID.

Actually, I think this is no big deal. See attached. I just don't
enqueue tids whose blocks are all visible. I had to switch the order
from "fetch heap, then fill queue" to "fill queue, then fetch heap".
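
As a standalone illustration of the idea (not the attached patch), the
fill step might look roughly like the sketch below. SketchVM,
sketch_vm_all_visible() and sketch_fill_queue() are hypothetical names, the
visibility map is a toy bool array, TIDs are reduced to their block
numbers, and the sketch ignores how all-visible tuples are returned in
index order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* simplified stand-in */

/* Toy visibility map: one flag per heap block. */
typedef struct SketchVM
{
	const bool *all_visible;
	BlockNumber nblocks;
} SketchVM;

static bool
sketch_vm_all_visible(const SketchVM *vm, BlockNumber blk)
{
	return blk < vm->nblocks && vm->all_visible[blk];
}

/*
 * Consume index entries until the queue is full or the input runs out.
 * Entries on all-visible blocks are counted as answerable from the index
 * alone and are never enqueued, so the streaming read callback never sees
 * them.  Returns the number of input entries consumed.
 */
static size_t
sketch_fill_queue(const SketchVM *vm,
				  const BlockNumber *index_blocks, size_t ninput,
				  BlockNumber *queue, size_t queue_cap,
				  size_t *nqueued, size_t *nskipped)
{
	size_t		consumed = 0;

	assert(queue_cap > 0);
	*nqueued = 0;
	*nskipped = 0;

	while (consumed < ninput && *nqueued < queue_cap)
	{
		BlockNumber blk = index_blocks[consumed++];

		if (sketch_vm_all_visible(vm, blk))
			(*nskipped)++;
		else
			queue[(*nqueued)++] = blk;
	}
	return consumed;
}
```

Filling before fetching matters here because the queue must already hold
the not-all-visible blocks by the time the streaming read looks ahead.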

While doing this I noticed some wrong results in the regression tests
(like in the alter table test), so I suspect I have some kind of
control flow issue. Perhaps I should fix the resource leak so I can
actually see the failing tests :)

As for your a) and b) above.

Regarding a): We discussed allowing speculative prefetching and
separating the logic for prefetching from actually reading blocks (so
you can prefetch blocks you ultimately don't read). We decided this
may not belong in a streaming read API. What do you think?

Regarding b): We can store per buffer data for anything that actually
goes down through the streaming read API, but, in the index only case,
we don't want the streaming read API to know about blocks that it
doesn't actually need to read.

- Melanie

From f6cb591ba520351ab7f0e7cbf9d6df3dacda6b44 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v3 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..9617bf130b 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..8775b5789b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7..13e5376619 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885..5d843b6142 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..19605090fe
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we have a chance to build up a
+	 * full sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0.  We want to allow at
+	 * least one I/O, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue
+	 * advice to the kernel to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interferes with the kernel's own sequential read ahead
+	 * heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
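
As a rough standalone model of the range-building rule in pg_streaming_read_look_ahead() above (not part of the patch), the sketch below coalesces a stream of block numbers into ranges. A block extends the head range only when both are cache misses, the block number is the next consecutive one, and the range is not yet full. ToyRange, toy_coalesce and TOY_MAX_RANGE are hypothetical names invented for this illustration. Pins, the I/O limit and the circular queue are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap; stands in for lengthof(head_range->buffers). */
#define TOY_MAX_RANGE 4

typedef struct ToyRange
{
	uint32_t	blocknum;		/* first block in the range */
	int			nblocks;		/* number of consecutive blocks */
	bool		need_complete;	/* true if the blocks were cache misses */
} ToyRange;

/*
 * Coalesce (block, hit) pairs into ranges.  A hit always gets a range of
 * its own; a miss extends the previous range only if that range is also a
 * miss, is not full, and ends right before this block.  Returns the number
 * of ranges written to 'out'.
 */
size_t
toy_coalesce(const uint32_t *blocks, const bool *hits, size_t n,
			 ToyRange *out, size_t out_cap)
{
	size_t		nranges = 0;

	for (size_t i = 0; i < n; i++)
	{
		bool		need_complete = !hits[i];
		ToyRange   *head = nranges > 0 ? &out[nranges - 1] : NULL;

		if (head != NULL &&
			need_complete &&
			head->need_complete &&
			head->nblocks < TOY_MAX_RANGE &&
			head->blocknum + (uint32_t) head->nblocks == blocks[i])
		{
			/* We can extend the existing range by one block. */
			head->nblocks++;
			continue;
		}

		/* Otherwise start a new range beginning at this block. */
		assert(nranges < out_cap);
		out[nranges].blocknum = blocks[i];
		out[nranges].nblocks = 1;
		out[nranges].need_complete = need_complete;
		nranges++;
	}

	return nranges;
}
```

Under these assumptions, feeding blocks 10, 11, 12 (misses), 20 (hit), 21, 22 (misses) produces three ranges: one three-block read, one single-block hit, and one two-block read.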
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..2157a97b97 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8da..717b8f58da 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..0d7272e796 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..a38f1acb37 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..d8ffe397fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..40c3408c54
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff..6636cc82c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29fd1cae64..018ebbcbaa 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2
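
For readers following the 1/2 patch above, the callback contract can be
tried outside the backend. What follows is a standalone sketch, not code
from the patch: next_block_cb, range_private, range_next and count_ranges
are hypothetical names, and the real PgStreamingReadBufferCB also receives
the PgStreamingRead object and a per-buffer data pointer. It only shows how
a "return the next block or InvalidBlockNumber" callback lets the consumer
merge consecutive blocks into ranges capped at a per-transfer maximum. With
the default 8kB BLCKSZ, and assuming PG_IOV_MAX is at least 16, the 128kB
cap in MAX_BUFFERS_PER_TRANSFER works out to 16 blocks.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical, simplified stand-in for the patch's callback type. */
typedef BlockNumber (*next_block_cb) (void *private_data);

/* Private state in the style of pg_prewarm_streaming_read_private. */
struct range_private
{
	BlockNumber blocknum;		/* next block to hand out */
	int64_t		last_block;		/* inclusive upper bound */
};

/* Return the next block in [blocknum, last_block], then signal the end. */
BlockNumber
range_next(void *private_data)
{
	struct range_private *p = private_data;

	if (p->blocknum <= p->last_block)
		return p->blocknum++;

	return InvalidBlockNumber;
}

/*
 * Drain the callback, merging consecutive block numbers into ranges of at
 * most max_per_range blocks.  Returns the number of ranges built, which is
 * the number of reads a consumer of this shape would issue.
 */
int
count_ranges(next_block_cb cb, void *private_data, int max_per_range)
{
	int			nranges = 0;
	int			nblocks = 0;	/* blocks in the range being built */
	BlockNumber expected = InvalidBlockNumber;
	BlockNumber blocknum;

	while ((blocknum = cb(private_data)) != InvalidBlockNumber)
	{
		if (nblocks > 0 && blocknum == expected && nblocks < max_per_range)
			nblocks++;			/* extends the current range */
		else
		{
			nranges++;			/* start a new range */
			nblocks = 1;
		}
		expected = blocknum + 1;
	}

	return nranges;
}
```

In this sketch, forty consecutive blocks with a cap of 16 produce three
ranges (16 + 16 + 8), while an input that is exhausted from the start
produces none.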


From 339b39d7f2cc5fc4bd6aa3429a12e6f3a4f9d2db Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 19 Jan 2024 16:10:30 -0500
Subject: [PATCH v3 2/2] use streaming reads in index scan

ci-os-only:
---
 src/backend/access/heap/heapam_handler.c | 19 ++++++--
 src/backend/access/index/indexam.c       | 42 +++++++++++++++++
 src/backend/executor/nodeIndexonlyscan.c | 26 ++++++++++-
 src/backend/executor/nodeIndexscan.c     | 57 +++++++++++++++++++-----
 src/backend/storage/aio/streaming_read.c | 10 ++++-
 src/include/access/relscan.h             |  7 +++
 src/include/executor/nodeIndexscan.h     |  6 +++
 src/include/storage/streaming_read.h     |  2 +
 8 files changed, 151 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be..e5e13e92d8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -127,9 +127,22 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr && scan->do_pgsr)
+		{
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, (void **) &tid);
+			if (!BufferIsValid(hscan->xs_cbuf))
+				return false;
+		}
+		else
+		{
+			ItemPointerSet(&scan->tid_queue, InvalidBlockNumber, InvalidOffsetNumber);
+			if (!ItemPointerIsValid(tid))
+				return false;
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+													hscan->xs_base.rel,
+													ItemPointerGetBlockNumber(tid));
+		}
+
 
 		/*
 		 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 63dff101e2..f247a1d2d3 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -211,6 +211,29 @@ index_insert_cleanup(Relation indexRelation,
 		indexRelation->rd_indam->aminsertcleanup(indexInfo);
 }
 
+static ItemPointerData
+index_tid_dequeue(ItemPointer tid_queue)
+{
+	ItemPointerData result = *tid_queue;
+	ItemPointerSet(tid_queue, InvalidBlockNumber, InvalidOffsetNumber);
+	return result;
+}
+
+
+static BlockNumber
+index_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
+{
+	IndexFetchTableData *scan = (IndexFetchTableData *) pgsr_private;
+	ItemPointerData data = index_tid_dequeue(&scan->tid_queue);
+
+	ItemPointer dest = per_buffer_data;
+	*dest = data;
+
+	if (!ItemPointerIsValid(&data))
+		return InvalidBlockNumber;
+	return ItemPointerGetBlockNumber(&data);
+}
+
 /*
  * index_beginscan - start a scan of an index with amgettuple
  *
@@ -236,7 +259,22 @@ index_beginscan(Relation heapRelation,
 	scan->xs_snapshot = snapshot;
 
 	/* prepare to fetch index matches from table */
+	scan->index_done = false;
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+	ItemPointerSet(&scan->xs_heapfetch->tid_queue, InvalidBlockNumber,
+			InvalidOffsetNumber);
+
+	// TODO: can't put this here bc not AM agnostic
+	scan->xs_heapfetch->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+												scan->xs_heapfetch,
+												sizeof(ItemPointerData),
+												NULL,
+												BMR_REL(scan->heapRelation),
+												MAIN_FORKNUM,
+												index_pgsr_next_single);
+
+	pg_streaming_read_set_resumable(scan->xs_heapfetch->pgsr);
+	scan->xs_heapfetch->do_pgsr = false;
 
 	return scan;
 }
@@ -525,6 +563,9 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
+	ItemPointerSet(&scan->xs_heapfetch->tid_queue, InvalidBlockNumber,
+			InvalidOffsetNumber);
+	scan->xs_heapfetch->do_pgsr = false;
 
 	return scan;
 }
@@ -566,6 +607,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 		if (scan->xs_heapfetch)
 			table_index_fetch_reset(scan->xs_heapfetch);
 
+		ItemPointerSet(&scan->xs_heaptid, InvalidBlockNumber, InvalidOffsetNumber);
 		return NULL;
 	}
 	Assert(ItemPointerIsValid(&scan->xs_heaptid));
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2c2c9c10b5..7979ecf1e4 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -111,17 +111,36 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_NumScanKeys,
 						 node->ioss_OrderByKeys,
 						 node->ioss_NumOrderByKeys);
+
+		scandesc->xs_heapfetch->do_pgsr = true;
 	}
 
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (true)
 	{
 		bool		tuple_from_heap = false;
 
 		CHECK_FOR_INTERRUPTS();
 
+		Assert(!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+		do
+		{
+			tid = index_getnext_tid(scandesc, direction);
+
+			if (!tid)
+			{
+				scandesc->index_done = true;
+				break;
+			}
+
+			index_tid_enqueue(tid, &scandesc->xs_heapfetch->tid_queue);
+		} while (!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+
+		if (!tid && TID_QUEUE_EMPTY(&scandesc->xs_heapfetch->tid_queue))
+			break;
+
 		/*
 		 * We can skip the heap fetch if the TID references a heap page on
 		 * which all tuples are known visible to everybody.  In any case,
@@ -156,7 +175,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+		if (!tid || !VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
 		{
@@ -187,6 +206,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
 
 			tuple_from_heap = true;
 		}
+		else
+			ItemPointerSet(&scandesc->xs_heapfetch->tid_queue,
+					InvalidBlockNumber, InvalidOffsetNumber);
 
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a94..be91854436 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -77,6 +77,16 @@ static HeapTuple reorderqueue_pop(IndexScanState *node);
  *		using the index specified in the IndexScanState information.
  * ----------------------------------------------------------------
  */
+
+void
+index_tid_enqueue(ItemPointer tid, ItemPointer tid_queue)
+{
+	Assert(!ItemPointerIsValid(tid_queue));
+
+	ItemPointerSet(tid_queue, ItemPointerGetBlockNumber(tid),
+			ItemPointerGetOffsetNumber(tid));
+}
+
 static TupleTableSlot *
 IndexNext(IndexScanState *node)
 {
@@ -123,31 +133,54 @@ IndexNext(IndexScanState *node)
 			index_rescan(scandesc,
 						 node->iss_ScanKeys, node->iss_NumScanKeys,
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		scandesc->xs_heapfetch->do_pgsr = true;
 	}
 
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+
+	while (true)
 	{
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * If the index was lossy, we have to recheck the index quals using
-		 * the fetched tuple.
-		 */
-		if (scandesc->xs_recheck)
+		Assert(!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+		do
 		{
-			econtext->ecxt_scantuple = slot;
-			if (!ExecQualAndReset(node->indexqualorig, econtext))
+			ItemPointer tid = index_getnext_tid(scandesc, direction);
+
+			if (!tid)
 			{
-				/* Fails recheck, so drop it and loop back for another */
-				InstrCountFiltered2(node, 1);
-				continue;
+				scandesc->index_done = true;
+				break;
 			}
+
+			index_tid_enqueue(tid, &scandesc->xs_heapfetch->tid_queue);
+		} while (!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+
+		if (index_fetch_heap(scandesc, slot))
+		{
+			/*
+			 * If the index was lossy, we have to recheck the index quals using
+			 * the fetched tuple.
+			 */
+			if (scandesc->xs_recheck)
+			{
+				econtext->ecxt_scantuple = slot;
+				if (!ExecQualAndReset(node->indexqualorig, econtext))
+				{
+					/* Fails recheck, so drop it and loop back for another */
+					InstrCountFiltered2(node, 1);
+					continue;
+				}
+			}
+
+			return slot;
 		}
 
-		return slot;
+		if (scandesc->index_done)
+			break;
 	}
 
 	/*
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fe..6465963f83 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -34,6 +34,7 @@ struct PgStreamingRead
 	int			pinned_buffers_trigger;
 	int			next_tail_buffer;
 	bool		finished;
+	bool		resumable;
 	void	   *pgsr_private;
 	PgStreamingReadBufferCB callback;
 	BufferAccessStrategy strategy;
@@ -292,7 +293,8 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
 		if (blocknum == InvalidBlockNumber)
 		{
-			pgsr->finished = true;
+			if (!pgsr->resumable)
+				pgsr->finished = true;
 			break;
 		}
 		bmr = pgsr->bmr;
@@ -433,3 +435,9 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 		pfree(pgsr->per_buffer_data);
 	pfree(pgsr);
 }
+
+void
+pg_streaming_read_set_resumable(PgStreamingRead *pgsr)
+{
+	pgsr->resumable = true;
+}
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..ade7f59946 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -18,6 +18,7 @@
 #include "access/itup.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/streaming_read.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
 
@@ -104,6 +105,9 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	PgStreamingRead *pgsr;
+	ItemPointerData tid_queue;
+	bool do_pgsr;
 } IndexFetchTableData;
 
 /*
@@ -162,6 +166,9 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	bool	index_done;
+
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 3cddece67c..7dbff789e9 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -44,4 +44,10 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
 								   IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
 extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
 
+#define TID_QUEUE_FULL(tid_queue) (ItemPointerIsValid(tid_queue))
+/* If it were a real queue, empty and full wouldn't be opposites */
+#define TID_QUEUE_EMPTY(tid_queue) (!ItemPointerIsValid(tid_queue))
+
+extern void index_tid_enqueue(ItemPointer tid, ItemPointer tid_queue);
+
 #endif							/* NODEINDEXSCAN_H */
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c54..2288b7b5eb 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -42,4 +42,6 @@ extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
 
+extern void pg_streaming_read_set_resumable(PgStreamingRead *pgsr);
+
 #endif
-- 
2.37.2
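
The one-element "queue" used by the 2/2 patch above is easy to misread, so
here is a standalone sketch of its semantics under stated assumptions.
MockTid and the mock_/tid_ helpers are hypothetical stand-ins for
ItemPointerData, ItemPointerIsValid(), index_tid_enqueue() and
index_tid_dequeue(); this is not code from the patch. It assumes validity
is decided by a nonzero offset, and it shows why TID_QUEUE_FULL and
TID_QUEUE_EMPTY are exact opposites while there is only one slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct MockTid
{
	uint32_t	block;
	uint16_t	offset;
} MockTid;

#define MOCK_INVALID_BLOCK	0xFFFFFFFFu
#define MOCK_INVALID_OFFSET 0

/* Assumption: a TID is valid when its offset is nonzero. */
bool
mock_tid_is_valid(const MockTid *tid)
{
	return tid->offset != MOCK_INVALID_OFFSET;
}

/* With a single slot, "full" means the slot holds a valid TID. */
bool
tid_queue_full(const MockTid *queue)
{
	return mock_tid_is_valid(queue);
}

bool
tid_queue_empty(const MockTid *queue)
{
	return !mock_tid_is_valid(queue);
}

/* Mark the slot as empty by storing an invalid TID. */
void
tid_queue_init(MockTid *queue)
{
	queue->block = MOCK_INVALID_BLOCK;
	queue->offset = MOCK_INVALID_OFFSET;
}

/* Store one TID; the slot must be empty. */
void
tid_enqueue(MockTid *queue, MockTid tid)
{
	assert(tid_queue_empty(queue));
	*queue = tid;
}

/* Take whatever is in the slot (possibly an invalid TID) and clear it. */
MockTid
tid_dequeue(MockTid *queue)
{
	MockTid		result = *queue;

	tid_queue_init(queue);
	return result;
}
```

Dequeuing from an empty slot yields an invalid TID in this sketch, which
corresponds to the case where the patch's callback returns
InvalidBlockNumber and the stream relies on the resumable flag to avoid
being marked finished.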



Attachments:

  [text/plain] v3-0001-Streaming-Read-API.txt (55.9K, 2-v3-0001-Streaming-Read-API.txt)
  download | inline diff:
From f6cb591ba520351ab7f0e7cbf9d6df3dacda6b44 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v3 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..9617bf130b 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..8775b5789b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7..13e5376619 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885..5d843b6142 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..19605090fe
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a
+	 * full-sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0.  We want to allow at least
+	 * one I/O, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue advice
+	 * to the kernel to get the read started.  We suppress it if the access
+	 * pattern appears to be completely sequential, though, because on some
+	 * systems that interferes with the kernel's own sequential read-ahead
+	 * heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..2157a97b97 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8da..717b8f58da 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..0d7272e796 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..a38f1acb37 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..d8ffe397fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..40c3408c54
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff..6636cc82c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29fd1cae64..018ebbcbaa 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2



  [text/plain] v3-0002-use-streaming-reads-in-index-scan.txt (11.4K, 3-v3-0002-use-streaming-reads-in-index-scan.txt)
  download | inline diff:
From 339b39d7f2cc5fc4bd6aa3429a12e6f3a4f9d2db Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Fri, 19 Jan 2024 16:10:30 -0500
Subject: [PATCH v3 2/2] use streaming reads in index scan

ci-os-only:
---
 src/backend/access/heap/heapam_handler.c | 19 ++++++--
 src/backend/access/index/indexam.c       | 42 +++++++++++++++++
 src/backend/executor/nodeIndexonlyscan.c | 26 ++++++++++-
 src/backend/executor/nodeIndexscan.c     | 57 +++++++++++++++++++-----
 src/backend/storage/aio/streaming_read.c | 10 ++++-
 src/include/access/relscan.h             |  7 +++
 src/include/executor/nodeIndexscan.h     |  6 +++
 src/include/storage/streaming_read.h     |  2 +
 8 files changed, 151 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be..e5e13e92d8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -127,9 +127,22 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr && scan->do_pgsr)
+		{
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, (void **) &tid);
+			if (!BufferIsValid(hscan->xs_cbuf))
+				return false;
+		}
+		else
+		{
+			ItemPointerSet(&scan->tid_queue, InvalidBlockNumber, InvalidOffsetNumber);
+			if (!ItemPointerIsValid(tid))
+				return false;
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+													hscan->xs_base.rel,
+													ItemPointerGetBlockNumber(tid));
+		}
+
 
 		/*
 		 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 63dff101e2..f247a1d2d3 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -211,6 +211,29 @@ index_insert_cleanup(Relation indexRelation,
 		indexRelation->rd_indam->aminsertcleanup(indexInfo);
 }
 
+static ItemPointerData
+index_tid_dequeue(ItemPointer tid_queue)
+{
+	ItemPointerData result = *tid_queue;
+	ItemPointerSet(tid_queue, InvalidBlockNumber, InvalidOffsetNumber);
+	return result;
+}
+
+
+static BlockNumber
+index_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
+{
+	IndexFetchTableData *scan = (IndexFetchTableData *) pgsr_private;
+	ItemPointerData data = index_tid_dequeue(&scan->tid_queue);
+
+	ItemPointer dest = per_buffer_data;
+	*dest = data;
+
+	if (!ItemPointerIsValid(&data))
+		return InvalidBlockNumber;
+	return ItemPointerGetBlockNumber(&data);
+}
+
 /*
  * index_beginscan - start a scan of an index with amgettuple
  *
@@ -236,7 +259,22 @@ index_beginscan(Relation heapRelation,
 	scan->xs_snapshot = snapshot;
 
 	/* prepare to fetch index matches from table */
+	scan->index_done = false;
 	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+	ItemPointerSet(&scan->xs_heapfetch->tid_queue, InvalidBlockNumber,
+			InvalidOffsetNumber);
+
+	// TODO: can't put this here bc not AM agnostic
+	scan->xs_heapfetch->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+												scan->xs_heapfetch,
+												sizeof(ItemPointerData),
+												NULL,
+												BMR_REL(scan->heapRelation),
+												MAIN_FORKNUM,
+												index_pgsr_next_single);
+
+	pg_streaming_read_set_resumable(scan->xs_heapfetch->pgsr);
+	scan->xs_heapfetch->do_pgsr = false;
 
 	return scan;
 }
@@ -525,6 +563,9 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 
 	/* prepare to fetch index matches from table */
 	scan->xs_heapfetch = table_index_fetch_begin(heaprel);
+	ItemPointerSet(&scan->xs_heapfetch->tid_queue, InvalidBlockNumber,
+			InvalidOffsetNumber);
+	scan->xs_heapfetch->do_pgsr = false;
 
 	return scan;
 }
@@ -566,6 +607,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 		if (scan->xs_heapfetch)
 			table_index_fetch_reset(scan->xs_heapfetch);
 
+		ItemPointerSet(&scan->xs_heaptid, InvalidBlockNumber, InvalidOffsetNumber);
 		return NULL;
 	}
 	Assert(ItemPointerIsValid(&scan->xs_heaptid));
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2c2c9c10b5..7979ecf1e4 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -111,17 +111,36 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_NumScanKeys,
 						 node->ioss_OrderByKeys,
 						 node->ioss_NumOrderByKeys);
+
+		scandesc->xs_heapfetch->do_pgsr = true;
 	}
 
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while (true)
 	{
 		bool		tuple_from_heap = false;
 
 		CHECK_FOR_INTERRUPTS();
 
+		Assert(!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+		do
+		{
+			tid = index_getnext_tid(scandesc, direction);
+
+			if (!tid)
+			{
+				scandesc->index_done = true;
+				break;
+			}
+
+			index_tid_enqueue(tid, &scandesc->xs_heapfetch->tid_queue);
+		} while (!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+
+		if (!tid && TID_QUEUE_EMPTY(&scandesc->xs_heapfetch->tid_queue))
+			break;
+
 		/*
 		 * We can skip the heap fetch if the TID references a heap page on
 		 * which all tuples are known visible to everybody.  In any case,
@@ -156,7 +175,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * It's worth going through this complexity to avoid needing to lock
 		 * the VM buffer, which could cause significant contention.
 		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
+		if (!tid || !VM_ALL_VISIBLE(scandesc->heapRelation,
 							ItemPointerGetBlockNumber(tid),
 							&node->ioss_VMBuffer))
 		{
@@ -187,6 +206,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
 
 			tuple_from_heap = true;
 		}
+		else
+			ItemPointerSet(&scandesc->xs_heapfetch->tid_queue,
+					InvalidBlockNumber, InvalidOffsetNumber);
 
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a94..be91854436 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -77,6 +77,16 @@ static HeapTuple reorderqueue_pop(IndexScanState *node);
  *		using the index specified in the IndexScanState information.
  * ----------------------------------------------------------------
  */
+
+void
+index_tid_enqueue(ItemPointer tid, ItemPointer tid_queue)
+{
+	Assert(!ItemPointerIsValid(tid_queue));
+
+	ItemPointerSet(tid_queue, ItemPointerGetBlockNumber(tid),
+			ItemPointerGetOffsetNumber(tid));
+}
+
 static TupleTableSlot *
 IndexNext(IndexScanState *node)
 {
@@ -123,31 +133,54 @@ IndexNext(IndexScanState *node)
 			index_rescan(scandesc,
 						 node->iss_ScanKeys, node->iss_NumScanKeys,
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		scandesc->xs_heapfetch->do_pgsr = true;
 	}
 
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+
+	while (true)
 	{
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * If the index was lossy, we have to recheck the index quals using
-		 * the fetched tuple.
-		 */
-		if (scandesc->xs_recheck)
+		Assert(!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+		do
 		{
-			econtext->ecxt_scantuple = slot;
-			if (!ExecQualAndReset(node->indexqualorig, econtext))
+			ItemPointer tid = index_getnext_tid(scandesc, direction);
+
+			if (!tid)
 			{
-				/* Fails recheck, so drop it and loop back for another */
-				InstrCountFiltered2(node, 1);
-				continue;
+				scandesc->index_done = true;
+				break;
 			}
+
+			index_tid_enqueue(tid, &scandesc->xs_heapfetch->tid_queue);
+		} while (!TID_QUEUE_FULL(&scandesc->xs_heapfetch->tid_queue));
+
+		if (index_fetch_heap(scandesc, slot))
+		{
+			/*
+			* If the index was lossy, we have to recheck the index quals using
+			* the fetched tuple.
+			*/
+			if (scandesc->xs_recheck)
+			{
+				econtext->ecxt_scantuple = slot;
+				if (!ExecQualAndReset(node->indexqualorig, econtext))
+				{
+					/* Fails recheck, so drop it and loop back for another */
+					InstrCountFiltered2(node, 1);
+					continue;
+				}
+			}
+
+			return slot;
 		}
 
-		return slot;
+		if (scandesc->index_done)
+			break;
 	}
 
 	/*
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fe..6465963f83 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -34,6 +34,7 @@ struct PgStreamingRead
 	int			pinned_buffers_trigger;
 	int			next_tail_buffer;
 	bool		finished;
+	bool		resumable;
 	void	   *pgsr_private;
 	PgStreamingReadBufferCB callback;
 	BufferAccessStrategy strategy;
@@ -292,7 +293,8 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
 		if (blocknum == InvalidBlockNumber)
 		{
-			pgsr->finished = true;
+			if (!pgsr->resumable)
+				pgsr->finished = true;
 			break;
 		}
 		bmr = pgsr->bmr;
@@ -433,3 +435,9 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 		pfree(pgsr->per_buffer_data);
 	pfree(pgsr);
 }
+
+void
+pg_streaming_read_set_resumable(PgStreamingRead *pgsr)
+{
+	pgsr->resumable = true;
+}
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..ade7f59946 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -18,6 +18,7 @@
 #include "access/itup.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/streaming_read.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
 
@@ -104,6 +105,9 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	PgStreamingRead *pgsr;
+	ItemPointerData tid_queue;
+	bool do_pgsr;
 } IndexFetchTableData;
 
 /*
@@ -162,6 +166,9 @@ typedef struct IndexScanDescData
 	bool	   *xs_orderbynulls;
 	bool		xs_recheckorderby;
 
+	bool	index_done;
+
+
 	/* parallel index scan information, in shared memory */
 	struct ParallelIndexScanDescData *parallel_scan;
 }			IndexScanDescData;
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 3cddece67c..7dbff789e9 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -44,4 +44,10 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
 								   IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
 extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
 
+#define TID_QUEUE_FULL(tid_queue) (ItemPointerIsValid(tid_queue))
+/* If it were a real queue empty and full wouldn't be opposites */
+#define TID_QUEUE_EMPTY(tid_queue) (!ItemPointerIsValid(tid_queue))
+
+extern void index_tid_enqueue(ItemPointer tid, ItemPointer tid_queue);
+
 #endif							/* NODEINDEXSCAN_H */
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c54..2288b7b5eb 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -42,4 +42,6 @@ extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
 
+extern void pg_streaming_read_set_resumable(PgStreamingRead *pgsr);
+
 #endif
-- 
2.37.2



^ permalink  raw  reply  [nested|flat] 25+ messages in thread

* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-23 17:43                 ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-24 00:51                   ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-24 09:19                     ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-24 20:20                       ` Re: index prefetching Melanie Plageman <[email protected]>
@ 2024-02-07 21:48                         ` Melanie Plageman <[email protected]>
  2024-02-13 19:00                           ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-02-14 07:10                           ` Re: index prefetching Robert Haas <[email protected]>
  0 siblings, 2 replies; 25+ messages in thread

From: Melanie Plageman @ 2024-02-07 21:48 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Wed, Jan 24, 2024 at 3:20 PM Melanie Plageman
<[email protected]> wrote:
>
> On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
> <[email protected]> wrote:
> >
> > On 1/24/24 01:51, Melanie Plageman wrote:
> > >> But I'm not sure what to do about optimizations that are more specific
> > >> to the access path. Consider for example the index-only scans. We don't
> > >> want to prefetch all the pages, we need to inspect the VM and prefetch
> > >> just the not-all-visible ones. And then pass the info to the index scan,
> > >> so that it does not need to check the VM again. It's not clear to me how
> > >> to do this with this approach.
> > >
> > > Yea, this is an issue I'll need to think about. To really spell out
> > > the problem: the callback dequeues a TID from the tid_queue and looks
> > > up its block in the VM. It's all visible. So, it shouldn't return that
> > > block to the streaming read API to fetch from the heap because it
> > > doesn't need to be read. But, where does the callback put the TID so
> > > that the caller can get it? I'm going to think more about this.
> > >
> >
> > Yes, that's the problem for index-only scans. I'd generalize it so that
> > it's about the callback being able to (a) decide if it needs to read the
> > heap page, and (b) store some custom info for the TID.
>
> Actually, I think this is no big deal. See attached. I just don't
> enqueue tids whose blocks are all visible. I had to switch the order
> from fetch heap then fill queue to fill queue then fetch heap.
>
> While doing this I noticed some wrong results in the regression tests
> (like in the alter table test), so I suspect I have some kind of
> control flow issue. Perhaps I should fix the resource leak so I can
> actually see the failing tests :)

Attached is a patch which implements a real queue and fixes some of
the issues with the previous version. It doesn't pass tests yet and
has issues. Some are bugs in my implementation I need to fix. Some are
issues we would need to solve in the streaming read API. Some are
issues with index prefetching generally.
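
As a rough illustration only (this is not the code in the attached
patch), a "real" TID queue could be a fixed-size ring buffer. Unlike
the one-slot queue in v3, empty and full are then distinct states. All
names below (SketchTid, SketchTidQueue, the sketch_queue_* functions,
and the capacity of 4) are hypothetical stand-ins, not PostgreSQL
identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: heap block plus offset. */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

#define SKETCH_QUEUE_CAPACITY 4

/* Fixed-size ring buffer: head is the oldest entry, count the entries in use. */
typedef struct SketchTidQueue
{
	SketchTid	items[SKETCH_QUEUE_CAPACITY];
	int			head;
	int			count;
} SketchTidQueue;

static void
sketch_queue_init(SketchTidQueue *q)
{
	q->head = 0;
	q->count = 0;
}

static bool
sketch_queue_empty(const SketchTidQueue *q)
{
	return q->count == 0;
}

static bool
sketch_queue_full(const SketchTidQueue *q)
{
	return q->count == SKETCH_QUEUE_CAPACITY;
}

/* Append at the tail; the caller must have checked for fullness. */
static void
sketch_queue_enqueue(SketchTidQueue *q, SketchTid tid)
{
	assert(!sketch_queue_full(q));
	q->items[(q->head + q->count) % SKETCH_QUEUE_CAPACITY] = tid;
	q->count++;
}

/* Remove the oldest entry; returns false if the queue is empty. */
static bool
sketch_queue_dequeue(SketchTidQueue *q, SketchTid *out)
{
	if (sketch_queue_empty(q))
		return false;
	*out = q->items[q->head];
	q->head = (q->head + 1) % SKETCH_QUEUE_CAPACITY;
	q->count--;
	return true;
}
```

With a layout like this, the executor-side loop would enqueue TIDs
until sketch_queue_full() and the streaming read callback would
dequeue them one at a time, in index order.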

Note that these two patches have to be applied to a tree from before
commit 21d9c3ee4e because Thomas hasn't released a rebased version of
the streaming read API patches yet.

Issues
---
- kill prior tuple

This optimization doesn't work with index prefetching with the current
design. Kill prior tuple relies on alternating between fetching a
single index tuple and visiting the heap. After visiting the heap we
can potentially kill the immediately preceding index tuple. Once we
fetch multiple index tuples, enqueue their TIDs, and later visit the
heap, the next index page we visit may not contain all of the index
tuples deemed killable by our visit to the heap.

In our case, we could try to fix this by prefetching only heap blocks
referred to by index tuples on the same index page. Or we could try
to keep a pool of index pages pinned and go back and kill index
tuples on those pages.

Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
there is an easier way to fix this, as I don't think the mvcc test
failed on Tomas' version.

- switching scan directions

If the index scan switches directions on a given invocation of
IndexNext(), heap blocks may have already been prefetched and read for
blocks containing tuples beyond the point at which we want to switch
directions.

We could fix this by having some kind of streaming read "reset"
callback to drop all of the buffers which have been prefetched which
are now no longer needed. We'd have to go backwards from the last TID
which was yielded to the caller and figure out which buffers in the
pgsr buffer ranges are associated with all of the TIDs which were
prefetched after that TID. The TIDs are in the per_buffer_data
associated with each buffer in pgsr. The issue would be searching
through those efficiently.

The other issue is that the streaming read API does not currently
support backwards scans. So, if we switch to a backwards scan from a
forwards scan, we would need to fall back to the non-streaming read
method. We could do this by just setting the TID queue size to 1
(which is what I have currently implemented). Or we could add
backwards scan support to the streaming read API.

- mark and restore

Similar to the issue with switching the scan direction, mark and
restore requires us to reset the TID queue and streaming read queue.
For now, I've hacked in something to the PlannerInfo and Plan to set
the TID queue size to 1 for plans containing a merge join (yikes).

- multiple executions

For reasons I don't entirely understand yet, multiple executions (not
rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
patch, I have disabled prefetching (and made the TID queue size 1)
when execute_once is false.
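
The three fallbacks described so far (backwards scans, mark and restore,
and multiple executions) all come down to the same decision: use a TID
queue of size 1 whenever look-ahead is not safe. A minimal standalone
sketch of that decision follows. DemoScanProps and demo_tid_queue_size()
are hypothetical names for illustration only, not anything in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the scan properties discussed above. */
typedef struct DemoScanProps
{
	bool		backward;			/* scan direction is backwards */
	bool		needs_mark_restore; /* e.g. plan contains a merge join */
	bool		execute_once;		/* ExecutorRun(...execute_once) */
} DemoScanProps;

/*
 * Pick the TID queue size.  A size of 1 means "no look-ahead", i.e. the
 * behavior degenerates to reading one block at a time.
 */
static int
demo_tid_queue_size(const DemoScanProps *props, int desired)
{
	assert(desired >= 1);

	if (props->backward || props->needs_mark_restore || !props->execute_once)
		return 1;

	return desired;
}
```

Keeping the decision in one place like this would at least make it easy to
remove each condition as the underlying limitation is fixed.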

- Index Only Scans need to return IndexTuples

Because index only scans return either the IndexTuple pointed to by
IndexScanDesc->xs_itup or the HeapTuple pointed to by
IndexScanDesc->xs_hitup -- both of which are populated by the index
AM -- we have to save copies of those IndexTupleDatas and
HeapTupleDatas for every TID whose block we prefetch.

This might be okay, but it is a bit sad to have to make copies of those tuples.

In this patch, I still haven't figured out the memory management part.
I copy over the tuples when enqueuing a TID queue item and then copy
them back again when the streaming read API returns the
per_buffer_data to us. Something is still not quite right here. I
suspect this is part of the reason why some of the other tests are
failing.

Other issues/gaps in my implementation:

Determining where to allocate the memory for the streaming read object
and the TID queue is an outstanding TODO. To implement a fallback
method for cases in which streaming read doesn't work, I set the queue
size to 1. This is obviously not good.

Right now, I allocate the TID queue and streaming read objects in
IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
index_beginscan() (and index_beginscan_parallel()) is tricky though
because we don't know the scan direction at that point (and the scan
direction can change). There are also callers of index_beginscan() who
do not call Index[Only]Next() (like systable_getnext() which calls
index_getnext_slot() directly).

Also, my implementation does not yet have the optimization Tomas does
to skip prefetching recently prefetched blocks. As he has said, it
probably makes sense to add something to do this in a lower layer --
such as in the streaming read API or even in bufmgr.c (maybe in
PrefetchSharedBuffer()).
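
For reference, the "skip recently prefetched blocks" idea can be sketched
as a tiny ring of recent block numbers. This is a standalone illustration
with hypothetical names (DemoRecentBlocks, demo_should_prefetch()), not the
code from Tomas' patch, and the size of 4 is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RECENT_BLOCKS 4
#define DEMO_INVALID_BLOCK UINT32_MAX

/* Hypothetical ring of the most recently prefetched block numbers. */
typedef struct DemoRecentBlocks
{
	uint32_t	blocks[DEMO_RECENT_BLOCKS];
	unsigned	next;			/* slot to overwrite next */
} DemoRecentBlocks;

static void
demo_recent_init(DemoRecentBlocks *r)
{
	for (unsigned i = 0; i < DEMO_RECENT_BLOCKS; i++)
		r->blocks[i] = DEMO_INVALID_BLOCK;
	r->next = 0;
}

/*
 * Returns true if the caller should issue a prefetch for this block, and
 * remembers the block.  Returns false if the block was seen recently.
 */
static bool
demo_should_prefetch(DemoRecentBlocks *r, uint32_t block)
{
	assert(block != DEMO_INVALID_BLOCK);

	for (unsigned i = 0; i < DEMO_RECENT_BLOCKS; i++)
	{
		if (r->blocks[i] == block)
			return false;
	}

	/* Not seen recently: overwrite the oldest slot. */
	r->blocks[r->next] = block;
	r->next = (r->next + 1) % DEMO_RECENT_BLOCKS;
	return true;
}
```

A linear search is fine for a handful of entries. A lower layer would
presumably want something keyed on the full buffer tag rather than just
the block number.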

- Melanie


Attachments:

  [text/x-patch] v4-0001-Streaming-Read-API.patch (55.9K, 2-v4-0001-Streaming-Read-API.patch)
From 550e3a4b55eb0f3edc0f8c4f691cff134b256371 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v4 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..9617bf130b 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..8775b5789b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7..13e5376619 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885..5d843b6142 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..19605090fe
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we have a chance to build up a
+	 * full-sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0.  We want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue
+	 * advice to the kernel to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interferes with the kernel's own sequential read
+	 * ahead heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..2157a97b97 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8da..717b8f58da 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..0d7272e796 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..a38f1acb37 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..d8ffe397fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..40c3408c54
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff..6636cc82c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7e866e3c3d..0e34145187 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2093,6 +2093,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2
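
For readers skimming the patch above: the core of CompleteReadBuffers() is a
loop that skips blocks that are already valid and merges each run of
neighboring blocks that still need I/O into a single vectored read. Below is
a minimal standalone sketch of that grouping only, with no buffer manager
involved. The names SketchRun, sketch_coalesce_runs and SKETCH_MAX_RUN are
hypothetical and do not appear in the patch; the explicit run-length cap is
an assumption of this sketch (it stands in for MAX_BUFFERS_PER_TRANSFER), and
the boolean array stands in for what StartBufferIO() would report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for MAX_BUFFERS_PER_TRANSFER (16 at 8kB blocks). */
#define SKETCH_MAX_RUN 16

/* One vectored read: 'len' consecutive blocks starting at offset 'first'. */
typedef struct SketchRun
{
	int			first;
	int			len;
} SketchRun;

/*
 * Group the blocks that still need reading into runs of neighbors.
 *
 * needs_read[i] is true if block i has not been read yet; it plays the role
 * of "we were able to start I/O on this buffer".  Runs are written to 'out'
 * (at most max_runs of them) and the number of runs is returned.
 */
static int
sketch_coalesce_runs(const bool *needs_read, int nblocks,
					 SketchRun *out, int max_runs)
{
	int			nruns = 0;

	assert(nblocks >= 0 && max_runs >= 0);

	for (int i = 0; i < nblocks; ++i)
	{
		int			len;

		/* Someone else already completed this block: nothing to read. */
		if (!needs_read[i])
			continue;

		if (nruns == max_runs)
			break;

		/* Head of a new run; extend it over following neighbors. */
		out[nruns].first = i;
		len = 1;
		while (i + 1 < nblocks && needs_read[i + 1] && len < SKETCH_MAX_RUN)
		{
			++i;
			++len;
		}
		out[nruns].len = len;
		++nruns;
	}

	return nruns;
}
```

With the flags {1, 1, 0, 1, 0, 0, 1, 1, 1} this yields three runs, starting
at offsets 0, 3 and 6 with lengths 2, 1 and 3. That matches the comment in
the patch saying that zero or multiple physical reads may be performed when
some blocks have already been read in the meantime.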



  [text/x-patch] v4-0002-Index-scans-prefetch-with-streaming-read-API.patch (24.8K, 3-v4-0002-Index-scans-prefetch-with-streaming-read-API.patch)
From ec809e08fe59ffc3eaee772d2269cb47f365c0a6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 7 Feb 2024 12:47:51 -0500
Subject: [PATCH v4 2/2] Index scans prefetch with streaming read API

---
 src/backend/access/heap/heapam_handler.c |  55 +++++++-
 src/backend/access/index/indexam.c       | 104 ++++++++++++++-
 src/backend/executor/execMain.c          |   4 +
 src/backend/executor/nodeIndexonlyscan.c | 153 ++++++++++++++++-------
 src/backend/executor/nodeIndexscan.c     |  61 ++++++++-
 src/backend/optimizer/plan/createplan.c  |   1 +
 src/backend/optimizer/plan/planner.c     |   3 +
 src/backend/storage/aio/streaming_read.c |  11 +-
 src/include/access/genam.h               |  41 ++++++
 src/include/access/relscan.h             |  11 ++
 src/include/nodes/execnodes.h            |   1 +
 src/include/nodes/pathnodes.h            |   1 +
 src/include/nodes/plannodes.h            |   1 +
 src/include/storage/streaming_read.h     |   2 +
 src/tools/pgindent/typedefs.list         |   2 +
 15 files changed, 401 insertions(+), 50 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be..03e6b522a8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -108,6 +108,12 @@ heapam_index_fetch_end(IndexFetchTableData *scan)
 	pfree(hscan);
 }
 
+/*
+ * For users of the streaming read API, tid is an output parameter set to the
+ * latest TID obtained from the streaming read API.  For non-streaming read
+ * users, tid is an input parameter and contains the next block to be read from
+ * the heap.
+ */
 static bool
 heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 						 ItemPointer tid,
@@ -127,9 +133,52 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr)
+		{
+			TIDQueueItem *result;
+
+			if (BufferIsValid(hscan->xs_cbuf))
+				ReleaseBuffer(hscan->xs_cbuf);
+
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, (void **) &result);
+			if (!BufferIsValid(hscan->xs_cbuf))
+			{
+				/*
+				 * Invalidate the item pointer to allow the caller to
+				 * distinguish between index_fetch_heap() returning false
+				 * because the tuple is not visible and because the streaming
+				 * read callback ran out of queue items.
+				 */
+				ItemPointerSetInvalid(tid);
+				return false;
+			}
+
+			/* Set this for use below */
+			*tid = result->tid;
+
+			scan->tid = result->tid;
+			scan->recheck = result->recheck;
+
+			if (scan->itup)
+				pfree(scan->itup);
+			scan->itup = NULL;
+
+			if (scan->htup)
+				pfree(scan->htup);
+			scan->htup = NULL;
+
+			if (result->itup)
+				scan->itup = CopyIndexTuple(result->itup);
+			if (result->htup)
+				scan->htup = heap_copytuple(result->htup);
+		}
+		else
+		{
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+												  hscan->xs_base.rel,
+												  ItemPointerGetBlockNumber(tid));
+		}
+
 
 		/*
 		 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index bbd499abcf..0bf50bcd83 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -248,6 +248,91 @@ index_insert_cleanup(Relation indexRelation,
 		indexRelation->rd_indam->aminsertcleanup(indexInfo);
 }
 
+void
+tid_queue_reset(TIDQueue *q)
+{
+	q->head = q->tail = 0;
+}
+
+TIDQueue *
+tid_queue_alloc(int size)
+{
+	TIDQueue   *result;
+
+	result = palloc(sizeof(TIDQueue) + (sizeof(TIDQueueItem) * size));
+	result->size = size;
+	tid_queue_reset(result);
+	return result;
+}
+
+static TIDQueueItem
+index_tid_dequeue(TIDQueue *tid_queue)
+{
+	TIDQueueItem result;
+
+	Assert(tid_queue->tail > tid_queue->head);
+	result = tid_queue->data[tid_queue->head % tid_queue->size];
+	tid_queue->head++;
+
+	return result;
+}
+
+void
+index_tid_enqueue(TIDQueue *tid_queue, ItemPointer tid, bool recheck,
+				  HeapTuple htup, IndexTuple itup)
+{
+	TIDQueueItem *cur;
+
+	Assert(tid_queue->tail >= tid_queue->head);
+	Assert(!TID_QUEUE_FULL(tid_queue));
+	cur = &tid_queue->data[tid_queue->tail % tid_queue->size];
+	ItemPointerSet(&cur->tid,
+				   ItemPointerGetBlockNumber(tid), ItemPointerGetOffsetNumber(tid));
+	cur->recheck = recheck;
+	cur->itup = NULL;
+	cur->htup = NULL;
+
+	if (itup)
+		cur->itup = CopyIndexTuple(itup);
+
+	if (htup)
+		cur->htup = heap_copytuple(htup);
+
+	tid_queue->tail++;
+}
+
+static BlockNumber
+index_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
+{
+	IndexScanDesc scan = pgsr_private;
+	TIDQueueItem *result = per_buffer_data;
+
+	scan->kill_prior_tuple = false;
+	scan->xs_heap_continue = false;
+
+	if (TID_QUEUE_EMPTY(scan->tid_queue))
+		return InvalidBlockNumber;
+
+	*result = index_tid_dequeue(scan->tid_queue);
+	return ItemPointerGetBlockNumber(&result->tid);
+}
+
+void
+index_pgsr_alloc(IndexScanDesc scan)
+{
+	if (scan->xs_heapfetch->pgsr)
+		pg_streaming_read_free(scan->xs_heapfetch->pgsr);
+	scan->xs_heapfetch->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+															  scan,
+															  sizeof(TIDQueueItem),
+															  NULL,
+															  BMR_REL(scan->heapRelation),
+															  MAIN_FORKNUM,
+															  index_pgsr_next_single);
+
+	pg_streaming_read_set_resumable(scan->xs_heapfetch->pgsr);
+}
+
 /*
  * index_beginscan - start a scan of an index with amgettuple
  *
@@ -333,6 +418,8 @@ index_beginscan_internal(Relation indexRelation,
 	/* Initialize information for parallel scan. */
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
+	scan->index_done = false;
+	scan->tid_queue = NULL;
 
 	return scan;
 }
@@ -364,8 +451,12 @@ index_rescan(IndexScanDesc scan,
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
+	scan->index_done = false;
+
 	scan->kill_prior_tuple = false; /* for safety */
 	scan->xs_heap_continue = false;
+	if (scan->tid_queue)
+		tid_queue_reset(scan->tid_queue);
 
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 											orderbys, norderbys);
@@ -384,10 +475,17 @@ index_endscan(IndexScanDesc scan)
 	/* Release resources (like buffer pins) from table accesses */
 	if (scan->xs_heapfetch)
 	{
+		if (scan->xs_heapfetch->pgsr)
+			pg_streaming_read_free(scan->xs_heapfetch->pgsr);
+		scan->xs_heapfetch->pgsr = NULL;
 		table_index_fetch_end(scan->xs_heapfetch);
 		scan->xs_heapfetch = NULL;
 	}
 
+	if (scan->tid_queue)
+		pfree(scan->tid_queue);
+	scan->tid_queue = NULL;
+
 	/* End the AM's scan */
 	scan->indexRelation->rd_indam->amendscan(scan);
 
@@ -530,6 +628,9 @@ index_parallelrescan(IndexScanDesc scan)
 	if (scan->xs_heapfetch)
 		table_index_fetch_reset(scan->xs_heapfetch);
 
+	if (scan->tid_queue)
+		tid_queue_reset(scan->tid_queue);
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_indam->amparallelrescan != NULL)
 		scan->indexRelation->rd_indam->amparallelrescan(scan);
@@ -603,6 +704,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 		if (scan->xs_heapfetch)
 			table_index_fetch_reset(scan->xs_heapfetch);
 
+		scan->index_done = true;
 		return NULL;
 	}
 	Assert(ItemPointerIsValid(&scan->xs_heaptid));
@@ -651,7 +753,7 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 	 * recovery because it may violate MVCC to do so.  See comments in
 	 * RelationGetIndexScan().
 	 */
-	if (!scan->xactStartedInRecovery)
+	if (!scan->xactStartedInRecovery && !scan->xs_heapfetch->pgsr)
 		scan->kill_prior_tuple = all_dead;
 
 	return found;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 13a9b7da83..9e951a69ab 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1652,6 +1652,10 @@ ExecutePlan(EState *estate,
 	if (!execute_once)
 		use_parallel_mode = false;
 
+	estate->es_use_prefetching = execute_once;
+	if (!planstate->plan->allow_prefetch)
+		estate->es_use_prefetching = false;
+
 	estate->es_use_parallel_mode = use_parallel_mode;
 	if (use_parallel_mode)
 		EnterParallelMode();
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 2c2c9c10b5..e3824c25e0 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -113,61 +113,109 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_NumOrderByKeys);
 	}
 
+	if (!scandesc->tid_queue)
+	{
+		/* Fall back to a queue size of 1 for now */
+		int			queue_size = 1;
+
+		if (estate->es_use_prefetching && ScanDirectionIsForward(direction))
+			queue_size = TID_QUEUE_SIZE;
+		scandesc->tid_queue = tid_queue_alloc(queue_size);
+		index_pgsr_alloc(scandesc);
+	}
+
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	for (;;)
 	{
-		bool		tuple_from_heap = false;
+		bool		tuple_from_heap = true;
 
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We can skip the heap fetch if the TID references a heap page on
-		 * which all tuples are known visible to everybody.  In any case,
-		 * we'll use the index tuple not the heap tuple as the data source.
-		 *
-		 * Note on Memory Ordering Effects: visibilitymap_get_status does not
-		 * lock the visibility map buffer, and therefore the result we read
-		 * here could be slightly stale.  However, it can't be stale enough to
-		 * matter.
-		 *
-		 * We need to detect clearing a VM bit due to an insert right away,
-		 * because the tuple is present in the index page but not visible. The
-		 * reading of the TID by this scan (using a shared lock on the index
-		 * buffer) is serialized with the insert of the TID into the index
-		 * (using an exclusive lock on the index buffer). Because the VM bit
-		 * is cleared before updating the index, and locking/unlocking of the
-		 * index page acts as a full memory barrier, we are sure to see the
-		 * cleared bit if we see a recently-inserted TID.
-		 *
-		 * Deletes do not update the index page (only VACUUM will clear out
-		 * the TID), so the clearing of the VM bit by a delete is not
-		 * serialized with this test below, and we may see a value that is
-		 * significantly stale. However, we don't care about the delete right
-		 * away, because the tuple is still visible until the deleting
-		 * transaction commits or the statement ends (if it's our
-		 * transaction). In either case, the lock on the VM buffer will have
-		 * been released (acting as a write barrier) after clearing the bit.
-		 * And for us to have a snapshot that includes the deleting
-		 * transaction (making the tuple invisible), we must have acquired
-		 * ProcArrayLock after that time, acting as a read barrier.
-		 *
-		 * It's worth going through this complexity to avoid needing to lock
-		 * the VM buffer, which could cause significant contention.
-		 */
-		if (!VM_ALL_VISIBLE(scandesc->heapRelation,
-							ItemPointerGetBlockNumber(tid),
-							&node->ioss_VMBuffer))
+		if (!scandesc->index_done)
+		{
+			while (!TID_QUEUE_FULL(scandesc->tid_queue))
+			{
+				if ((tid = index_getnext_tid(scandesc, direction)) == NULL)
+				{
+					scandesc->index_done = true;
+					break;
+				}
+
+				/*
+				 * We can skip the heap fetch if the TID references a heap
+				 * page on which all tuples are known visible to everybody. In
+				 * any case, we'll use the index tuple not the heap tuple as
+				 * the data source.
+				 *
+				 * Note on Memory Ordering Effects: visibilitymap_get_status
+				 * does not lock the visibility map buffer, and therefore the
+				 * result we read here could be slightly stale.  However, it
+				 * can't be stale enough to matter.
+				 *
+				 * We need to detect clearing a VM bit due to an insert right
+				 * away, because the tuple is present in the index page but
+				 * not visible. The reading of the TID by this scan (using a
+				 * shared lock on the index buffer) is serialized with the
+				 * insert of the TID into the index (using an exclusive lock
+				 * on the index buffer). Because the VM bit is cleared before
+				 * updating the index, and locking/unlocking of the index page
+				 * acts as a full memory barrier, we are sure to see the
+				 * cleared bit if we see a recently-inserted TID.
+				 *
+				 * Deletes do not update the index page (only VACUUM will
+				 * clear out the TID), so the clearing of the VM bit by a
+				 * delete is not serialized with this test below, and we may
+				 * see a value that is significantly stale. However, we don't
+				 * care about the delete right away, because the tuple is
+				 * still visible until the deleting transaction commits or the
+				 * statement ends (if it's our transaction). In either case,
+				 * the lock on the VM buffer will have been released (acting
+				 * as a write barrier) after clearing the bit. And for us to
+				 * have a snapshot that includes the deleting transaction
+				 * (making the tuple invisible), we must have acquired
+				 * ProcArrayLock after that time, acting as a read barrier.
+				 *
+				 * It's worth going through this complexity to avoid needing
+				 * to lock the VM buffer, which could cause significant
+				 * contention.
+				 */
+
+				if (VM_ALL_VISIBLE(scandesc->heapRelation,
+								   ItemPointerGetBlockNumber(tid),
+								   &node->ioss_VMBuffer))
+				{
+					tuple_from_heap = false;
+					break;
+				}
+
+				index_tid_enqueue(scandesc->tid_queue, tid, scandesc->xs_recheck,
+								  scandesc->xs_hitup, scandesc->xs_itup);
+			}
+		}
+
+		if (tuple_from_heap)
 		{
 			/*
 			 * Rats, we have to visit the heap to check visibility.
 			 */
 			InstrCountTuples2(node, 1);
+
 			if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
-				continue;		/* no visible tuple, try next index entry */
+			{
+				/*
+				 * Either there is no visible tuple or the streaming read ran
+				 * out of queue items and it is time to add more.
+				 */
+				if (ItemPointerIsValid(&scandesc->xs_heaptid))
+					continue;
 
-			ExecClearTuple(node->ioss_TableSlot);
+				if (!scandesc->index_done)
+					continue;
+
+				break;
+			}
 
 			/*
 			 * Only MVCC snapshots are supported here, so there should be no
@@ -185,15 +233,32 @@ IndexOnlyNext(IndexOnlyScanState *node)
 			 * entry might require a visit to the same heap page.
 			 */
 
-			tuple_from_heap = true;
+			/*
+			 * If we visit the underlying table, we need to reset the
+			 * IndexScanDesc's fields to match the per tuple state returned by
+			 * the streaming read API. The most recent index tuple fetched
+			 * will not necessarily match the current TID being processed
+			 * after returning from index_fetch_heap().
+			 */
+			scandesc->xs_recheck = scandesc->xs_heapfetch->recheck;
+			scandesc->xs_heaptid = scandesc->xs_heapfetch->tid;
+
+			scandesc->xs_hitup = scandesc->xs_heapfetch->htup;
+			scandesc->xs_itup = scandesc->xs_heapfetch->itup;
 		}
 
 		/*
 		 * Fill the scan tuple slot with data from the index.  This might be
-		 * provided in either HeapTuple or IndexTuple format.  Conceivably an
+		 * provided in either HeapTuple or IndexTuple format. Conceivably an
 		 * index AM might fill both fields, in which case we prefer the heap
-		 * format, since it's probably a bit cheaper to fill a slot from.
+		 * format, since it's probably a bit cheaper to fill a slot from. As
+		 * soon as we encounter a tuple from an all visible block, we stop
+		 * prefetching and yield the tuple. As such, we can use the IndexTuple
+		 * and HeapTuple that the index AM filled in the scan descriptor
+		 * instead of having to get them from the per tuple state yielded by
+		 * the streaming read API.
 		 */
+		ExecClearTuple(node->ioss_TableSlot);
 		if (scandesc->xs_hitup)
 		{
 			/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 03142b4a94..aa72479df1 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -125,13 +125,72 @@ IndexNext(IndexScanState *node)
 						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 	}
 
+	if (!scandesc->tid_queue)
+	{
+		/* Fall back to a queue size of 1 for now */
+		int			queue_size = 1;
+
+		if (estate->es_use_prefetching && ScanDirectionIsForward(direction))
+			queue_size = TID_QUEUE_SIZE;
+		scandesc->tid_queue = tid_queue_alloc(queue_size);
+		index_pgsr_alloc(scandesc);
+	}
+
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	for (;;)
 	{
+		ItemPointerData last_tid = scandesc->xs_heaptid;
+
+		/*
+		 * If we haven't exhausted TIDs from the index, then fill the queue
+		 * with TIDs from the index until the queue is full. Mark the index as
+		 * exhausted if we reach the end of it.
+		 */
+		if (!scandesc->index_done)
+		{
+			while (!TID_QUEUE_FULL(scandesc->tid_queue))
+			{
+				ItemPointer tid;
+
+				if ((tid = index_getnext_tid(scandesc, direction)) == NULL)
+				{
+					scandesc->index_done = true;
+					break;
+				}
+
+				index_tid_enqueue(scandesc->tid_queue, tid, scandesc->xs_recheck,
+								  NULL, NULL);
+			}
+		}
+
+		if (scandesc->xs_heap_continue)
+			scandesc->xs_heaptid = last_tid;
+
+		/*
+		 * index_fetch_heap() returns false when either the tuple isn't
+		 * visible or when there's no more to read
+		 */
+		if (!index_fetch_heap(scandesc, slot))
+		{
+			if (ItemPointerIsValid(&scandesc->xs_heaptid))
+				continue;
+
+			if (!scandesc->index_done)
+				continue;
+
+			if (scandesc->xs_heap_continue)
+				continue;
+
+			break;
+		}
+
 		CHECK_FOR_INTERRUPTS();
 
+		scandesc->xs_recheck = scandesc->xs_heapfetch->recheck;
+		scandesc->xs_heaptid = scandesc->xs_heapfetch->tid;
+
 		/*
 		 * If the index was lossy, we have to recheck the index quals using
 		 * the fetched tuple.
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 610f4a56d6..3360da8288 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4719,6 +4719,7 @@ create_mergejoin_plan(PlannerInfo *root,
 
 	/* Costs of sort and material steps are included in path cost already */
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
+	root->allow_prefetch = false;
 
 	return join_plan;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2e2458b128..3be133f757 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -413,11 +413,14 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions,
 	root = subquery_planner(glob, parse, NULL,
 							false, tuple_fraction);
 
+	root->allow_prefetch = true;
+
 	/* Select best Path and turn it into a Plan */
 	final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
 	best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);
 
 	top_plan = create_plan(root, best_path);
+	top_plan->allow_prefetch = root->allow_prefetch;
 
 	/*
 	 * If creating a plan for a scrollable cursor, make sure it can run
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fe..aaccf25a7f 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -34,6 +34,7 @@ struct PgStreamingRead
 	int			pinned_buffers_trigger;
 	int			next_tail_buffer;
 	bool		finished;
+	bool		resumable;
 	void	   *pgsr_private;
 	PgStreamingReadBufferCB callback;
 	BufferAccessStrategy strategy;
@@ -245,10 +246,12 @@ pg_streaming_read_new_range(PgStreamingRead *pgsr)
 static void
 pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 {
+	bool		done = pgsr->finished && !pgsr->resumable;
+
 	/*
 	 * If we're finished or can't start more I/O, then don't look ahead.
 	 */
-	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+	if (done || pgsr->ios_in_progress == pgsr->max_ios)
 		return;
 
 	/*
@@ -433,3 +436,9 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 		pfree(pgsr->per_buffer_data);
 	pfree(pgsr);
 }
+
+void
+pg_streaming_read_set_resumable(PgStreamingRead *pgsr)
+{
+	pgsr->resumable = true;
+}
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8026c2b36d..13d6fab318 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -14,6 +14,7 @@
 #ifndef GENAM_H
 #define GENAM_H
 
+#include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/skey.h"
 #include "nodes/tidbitmap.h"
@@ -24,6 +25,22 @@
 /* We don't want this file to depend on execnodes.h. */
 struct IndexInfo;
 
+typedef struct TIDQueueItem
+{
+	ItemPointerData tid;
+	bool		recheck;
+	IndexTuple	itup;
+	HeapTuple	htup;
+} TIDQueueItem;
+
+typedef struct TIDQueue
+{
+	uint64		head;
+	uint64		tail;
+	int			size;
+	TIDQueueItem data[FLEXIBLE_ARRAY_MEMBER /* size */ ];
+} TIDQueue;
+
 /*
  * Struct for statistics returned by ambuild
  */
@@ -175,6 +192,30 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
+
+extern void index_pgsr_alloc(IndexScanDesc scan);
+
+extern void tid_queue_reset(TIDQueue *q);
+
+extern void index_tid_enqueue(TIDQueue *tid_queue, ItemPointer tid, bool recheck,
+							  HeapTuple htup, IndexTuple itup);
+
+#define TID_QUEUE_SIZE 6
+
+extern TIDQueue *tid_queue_alloc(int size);
+
+static inline bool
+TID_QUEUE_FULL(TIDQueue *q)
+{
+	return q->tail - q->head == q->size;
+}
+
+static inline bool
+TID_QUEUE_EMPTY(TIDQueue *q)
+{
+	return q->head == q->tail;
+}
+
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..66b3d90c83 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -18,6 +18,7 @@
 #include "access/itup.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
+#include "storage/streaming_read.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
 
@@ -104,8 +105,16 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker;
 typedef struct IndexFetchTableData
 {
 	Relation	rel;
+	PgStreamingRead *pgsr;
+
+	ItemPointerData tid;
+	bool		recheck;
+	IndexTuple	itup;
+	HeapTuple	htup;
 } IndexFetchTableData;
 
+typedef struct TIDQueue TIDQueue;
+
 /*
  * We use the same IndexScanDescData structure for both amgettuple-based
  * and amgetbitmap-based index scans.  Some fields are only relevant in
@@ -148,6 +157,8 @@ typedef struct IndexScanDescData
 	bool		xs_heap_continue;	/* T if must keep walking, potential
 									 * further results */
 	IndexFetchTableData *xs_heapfetch;
+	TIDQueue   *tid_queue;
+	bool		index_done;
 
 	bool		xs_recheck;		/* T means scan keys must be rechecked */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..a8042abf95 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -697,6 +697,7 @@ typedef struct EState
 	struct EPQState *es_epq_active;
 
 	bool		es_use_parallel_mode;	/* can we use parallel workers? */
+	bool		es_use_prefetching;
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 534692bee1..171066b14e 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -203,6 +203,7 @@ struct PlannerInfo
 
 	/* 1 at the outermost Query */
 	Index		query_level;
+	bool		allow_prefetch;
 
 	/* NULL at outermost Query */
 	PlannerInfo *parent_root pg_node_attr(read_write_ignore);
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..317de2d781 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,7 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+	bool		allow_prefetch;
 } Plan;
 
 /* ----------------
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c54..2288b7b5eb 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -42,4 +42,6 @@ extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
 
+extern void pg_streaming_read_set_resumable(PgStreamingRead *pgsr);
+
 #endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0e34145187..ecc910ff35 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2726,6 +2726,8 @@ TBMSharedIteratorState
 TBMStatus
 TBlockState
 TIDBitmap
+TIDQueue
+TIDQueueItem
 TM_FailureData
 TM_IndexDelete
 TM_IndexDeleteOp
-- 
2.37.2




* Re: index prefetching

From: Tomas Vondra @ 2024-02-13 19:00 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On 2/7/24 22:48, Melanie Plageman wrote:
> ...
> 
> Attached is a patch which implements a real queue and fixes some of
> the issues with the previous version. It doesn't pass tests yet and
> has issues. Some are bugs in my implementation I need to fix. Some are
> issues we would need to solve in the streaming read API. Some are
> issues with index prefetching generally.
> 
> Note that these two patches have to be applied before 21d9c3ee4e
> because Thomas hasn't released a rebased version of the streaming read
> API patches yet.
> 

Thanks for working on this, and for investigating the various issues.

> Issues
> ---
> - kill prior tuple
> 
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.
> 

I admit I hadn't thought about kill_prior_tuple until you pointed it
out. Yeah, prefetching separates (de-synchronizes) the two scans (index
and heap) in a way that prevents this optimization. Or at least makes it
much more complex :-(

> In our case, we could try and fix this by prefetching only heap blocks
> referred to by index tuples on the same index page. Or we could try
> and keep a pool of index pages pinned and go back and kill index
> tuples on those pages.
> 

I think restricting the prefetching to a single index page would not be
a huge issue performance-wise - that's what the initial patch version
(implemented at the index AM level) did, pretty much. The prefetch queue
would get drained as we approach the end of the index page, but luckily
index pages tend to have a lot of entries. But it'd put an upper bound
on the prefetch distance (much lower than the effective_io_concurrency
maximum of 1000, but I'd say common values are 10-100 anyway).
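
To make the "upper bound" concrete, here is a minimal sketch of how the
usable prefetch distance would shrink as the scan approaches the end of
an index page. The function name and the item counts are hypothetical
and only illustrate the arithmetic; nothing like this exists in the
patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: how far ahead we may prefetch when prefetching is
 * restricted to TIDs from the current index page.
 *
 * target_distance - the distance we would like to use
 * items_on_page   - number of index entries on the current page
 * current_item    - 0-based position of the entry being processed
 */
static int
page_limited_distance(int target_distance, int items_on_page, int current_item)
{
	int			remaining;

	assert(target_distance >= 0);
	assert(current_item >= 0 && current_item < items_on_page);

	/* entries after the current one that are still on this page */
	remaining = items_on_page - current_item - 1;

	return (remaining < target_distance) ? remaining : target_distance;
}
```

With a few hundred entries per page, the limit only matters for the last
few entries, which is where the queue would drain.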

But how would we know we're on the same index page? That knowledge is
not available outside the index AM - the executor or indexam.c does not
know this, right? Presumably we could expose this, somehow, but it seems
like a violation of the abstraction ...

The same thing affects keeping multiple index pages pinned, for TIDs
that are yet to be used by the index scan. We'd need to know when to
release a pinned page, once we're done with processing all items.
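
One way to decide when a pinned index page can be released is a
per-page count of TIDs that were queued but not yet processed. The
sketch below is only an illustration under that assumption; the struct
and all function names are made up, and it ignores where such state
would live (which is exactly the abstraction problem mentioned above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one pinned index leaf page. */
typedef struct DemoPinnedPage
{
	uint32_t	blkno;			/* index block number */
	int			outstanding;	/* TIDs queued from this page, not yet done */
	bool		fully_read;		/* index scan has moved past this page */
	bool		pinned;			/* do we still hold the pin? */
} DemoPinnedPage;

static void
demo_page_pin(DemoPinnedPage *page, uint32_t blkno)
{
	page->blkno = blkno;
	page->outstanding = 0;
	page->fully_read = false;
	page->pinned = true;
}

/* Drop the pin once nothing can refer to the page anymore. */
static bool
demo_page_maybe_release(DemoPinnedPage *page)
{
	if (page->pinned && page->fully_read && page->outstanding == 0)
	{
		page->pinned = false;
		return true;
	}
	return false;
}

/* Called when a TID from this page is added to the prefetch queue. */
static void
demo_page_add_tid(DemoPinnedPage *page)
{
	assert(page->pinned && !page->fully_read);
	page->outstanding++;
}

/* Called when the index scan steps to the next page. */
static bool
demo_page_mark_read(DemoPinnedPage *page)
{
	assert(page->pinned);
	page->fully_read = true;
	return demo_page_maybe_release(page);
}

/* Called after the heap fetch for one of this page's TIDs is finished. */
static bool
demo_page_tid_done(DemoPinnedPage *page)
{
	assert(page->pinned && page->outstanding > 0);
	page->outstanding--;
	return demo_page_maybe_release(page);
}
```

While the page is still pinned, the heap visit for one of its TIDs could
in principle still report the index tuple as killable.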

FWIW I haven't tried implementing any of this, so maybe I'm missing
something and it can be made to work in a nice way.

> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> there is an easier way to fix this, as I don't think the mvcc test
> failed on Tomas' version.
> 

I kinda doubt it worked correctly, considering I simply ignored the
optimization. It's far more likely it just worked by luck.

> - switching scan directions
> 
> If the index scan switches directions on a given invocation of
> IndexNext(), heap blocks may have already been prefetched and read for
> blocks containing tuples beyond the point at which we want to switch
> directions.
> 
> We could fix this by having some kind of streaming read "reset"
> callback to drop all of the buffers which have been prefetched which
> are now no longer needed. We'd have to go backwards from the last TID
> which was yielded to the caller and figure out which buffers in the
> pgsr buffer ranges are associated with all of the TIDs which were
> prefetched after that TID. The TIDs are in the per_buffer_data
> associated with each buffer in pgsr. The issue would be searching
> through those efficiently.
> 

Yeah, that's roughly what I envisioned in one of my previous messages
about this issue - walking back the TIDs read from the index and added
to the prefetch queue.

> The other issue is that the streaming read API does not currently
> support backwards scans. So, if we switch to a backwards scan from a
> forwards scan, we would need to fallback to the non streaming read
> method. We could do this by just setting the TID queue size to 1
> (which is what I have currently implemented). Or we could add
> backwards scan support to the streaming read API.
> 

What do you mean by "support for backwards scans" in the streaming read
API? I imagined it naively as

1) drop all requests in the streaming read API queue

2) walk back all "future" requests in the TID queue

3) start prefetching as if from scratch

Maybe there's a way to optimize this and reuse some of the work more
efficiently, but my assumption is that the scan direction does not
change very often, and that we process many items in between.
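
The three steps above can be sketched on a toy model. This is not the
patch's code: DemoQueue and DemoScan are hypothetical stand-ins for the
TID queue and the index scan, a single ring buffer plays the role of
both the TID queue and the streaming read queue, and heap blocks are
plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QUEUE_SIZE 4

/* Hypothetical stand-in for the TID queue: a ring of heap block numbers. */
typedef struct DemoQueue
{
	uint64_t	head;			/* next item to consume */
	uint64_t	tail;			/* next free slot */
	uint32_t	data[DEMO_QUEUE_SIZE];
} DemoQueue;

/* Hypothetical stand-in for an index scan: an array plus a cursor. */
typedef struct DemoScan
{
	const uint32_t *blocks;		/* heap block for each index entry */
	size_t		nblocks;
	size_t		next;			/* next index entry to read */
	DemoQueue	queue;
} DemoScan;

static size_t
demo_queue_len(const DemoQueue *q)
{
	return (size_t) (q->tail - q->head);
}

/* Step 3: read ahead until the queue is full or the index is exhausted. */
static void
demo_fill(DemoScan *scan)
{
	DemoQueue  *q = &scan->queue;

	while (demo_queue_len(q) < DEMO_QUEUE_SIZE && scan->next < scan->nblocks)
	{
		q->data[q->tail % DEMO_QUEUE_SIZE] = scan->blocks[scan->next++];
		q->tail++;
	}
}

/* Hand the oldest queued block to the caller; returns 0 if none is queued. */
static int
demo_consume(DemoScan *scan, uint32_t *block)
{
	DemoQueue  *q = &scan->queue;

	if (q->head == q->tail)
		return 0;
	*block = q->data[q->head % DEMO_QUEUE_SIZE];
	q->head++;
	return 1;
}

/*
 * Steps 1 and 2: drop everything that was read ahead but not yet yielded,
 * and walk the index cursor back by the same number of entries.
 */
static size_t
demo_rewind(DemoScan *scan)
{
	size_t		dropped = demo_queue_len(&scan->queue);

	assert(dropped <= scan->next);
	scan->next -= dropped;
	scan->queue.head = scan->queue.tail = 0;
	return dropped;
}
```

After demo_rewind() the cursor sits right after the last entry that was
yielded to the caller, so calling demo_fill() again starts prefetching
from scratch. The cost is re-reading the dropped entries, which should
be fine if direction changes are rare.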

> - mark and restore
> 
> Similar to the issue with switching the scan direction, mark and
> restore requires us to reset the TID queue and streaming read queue.
> For now, I've hacked in something to the PlannerInfo and Plan to set
> the TID queue size to 1 for plans containing a merge join (yikes).
> 

Haven't thought about this very much, will take a closer look.

> - multiple executions
> 
> For reasons I don't entirely understand yet, multiple executions (not
> rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
> patch, I have disabled prefetching (and made the TID queue size 1)
> when execute_once is false.
> 

Don't work in what sense? What is (not) happening?

> - Index Only Scans need to return IndexTuples
> 
> Because index only scans return either the IndexTuple pointed to by
> IndexScanDesc->xs_itup or the HeapTuple pointed to by
> IndexScanDesc->xs_hitup -- both of which are populated by the index
> AM, we have to save copies of those IndexTupleData and HeapTupleDatas
> for every TID whose block we prefetch.
> 
> This might be okay, but it is a bit sad to have to make copies of those tuples.
> 
> In this patch, I still haven't figured out the memory management part.
> I copy over the tuples when enqueuing a TID queue item and then copy
> them back again when the streaming read API returns the
> per_buffer_data to us. Something is still not quite right here. I
> suspect this is part of the reason why some of the other tests are
> failing.
> 

It's not clear to me why you need to copy the tuples back - shouldn't
it be enough to copy the tuple just once?

FWIW if we decide to pin multiple index pages (to make kill_prior_tuple
work), that would also mean we don't need to copy any tuples, right? We
could point into the buffers for all of them, right?

> Other issues/gaps in my implementation:
> 
> Determining where to allocate the memory for the streaming read object
> and the TID queue is an outstanding TODO. To implement a fallback
> method for cases in which streaming read doesn't work, I set the queue
> size to 1. This is obviously not good.
> 

I think IndexFetchTableData seems like a not entirely terrible place for
allocating the pgsr, but I wonder what Andres thinks about this. IIRC he
advocated for doing the prefetching in executor, and I'm not sure
heapam_handler.c + relscan.h is what he imagined ...

Also, when you say "obviously not good" - why? Are you concerned about
the extra overhead of shuffling stuff between queues, or something else?

> Right now, I allocate the TID queue and streaming read objects in
> IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
> index_beginscan() (and index_beginscan_parallel()) is tricky though
> because we don't know the scan direction at that point (and the scan
> direction can change). There are also callers of index_beginscan() who
> do not call Index[Only]Next() (like systable_getnext() which calls
> index_getnext_slot() directly).
> 

Yeah, not sure this is the right layering ... the initial patch did
everything in individual index AMs, then it moved to indexam.c, then to
executor. And this seems to move it to lower layers again ...

> Also, my implementation does not yet have the optimization Tomas does
> to skip prefetching recently prefetched blocks. As he has said, it
> probably makes sense to add something to do this in a lower layer --
> such as in the streaming read API or even in bufmgr.c (maybe in
> PrefetchSharedBuffer()).
> 

I agree this should happen in lower layers. I'd probably do this in the
streaming read API, because that would define the "scope" of the cache
(pages prefetched for that read). Doing it in PrefetchSharedBuffer seems
like it would mean a single cache for that particular backend.

But that's just an initial thought ...
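
For illustration, a per-read cache of recently prefetched blocks could
be as small as the sketch below. The names and the size of 8 are
arbitrary assumptions, and this is neither code from the patch nor part
of the streaming read API. The point is only that the cache state would
belong to one streaming read object rather than to the backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RECENT_BLOCKS 8			/* arbitrary size, for illustration only */

/* Hypothetical per-read cache of the most recently prefetched blocks. */
typedef struct RecentBlocks
{
	uint32_t	blocks[RECENT_BLOCKS];
	int			count;			/* valid entries, at most RECENT_BLOCKS */
	int			next;			/* slot to overwrite next */
} RecentBlocks;

static void
recent_blocks_init(RecentBlocks *rb)
{
	rb->count = 0;
	rb->next = 0;
}

/*
 * Returns true if the block was not prefetched recently (so a prefetch
 * should be issued) and remembers it; returns false if it is still cached.
 */
static bool
recent_blocks_check_and_add(RecentBlocks *rb, uint32_t block)
{
	for (int i = 0; i < rb->count; i++)
	{
		if (rb->blocks[i] == block)
			return false;
	}

	/* not found: overwrite the oldest slot */
	rb->blocks[rb->next] = block;
	rb->next = (rb->next + 1) % RECENT_BLOCKS;
	if (rb->count < RECENT_BLOCKS)
		rb->count++;

	return true;
}
```

A linear search is good enough for a handful of entries; a larger cache
would want a hash table.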

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


* Re: index prefetching

From: Peter Geoghegan @ 2024-02-13 19:54 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Melanie Plageman <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
<[email protected]> wrote:
> On 2/7/24 22:48, Melanie Plageman wrote:
> I admit I haven't thought about kill_prior_tuple until you pointed it out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(

Another thing that argues against doing this is that we might not need
to visit any more B-Tree leaf pages when there is a LIMIT n involved.
We could end up scanning a whole extra leaf page (including all of its
tuples) for want of the ability to "push down" a LIMIT to the index AM
(that's not what happens right now, but it isn't really needed at all
right now).

This property of index scans is fundamental to how index scans work.
Pinning an index page as an interlock against concurrent TID
recycling by VACUUM is directly described by the index API docs [1],
even (the docs actually use terms like "buffer pin" rather than
something more abstract sounding). I don't think that anything
affecting that behavior should be considered an implementation detail
of the nbtree index AM as such (nor any particular index AM).

I think that it makes sense to put the index AM in control here --
that almost follows from what I said about the index AM API. The index
AM already needs to be in control, in about the same way, to deal with
kill_prior_tuple (plus it helps with the LIMIT issue I described).

There doesn't necessarily need to be much code duplication to make
that work. Offhand I suspect it would be kind of similar to how
deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets
by with generic logic implemented by
index_compute_xid_horizon_for_tuples() -- that's all that we need to
determine a snapshotConflictHorizon value for recovery conflict
purposes. Note that index_compute_xid_horizon_for_tuples() reads
*index* pages, despite not being aware of the caller's index AM and
index tuple format.

(The only reason why nbtree needs a custom solution is because it has
posting list tuples to worry about, unlike GiST and unlike Hash, which
consistently use unadorned generic IndexTuple structs with heap TID
represented in the standard/generic way only. While these concepts
probably all originated in nbtree, they're still not nbtree
implementation details.)

> > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> > there is an easier way to fix this, as I don't think the mvcc test
> > failed on Tomas' version.
> >
>
> I kinda doubt it worked correctly, considering I simply ignored the
> optimization. It's far more likely it just worked by luck.

The test that did fail will only have revealed that
kill_prior_tuple wasn't operating as expected -- which isn't the same
thing as giving wrong answers.

Note that there are various ways that concurrent TID recycling might
prevent _bt_killitems() from setting LP_DEAD bits. It's totally
unsurprising that breaking kill_prior_tuple in some way could be
missed. Andres wrote the MVCC test in question precisely because
certain aspects of kill_prior_tuple were broken for months without
anybody noticing.

[1] https://www.postgresql.org/docs/devel/index-locking.html
-- 
Peter Geoghegan

* Re: index prefetching

From: Melanie Plageman @ 2024-02-14 16:40 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>; Peter Geoghegan <[email protected]>

On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
<[email protected]> wrote:
>
> On 2/7/24 22:48, Melanie Plageman wrote:
> > ...
> > Issues
> > ---
> > - kill prior tuple
> >
> > This optimization doesn't work with index prefetching with the current
> > design. Kill prior tuple relies on alternating between fetching a
> > single index tuple and visiting the heap. After visiting the heap we
> > can potentially kill the immediately preceding index tuple. Once we
> > fetch multiple index tuples, enqueue their TIDs, and later visit the
> > heap, the next index page we visit may not contain all of the index
> > tuples deemed killable by our visit to the heap.
> >
>
> I admit I haven't thought about kill_prior_tuple until you pointed it out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(
>
> > In our case, we could try and fix this by prefetching only heap blocks
> > referred to by index tuples on the same index page. Or we could try
> > and keep a pool of index pages pinned and go back and kill index
> > tuples on those pages.
> >
>
> I think restricting the prefetching to a single index page would not be
> a huge issue performance-wise - that's what the initial patch version
> (implemented at the index AM level) did, pretty much. The prefetch queue
> would get drained as we approach the end of the index page, but luckily
> index pages tend to have a lot of entries. But it'd put an upper bound
> on the prefetch distance (much lower than the e_i_c maximum 1000, but
> I'd say common values are 10-100 anyway).
>
> But how would we know we're on the same index page? That knowledge is
> not available outside the index AM - the executor or indexam.c does not
> know this, right? Presumably we could expose this, somehow, but it seems
> like a violation of the abstraction ...

The easiest way to do this would be to have the index AM amgettuple()
functions set a new member in the IndexScanDescData which is either
the index page identifier or a boolean that indicates we have moved on
to the next page. Then, when filling the queue, we would stop doing so
when the page switches. Now, this wouldn't really work for the first
index tuple on each new page, so, perhaps we would need the index AMs
to implement some kind of "peek" functionality.
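
A minimal sketch of the "stop filling when the page switches" idea, under
loud assumptions: DemoScan, DemoTid and the demo_* functions are
hypothetical stand-ins, not PostgreSQL structures or APIs, and
demo_gettuple() only plays the role of amgettuple() reporting a page
identifier. The first tuple of a new page is parked in a one-item
"pending" slot, which is one possible reading of the "peek" idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef struct DemoTid { unsigned block; unsigned offset; } DemoTid;

typedef struct DemoScan
{
    const DemoTid  *tids;         /* TIDs in index order */
    const unsigned *pages;        /* index page each TID came from */
    size_t          ntids;
    size_t          next;         /* next item the "AM" will return */
    unsigned        cur_page;     /* page of the last item seen */
    bool            have_pending; /* item fetched but not yet queued */
    DemoTid         pending;
} DemoScan;

static void
demo_scan_init(DemoScan *scan, const DemoTid *tids,
               const unsigned *pages, size_t ntids)
{
    scan->tids = tids;
    scan->pages = pages;
    scan->ntids = ntids;
    scan->next = 0;
    scan->cur_page = 0;
    scan->have_pending = false;
}

/* Plays the role of amgettuple(); returns false at end of scan. */
static bool
demo_gettuple(DemoScan *scan, DemoTid *tid, unsigned *page)
{
    if (scan->next >= scan->ntids)
        return false;
    *tid = scan->tids[scan->next];
    *page = scan->pages[scan->next];
    scan->next++;
    return true;
}

/*
 * Queue up to 'max' TIDs that all come from a single index page.
 * Returns the number of TIDs placed in 'queue'.
 */
static size_t
demo_fill_queue(DemoScan *scan, DemoTid *queue, size_t max)
{
    size_t   n = 0;
    DemoTid  tid;
    unsigned page;

    assert(scan != NULL && queue != NULL);
    if (max == 0)
        return 0;

    /* Start with the item that ended the previous batch, if any. */
    if (scan->have_pending)
    {
        queue[n++] = scan->pending;
        scan->have_pending = false;
    }

    while (n < max && demo_gettuple(scan, &tid, &page))
    {
        if (n > 0 && page != scan->cur_page)
        {
            /* Page switched: park this item and stop filling. */
            scan->pending = tid;
            scan->have_pending = true;
            scan->cur_page = page;
            break;
        }
        scan->cur_page = page;
        queue[n++] = tid;
    }
    return n;
}
```

Note that by the time the page switch is noticed here, the simulated AM
has already advanced to the next page. That is the first-tuple problem
mentioned above, and it is why a real implementation would probably need
a genuine peek rather than a parked item.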

Or, we could provide the index AM with a max queue size and allow it
to fill up the queue with the TIDs it wants (which it could keep to
the same index page). And, for the index-only scan case, we could have
some kind of flag which indicates if the caller is putting
TIDs+HeapTuples or TIDs+IndexTuples on the queue, which might reduce
the amount of space we need. I'm not sure who manages the memory here.

I wasn't quite sure how we could use
index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
suggestion. But, I'd like to understand.

> > - switching scan directions
> >
> > If the index scan switches directions on a given invocation of
> > IndexNext(), heap blocks may have already been prefetched and read for
> > blocks containing tuples beyond the point at which we want to switch
> > directions.
> >
> > We could fix this by having some kind of streaming read "reset"
> > callback to drop all of the buffers which have been prefetched which
> > are now no longer needed. We'd have to go backwards from the last TID
> > which was yielded to the caller and figure out which buffers in the
> > pgsr buffer ranges are associated with all of the TIDs which were
> > prefetched after that TID. The TIDs are in the per_buffer_data
> > associated with each buffer in pgsr. The issue would be searching
> > through those efficiently.
> >
>
> Yeah, that's roughly what I envisioned in one of my previous messages
> about this issue - walking back the TIDs read from the index and added
> to the prefetch queue.
>
> > The other issue is that the streaming read API does not currently
> > support backwards scans. So, if we switch to a backwards scan from a
> > forwards scan, we would need to fall back to the non-streaming read
> > method. We could do this by just setting the TID queue size to 1
> > (which is what I have currently implemented). Or we could add
> > backwards scan support to the streaming read API.
> >
>
> What do you mean by "support for backwards scans" in the streaming read
> API? I imagined it naively as
>
> 1) drop all requests in the streaming read API queue
>
> 2) walk back all "future" requests in the TID queue
>
> 3) start prefetching as if from scratch
>
> Maybe there's a way to optimize this and reuse some of the work more
> efficiently, but my assumption is that the scan direction does not
> change very often, and that we process many items in between.

Yes, the steps you mention for resetting the queues make sense. What I
meant by "backwards scan is not supported by the streaming read API"
is that Thomas/Andres had mentioned that the streaming read API does
not support backwards scans right now. Though, since the callback just
returns a block number, I don't know how it would break.

When switching between a forwards and backwards scan, does it go
backwards from the current position or start at the end (or beginning)
of the relation? If it is the former, then the blocks would most
likely be in shared buffers -- which the streaming read API handles.
It is not obvious to me from looking at the code what the gap is, so
perhaps Thomas could weigh in.

As for handling this in index prefetching, if you think a TID queue
size of 1 is a sufficient fallback method, then resetting the pgsr
queue and resizing the TID queue to 1 would work with no issues. If
the fallback method requires the streaming read code path not be used
at all, then that is more work.
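
The reset-and-fall-back path discussed here could be sketched roughly as
follows. This is only an illustration: DemoTidQueue and the demo_*
functions are hypothetical names, not part of PostgreSQL or the
streaming read patch, and the pgsr side is not modeled at all. The reset
drops queued-but-unconsumed entries, reports how many index positions
the caller would have to walk back, and shrinks the usable queue size
(1 being the fallback mentioned above).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_CAPACITY 8

/* Hypothetical TID queue, reduced to heap block numbers in a ring buffer. */
typedef struct DemoTidQueue
{
    unsigned blocks[DEMO_QUEUE_CAPACITY];
    size_t   head;   /* index of the oldest unconsumed entry */
    size_t   count;  /* number of unconsumed entries */
    size_t   limit;  /* how many entries may be queued at once */
} DemoTidQueue;

static void
demo_queue_init(DemoTidQueue *q)
{
    q->head = 0;
    q->count = 0;
    q->limit = DEMO_QUEUE_CAPACITY;
}

/* Returns 1 if the block was queued, 0 if the queue is at its limit. */
static int
demo_queue_push(DemoTidQueue *q, unsigned block)
{
    if (q->count >= q->limit)
        return 0;
    q->blocks[(q->head + q->count) % DEMO_QUEUE_CAPACITY] = block;
    q->count++;
    return 1;
}

/* Returns 1 and sets *block if an entry was available, else 0. */
static int
demo_queue_pop(DemoTidQueue *q, unsigned *block)
{
    if (q->count == 0)
        return 0;
    *block = q->blocks[q->head];
    q->head = (q->head + 1) % DEMO_QUEUE_CAPACITY;
    q->count--;
    return 1;
}

/*
 * Direction change: forget all "future" entries and restrict the queue
 * to 'new_limit' entries. The return value is the number of dropped
 * entries, i.e. how far the index position would have to be walked back.
 */
static size_t
demo_queue_reset(DemoTidQueue *q, size_t new_limit)
{
    size_t dropped = q->count;

    assert(new_limit >= 1 && new_limit <= DEMO_QUEUE_CAPACITY);
    q->head = 0;
    q->count = 0;
    q->limit = new_limit;
    return dropped;
}
```

With a limit of 1 the queue degenerates into "fetch one TID, visit the
heap", which keeps a single control flow at the cost of an unused
allocation.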

> > - multiple executions
> >
> > For reasons I don't entirely understand yet, multiple executions (not
> > rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
> > patch, I have disabled prefetching (and made the TID queue size 1)
> > when execute_once is false.
> >
>
> Don't work in what sense? What is (not) happening?

I got wrong results for this. I'll have to do more investigation, but
I assumed that not resetting the TID queue and pgsr queue was also the
source of this issue.

What I imagined we would do is figure out if there is a viable
solution for the larger design issues and then investigate what seemed
like smaller issues. But, perhaps I should dig into this first to
ensure there isn't a larger issue.

> > - Index Only Scans need to return IndexTuples
> >
> > Because index only scans return either the IndexTuple pointed to by
> > IndexScanDesc->xs_itup or the HeapTuple pointed to by
> > IndexScanDesc->xs_hitup -- both of which are populated by the index
> > AM, we have to save copies of those IndexTupleData and HeapTupleDatas
> > for every TID whose block we prefetch.
> >
> > This might be okay, but it is a bit sad to have to make copies of those tuples.
> >
> > In this patch, I still haven't figured out the memory management part.
> > I copy over the tuples when enqueuing a TID queue item and then copy
> > them back again when the streaming read API returns the
> > per_buffer_data to us. Something is still not quite right here. I
> > suspect this is part of the reason why some of the other tests are
> > failing.
> >
>
> It's not clear to me why you need to copy the tuples back - shouldn't
> it be enough to copy the tuple just once?

When enqueueing a TID, the IndexTuple has to be copied from the scan
descriptor to somewhere in memory with a TIDQueueItem pointing to it.
Once we do this, the IndexTuple memory should stick around until we
free it, so yes, I'm not sure why I was seeing the IndexTuple no
longer be valid when I tried to put it in a slot. I'll have to do more
investigation.

> FWIW if we decide to pin multiple index pages (to make kill_prior_tuple
> work), that would also mean we don't need to copy any tuples, right? We
> could point into the buffers for all of them, right?

Yes, this would be a nice benefit.

> > Other issues/gaps in my implementation:
> >
> > Determining where to allocate the memory for the streaming read object
> > and the TID queue is an outstanding TODO. To implement a fallback
> > method for cases in which streaming read doesn't work, I set the queue
> > size to 1. This is obviously not good.
> >
>
> I think IndexFetchTableData seems like a not entirely terrible place for
> allocating the pgsr, but I wonder what Andres thinks about this. IIRC he
> advocated for doing the prefetching in executor, and I'm not sure
> heapam_handler.c + relscan.h is what he imagined ...
>
> Also, when you say "obviously not good" - why? Are you concerned about
> the extra overhead of shuffling stuff between queues, or something else?

Well, I didn't resize the queue, I just limited how much of it we can
use to a single member (thus wasting the other memory). But resizing a
queue isn't free either. Also, I wondered if a queue size of 1 for
index AMs using the fallback method is too confusing (like it is a
fake queue?). But, I'd really, really rather not maintain both a queue
and non-queue control flow for Index[Only]Next(). The maintenance
overhead seems like it would outweigh the potential downsides.

> > Right now, I allocate the TID queue and streaming read objects in
> > IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
> > index_beginscan() (and index_beginscan_parallel()) is tricky though
> > because we don't know the scan direction at that point (and the scan
> > direction can change). There are also callers of index_beginscan() who
> > do not call Index[Only]Next() (like systable_getnext() which calls
> > index_getnext_slot() directly).
> >
>
> Yeah, not sure this is the right layering ... the initial patch did
> everything in individual index AMs, then it moved to indexam.c, then to
> executor. And this seems to move it to lower layers again ...

If we do something like make the index AM responsible for the TID
queue (as mentioned above as a potential solution to the kill prior
tuple issue), then we might be able to allocate the TID queue in the
index AMs?

As for the streaming read object, if we were able to solve the issue
where callers of index_beginscan() don't call Index[Only]Next() (and
thus shouldn't allocate a streaming read object), then it seems easy
enough to move the streaming read object allocation into the table
AM-specific begin scan method.

> > Also, my implementation does not yet have the optimization Tomas does
> > to skip prefetching recently prefetched blocks. As he has said, it
> > probably makes sense to add something to do this in a lower layer --
> > such as in the streaming read API or even in bufmgr.c (maybe in
> > PrefetchSharedBuffer()).
> >
>
> I agree this should happen in lower layers. I'd probably do this in the
> streaming read API, because that would define "scope" of the cache
> (pages prefetched for that read). Doing it in PrefetchSharedBuffer seems
> like it would do a single cache (for that particular backend).

Hmm. I wonder if there are any upsides to having the cache be
per-backend. Though, that does sound like a whole other project...
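
To make the scoping question concrete, here is a minimal sketch of a
"recently prefetched blocks" filter. It is not the code from Tomas'
patch; DemoRecentBlocks and demo_should_prefetch() are hypothetical
names, and the fixed four-slot ring with linear search is just an
assumption to keep the example short. One instance per streaming read
would give the per-read scope described above; a single instance per
backend would give the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RECENT_SLOTS 4

/* Hypothetical ring of the most recently prefetched block numbers. */
typedef struct DemoRecentBlocks
{
    unsigned blocks[DEMO_RECENT_SLOTS];
    size_t   used;   /* number of valid slots */
    size_t   next;   /* slot to overwrite next */
} DemoRecentBlocks;

static void
demo_recent_init(DemoRecentBlocks *r)
{
    r->used = 0;
    r->next = 0;
}

/*
 * Returns true if 'block' should be prefetched, i.e. it was not among
 * the last DEMO_RECENT_SLOTS blocks remembered. A block that passes the
 * filter is remembered, overwriting the oldest entry.
 */
static bool
demo_should_prefetch(DemoRecentBlocks *r, unsigned block)
{
    assert(r != NULL);

    for (size_t i = 0; i < r->used; i++)
    {
        if (r->blocks[i] == block)
            return false;       /* recently prefetched, skip it */
    }

    r->blocks[r->next] = block;
    r->next = (r->next + 1) % DEMO_RECENT_SLOTS;
    if (r->used < DEMO_RECENT_SLOTS)
        r->used++;
    return true;
}
```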

- Melanie






* Re: index prefetching

From: Peter Geoghegan @ 2024-02-14 19:21 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman
<[email protected]> wrote:
> I wasn't quite sure how we could use
> index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
> suggestion. But, I'd like to understand.

The point I was trying to make with that example was: a highly generic
mechanism can sometimes work across disparate index AMs (that all at
least support plain index scans) when it just so happens that these
AMs don't actually differ in a way that could possibly matter to that
mechanism. While it's true that (say) nbtree and hash are very
different at a high level, it's nevertheless also true that the way
things work at the level of individual index pages is much more
similar than different.

With index deletion, we know that the differences between each
supported index AM either don't matter at all (which is what obviates
the need for index_compute_xid_horizon_for_tuples() to be directly
aware of which index AM the page it is passed comes from), or matter
only in small, incidental ways (e.g., nbtree stores posting lists in
its tuples, despite using IndexTuple structs).

With prefetching, it seems reasonable to suppose that an index-AM
specific approach would end up needing very little truly custom code.
This is pretty strongly suggested by the fact that the rules around
buffer pins (as an interlock against concurrent TID recycling by
VACUUM) are standardized by the index AM API itself. Those rules might
be slightly more natural with nbtree, but that's kinda beside the
point. While the basic organizing principle for where each index tuple
goes can vary enormously, it doesn't necessarily matter at all -- in
the end, you're really just reading each index page (that has TIDs to
read) exactly once per scan, in some fixed order, with interlaced
inline heap accesses (that go fetch heap tuples for each individual
TID read from each index page).

In general I don't accept that we need to do things outside the index
AM, because software architecture encapsulation something something. I
suspect that we'll need to share some limited information across
different layers of abstraction, because that's just fundamentally
what's required by the constraints we're operating under. Can't really
prove it, though.

-- 
Peter Geoghegan

* Re: index prefetching

From: Melanie Plageman @ 2024-02-14 21:02 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Robert Haas <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>; Peter Geoghegan <[email protected]>

On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman
<[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
> <[email protected]> wrote:
> >
> > On 2/7/24 22:48, Melanie Plageman wrote:
> > > ...
> > > - switching scan directions
> > >
> > > If the index scan switches directions on a given invocation of
> > > IndexNext(), heap blocks may have already been prefetched and read for
> > > blocks containing tuples beyond the point at which we want to switch
> > > directions.
> > >
> > > We could fix this by having some kind of streaming read "reset"
> > > callback to drop all of the buffers which have been prefetched which
> > > are now no longer needed. We'd have to go backwards from the last TID
> > > which was yielded to the caller and figure out which buffers in the
> > > pgsr buffer ranges are associated with all of the TIDs which were
> > > prefetched after that TID. The TIDs are in the per_buffer_data
> > > associated with each buffer in pgsr. The issue would be searching
> > > through those efficiently.
> > >
> >
> > Yeah, that's roughly what I envisioned in one of my previous messages
> > about this issue - walking back the TIDs read from the index and added
> > to the prefetch queue.
> >
> > > The other issue is that the streaming read API does not currently
> > > support backwards scans. So, if we switch to a backwards scan from a
> > > forwards scan, we would need to fall back to the non-streaming read
> > > method. We could do this by just setting the TID queue size to 1
> > > (which is what I have currently implemented). Or we could add
> > > backwards scan support to the streaming read API.
> > >
> >
> > What do you mean by "support for backwards scans" in the streaming read
> > API? I imagined it naively as
> >
> > 1) drop all requests in the streaming read API queue
> >
> > 2) walk back all "future" requests in the TID queue
> >
> > 3) start prefetching as if from scratch
> >
> > Maybe there's a way to optimize this and reuse some of the work more
> > efficiently, but my assumption is that the scan direction does not
> > change very often, and that we process many items in between.
>
> Yes, the steps you mention for resetting the queues make sense. What I
> meant by "backwards scan is not supported by the streaming read API"
> is that Thomas/Andres had mentioned that the streaming read API does
> not support backwards scans right now. Though, since the callback just
> returns a block number, I don't know how it would break.
>
> When switching between a forwards and backwards scan, does it go
> backwards from the current position or start at the end (or beginning)
> of the relation?

Okay, well I answered this question for myself, by, um, trying it :).
FETCH backward will go backwards from the current cursor position. So,
I don't see exactly why this would be an issue.

> If it is the former, then the blocks would most
> likely be in shared buffers -- which the streaming read API handles.
> It is not obvious to me from looking at the code what the gap is, so
> perhaps Thomas could weigh in.

I have the same problem with the sequential scan streaming read user,
so I am going to try to figure out this backwards scan and switching
scan direction thing there (where we don't have other issues).

- Melanie






* Re: index prefetching

From: Robert Haas @ 2024-02-14 07:10 UTC (permalink / raw)
  To: Melanie Plageman <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman
<[email protected]> wrote:
> - kill prior tuple
>
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.

Is this maybe just a bookkeeping problem? A Boolean that says "you can
kill the prior tuple" is well-suited if and only if the prior tuple is
well-defined. But perhaps it could be replaced with something more
sophisticated that tells you which tuples are eligible to be killed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: index prefetching

From: Tomas Vondra @ 2024-02-14 14:13 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; Melanie Plageman <[email protected]>; +Cc: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>



On 2/14/24 08:10, Robert Haas wrote:
> On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman
> <[email protected]> wrote:
>> - kill prior tuple
>>
>> This optimization doesn't work with index prefetching with the current
>> design. Kill prior tuple relies on alternating between fetching a
>> single index tuple and visiting the heap. After visiting the heap we
>> can potentially kill the immediately preceding index tuple. Once we
>> fetch multiple index tuples, enqueue their TIDs, and later visit the
>> heap, the next index page we visit may not contain all of the index
>> tuples deemed killable by our visit to the heap.
> 
> Is this maybe just a bookkeeping problem? A Boolean that says "you can
> kill the prior tuple" is well-suited if and only if the prior tuple is
> well-defined. But perhaps it could be replaced with something more
> sophisticated that tells you which tuples are eligible to be killed.
> 

I don't think it's just a bookkeeping problem. In a way, nbtree already
does keep an array of tuples to kill (see btgettuple), but it's always
for the current index page. So it's not that we immediately go and kill
the prior tuple - nbtree already stashes it in an array, and kills all
those tuples when moving to the next index page.

The way I understand the problem is that with prefetching we're bound to
determine the kill_prior_tuple flag with a delay, in which case we might
have already moved to the next index page ...


So to make this work, we'd need to:

1) keep index pages pinned for all "in flight" TIDs (read from the
index, not yet consumed by the index scan)

2) keep a separate array of "to be killed" index tuples for each page

3) have a more sophisticated way to decide when to kill tuples and unpin
the index page (instead of just doing it when moving to the next index page)

Maybe that's what you meant by "more sophisticated bookkeeping", ofc.
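
A minimal sketch of the bookkeeping that steps 1) to 3) above seem to
imply, assuming one state record per index leaf page. Every name here
(PageKillState, page_begin, page_consume_tid, MAX_ITEMS_PER_PAGE) is
hypothetical and exists neither in nbtree nor in the patch; the "pinned"
flag merely stands in for holding a real buffer pin.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ITEMS_PER_PAGE 8	/* toy limit, purely illustrative */

/* Hypothetical per-leaf-page state; not an actual nbtree structure. */
typedef struct PageKillState
{
	unsigned	blkno;		/* index leaf page number */
	bool		pinned;		/* stand-in for holding a buffer pin (step 1) */
	int			in_flight;	/* TIDs read from this page, not yet consumed */
	int			nkilled;	/* entries used in killed[] */
	int			killed[MAX_ITEMS_PER_PAGE];	/* per-page kill array (step 2) */
	int			flushed;	/* items marked dead when the page was released */
} PageKillState;

/* Start tracking a leaf page whose ntids TIDs were queued for prefetch. */
static void
page_begin(PageKillState *p, unsigned blkno, int ntids)
{
	p->blkno = blkno;
	p->pinned = true;
	p->in_flight = ntids;
	p->nkilled = 0;
	p->flushed = 0;
}

/*
 * Record the outcome of one heap visit for a TID that came from this page.
 * Returns true once the last in-flight TID is consumed, at which point the
 * accumulated kills are applied and the pin is dropped (step 3).
 */
static bool
page_consume_tid(PageKillState *p, int itemoff, bool heap_tuple_dead)
{
	assert(p->pinned && p->in_flight > 0);

	if (heap_tuple_dead && p->nkilled < MAX_ITEMS_PER_PAGE)
		p->killed[p->nkilled++] = itemoff;

	if (--p->in_flight > 0)
		return false;		/* page still has unconsumed TIDs: keep pin */

	p->flushed = p->nkilled;	/* a real AM would mark the items dead here */
	p->pinned = false;
	return true;
}
```

In this toy model the decision to unpin depends on the count of in-flight
TIDs reaching zero, not on the scan having moved to the next leaf page, so
several pages can be tracked at once.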


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

^ permalink  raw  reply  [nested|flat] 25+ messages in thread

* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-23 17:43                 ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-24 00:51                   ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-24 09:19                     ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-24 20:20                       ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-02-07 21:48                         ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-02-14 07:10                           ` Re: index prefetching Robert Haas <[email protected]>
  2024-02-14 14:13                             ` Re: index prefetching Tomas Vondra <[email protected]>
@ 2024-02-15 04:29                               ` Robert Haas <[email protected]>
  2024-02-15 05:03                                 ` Re: index prefetching Andres Freund <[email protected]>
  0 siblings, 1 reply; 25+ messages in thread

From: Robert Haas @ 2024-02-15 04:29 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Melanie Plageman <[email protected]>; Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

On Wed, Feb 14, 2024 at 7:43 PM Tomas Vondra
<[email protected]> wrote:
> I don't think it's just a bookkeeping problem. In a way, nbtree already
> does keep an array of tuples to kill (see btgettuple), but it's always
> for the current index page. So it's not that we immediately go and kill
> the prior tuple - nbtree already stashes it in an array, and kills all
> those tuples when moving to the next index page.
>
> The way I understand the problem is that with prefetching we're bound to
> determine the kill_prior_tuple flag with a delay, in which case we might
> have already moved to the next index page ...

Well... I'm not clear on all of the details of how this works, but
this sounds broken to me, for the reasons that Peter G. mentions in
his comments about desynchronization. If we currently have a rule that
you hold a pin on the index page while processing the heap tuples it
references, you can't just throw that out the window and expect things
to keep working. Saying that kill_prior_tuple doesn't work when you
throw that rule out the window is probably understating the extent of
the problem very considerably.

I would have thought that the way this prefetching would work is that
we would bring pages into shared_buffers sooner than we currently do,
but not actually pin them until we're ready to use them, so that it's
possible they might be evicted again before we get around to them, if
we prefetch too far and the system is too busy. Alternately, it also
seems OK to read those later pages and pin them right away, as long as
(1) we don't also give up pins that we would have held in the absence
of prefetching and (2) we have some mechanism for limiting the number
of extra pins that we're holding to a reasonable number given the size
of shared_buffers.

However, it doesn't seem OK at all to give up pins that the current
code holds sooner than the current code would do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

^ permalink  raw  reply  [nested|flat] 25+ messages in thread

* Re: index prefetching
  2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
  2023-12-21 12:30 ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 13:43   ` Re: index prefetching Andres Freund <[email protected]>
  2023-12-21 15:20     ` Re: index prefetching Tomas Vondra <[email protected]>
  2023-12-21 15:43       ` Re: index prefetching Andres Freund <[email protected]>
  2024-01-04 14:55         ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-09 20:31           ` Re: index prefetching Robert Haas <[email protected]>
  2024-01-12 16:42             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-19 21:43               ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-23 17:43                 ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-24 00:51                   ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-01-24 09:19                     ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-01-24 20:20                       ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-02-07 21:48                         ` Re: index prefetching Melanie Plageman <[email protected]>
  2024-02-14 07:10                           ` Re: index prefetching Robert Haas <[email protected]>
  2024-02-14 14:13                             ` Re: index prefetching Tomas Vondra <[email protected]>
  2024-02-15 04:29                               ` Re: index prefetching Robert Haas <[email protected]>
@ 2024-02-15 05:03                                 ` Andres Freund <[email protected]>
  0 siblings, 0 replies; 25+ messages in thread

From: Andres Freund @ 2024-02-15 05:03 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Melanie Plageman <[email protected]>; PostgreSQL Hackers <[email protected]>; Georgios <[email protected]>; Thomas Munro <[email protected]>; Konstantin Knizhnik <[email protected]>; Dilip Kumar <[email protected]>

Hi,

On 2024-02-15 09:59:27 +0530, Robert Haas wrote:
> I would have thought that the way this prefetching would work is that
> we would bring pages into shared_buffers sooner than we currently do,
> but not actually pin them until we're ready to use them, so that it's
> possible they might be evicted again before we get around to them, if
> we prefetch too far and the system is too busy.

The issue here is that we need to read index leaf pages (synchronously for
now!) to get the tids to do readahead of table data. What you describe is done
for the table data (IMO not a good idea medium term [1]), but the problem at
hand is that once we've done readahead for all the tids on one index page, we
can't do more readahead without looking at the next index leaf page.

Obviously that would lead to a sawtooth-like IO pattern, where you'd regularly
have to wait for IO for the first tuples referenced by an index leaf page.

However, if we want to issue table readahead for tids on the neighboring index
leaf page, we'll - as the patch stands - not hold a pin on the "current" index
leaf page. Which makes index prefetching as currently implemented incompatible
with kill_prior_tuple, as that requires the index leaf page pin being held.

> Alternately, it also seems OK to read those later pages and pin them right
> away, as long as (1) we don't also give up pins that we would have held in
> the absence of prefetching and (2) we have some mechanism for limiting the
> number of extra pins that we're holding to a reasonable number given the
> size of shared_buffers.

FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently
used to limit how many buffers a backend may pin for bulk relation extension.
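
To illustrate what a limit of kind (2) could look like, here is a rough
sketch that clamps a request for extra pins to a proportional per-backend
share of the buffer pool. The function clamp_extra_pins and its parameters
are hypothetical; this is not the code of LimitAdditionalPins(), whose
actual logic in bufmgr.c may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many of the requested extra pins may a backend
 * take, given the pool size, the number of backends sharing it, and the
 * pins this backend already holds?
 */
static uint32_t
clamp_extra_pins(uint32_t requested, uint32_t pool_buffers,
				 uint32_t max_backends, uint32_t already_pinned)
{
	uint32_t	fair_share;
	uint32_t	avail;

	assert(max_backends > 0);

	/* each backend gets an equal slice of the pool */
	fair_share = pool_buffers / max_backends;

	/* pins already held count against that slice */
	avail = (fair_share > already_pinned) ? fair_share - already_pinned : 0;

	/* always allow one pin so the scan can make progress */
	if (avail < 1)
		avail = 1;

	return (requested < avail) ? requested : avail;
}
```

With a pool of 1000 buffers shared by 100 backends, a backend that already
holds 4 pins and asks for 64 more would get 6 in this model.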

Greetings,

Andres Freund

[1] The main reasons that I think that just doing readahead without keeping a
pin is a bad idea, at least medium term, are:

a) To do AIO you need to hold a pin on the page while the IO is in progress,
as the target buffer contents will be modified at some moment you don't
control, so that buffer had better not be replaced while IO is in
progress. So at the very least you need to hold a pin until the IO is over.

b) If you do not keep a pin until you actually use the page, you need to
either do another buffer lookup (expensive!) or you need to remember the
buffer id and revalidate that it's still pointing to the same block (cheaper,
but still not cheap).  That's not just bad because it's slow in an absolute
sense, more importantly it increases the potential performance downside of
doing readahead for fully cached workloads, because you don't gain anything,
but pay the price of two lookups/revalidation.

Note that these reasons really just apply to cases where we read ahead because
we are quite certain we'll need exactly those blocks (leaving errors or
queries ending early aside), not for "heuristic" prefetching. If we e.g. were
to issue prefetch requests for neighboring index pages while descending during
an ordered index scan, without checking that we'll need those, it'd make sense
to just do a "throwaway" prefetch request.
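
Point b) can be illustrated with a toy buffer pool. This is only a sketch
under stated assumptions: ToyBuffer, toy_lookup, toy_load and
toy_revalidate_and_pin are hypothetical names, the linear scan stands in
for the real buffer mapping lookup, and no locking is modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUF 4					/* toy pool size */

typedef struct ToyBuffer
{
	unsigned	blkno;			/* block currently held by this buffer */
	bool		valid;			/* does the buffer hold a block at all? */
	int			pins;			/* pin count; pinned buffers are not replaced */
} ToyBuffer;

static ToyBuffer pool[NBUF];	/* zero-initialized: all buffers invalid */

/* Full lookup by block number (the "expensive" path). */
static int
toy_lookup(unsigned blkno)
{
	for (int i = 0; i < NBUF; i++)
		if (pool[i].valid && pool[i].blkno == blkno)
			return i;
	return -1;
}

/* Load a block into a buffer, replacing its contents unless it is pinned. */
static bool
toy_load(int buf_id, unsigned blkno)
{
	if (pool[buf_id].pins > 0)
		return false;
	pool[buf_id].blkno = blkno;
	pool[buf_id].valid = true;
	return true;
}

/*
 * The "cheaper, but still not cheap" path: check that a remembered buffer
 * id still holds the expected block, and pin it only if it does.
 */
static bool
toy_revalidate_and_pin(int buf_id, unsigned blkno)
{
	if (buf_id < 0 || buf_id >= NBUF)
		return false;
	if (!pool[buf_id].valid || pool[buf_id].blkno != blkno)
		return false;			/* replaced since readahead: must re-read */
	pool[buf_id].pins++;
	return true;
}
```

A block that was read ahead without a pin can be replaced before use, in
which case revalidation fails and the caller pays for a second lookup or
read; a pinned buffer cannot be replaced in this model.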

^ permalink  raw  reply  [nested|flat] 25+ messages in thread

end of thread, other threads:[~2024-02-15 05:03 UTC | newest]

Thread overview: 25+ messages
2023-12-20 19:09 Re: index prefetching Robert Haas <[email protected]>
2023-12-21 12:30 ` Tomas Vondra <[email protected]>
2023-12-21 13:43   ` Andres Freund <[email protected]>
2023-12-21 15:20     ` Tomas Vondra <[email protected]>
2023-12-21 15:43       ` Andres Freund <[email protected]>
2024-01-04 14:55         ` Tomas Vondra <[email protected]>
2024-01-09 20:31           ` Robert Haas <[email protected]>
2024-01-12 16:42             ` Tomas Vondra <[email protected]>
2024-01-12 16:52               ` Robert Haas <[email protected]>
2024-01-19 21:43               ` Melanie Plageman <[email protected]>
2024-01-22 04:53                 ` Peter Smith <[email protected]>
2024-01-23 17:43                 ` Tomas Vondra <[email protected]>
2024-01-24 00:51                   ` Melanie Plageman <[email protected]>
2024-01-24 09:19                     ` Tomas Vondra <[email protected]>
2024-01-24 20:20                       ` Melanie Plageman <[email protected]>
2024-02-07 21:48                         ` Melanie Plageman <[email protected]>
2024-02-13 19:00                           ` Tomas Vondra <[email protected]>
2024-02-13 19:54                             ` Peter Geoghegan <[email protected]>
2024-02-14 16:40                             ` Melanie Plageman <[email protected]>
2024-02-14 19:21                               ` Peter Geoghegan <[email protected]>
2024-02-14 21:02                               ` Melanie Plageman <[email protected]>
2024-02-14 07:10                           ` Robert Haas <[email protected]>
2024-02-14 14:13                             ` Tomas Vondra <[email protected]>
2024-02-15 04:29                               ` Robert Haas <[email protected]>
2024-02-15 05:03                                 ` Andres Freund <[email protected]>
