Message-ID: <4867452a-b853-4813-a6da-9bb06a336f8b@enterprisedb.com>
Date: Wed, 24 Jan 2024 10:19:44 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: index prefetching
To: Melanie Plageman <melanieplageman@gmail.com>
Cc: Robert Haas <robertmhaas@gmail.com>, Andres Freund <andres@anarazel.de>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>
References: <8ec36f51-b863-60e3-20e2-b9c981c5ce5e@enterprisedb.com>
 <367160ea-b1ed-4481-e804-bca509128878@enterprisedb.com>
 <cd63c8bc-3d3c-00ab-f7eb-fa55969c3234@enterprisedb.com>
 <CA+Tgmoaf0EHzcfWRydntzqhfc1Nv96C-60fEBbdf1pYZmT0SLA@mail.gmail.com>
 <280dc83c-a16f-4424-1319-95e7e3f798bd@enterprisedb.com>
 <CA+Tgmob5Nh7sQO8GENzEDSjYJQuKCgfhWtwDLjMtdueFMCh05A@mail.gmail.com>
 <98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com>
 <20231221134314.wf2rs62d37u62j7t@alap3.anarazel.de>
 <afd17f3c-d496-5583-e93a-5a162d37235f@enterprisedb.com>
 <20231221154352.ijtg6wloa3nowivh@alap3.anarazel.de>
 <e00dac57-0181-4d9c-9eae-71346383e69b@enterprisedb.com>
 <CA+TgmoacmxNQBZsgMPSn9b7oE6QQLOqr_EP=YZTzdfHrqa5dqw@mail.gmail.com>
 <482ec3ff-52ad-415d-96fd-f3832a894023@enterprisedb.com>
 <CAAKRu_ZPDhNwwFxQwS8NdeTFkycM1c=tNLKdU0J-M6KxCjdEmQ@mail.gmail.com>
 <56176b8d-956c-487e-ab09-310db4581c07@enterprisedb.com>
 <CAAKRu_ad8cywU4_X+5e4A9Wy_PZmrCx2aoLpdHPBRvb9inGDgQ@mail.gmail.com>
Content-Language: en-US
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
In-Reply-To: 
 <CAAKRu_ad8cywU4_X+5e4A9Wy_PZmrCx2aoLpdHPBRvb9inGDgQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Archived-At: 
 <https://www.postgresql.org/message-id/4867452a-b853-4813-a6da-9bb06a336f8b%40enterprisedb.com>
Precedence: bulk

On 1/24/24 01:51, Melanie Plageman wrote:
> On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 1/19/24 22:43, Melanie Plageman wrote:
>>
>>> We fill a queue with blocks from TIDs that we fetched from the index.
>>> The queue is saved in a scan descriptor that is made available to the
>>> streaming read callback. Once the queue is full, we invoke the table
>>> AM specific index_fetch_tuple() function which calls
>>> pg_streaming_read_buffer_get_next(). When the streaming read API
>>> invokes the callback we registered, it simply dequeues a block number
>>> for prefetching.
>>
>> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
>> is being filled from IndexNext, and then the queue in StreamingRead.
> 
> I've changed the name from blk_queue to tid_queue to fix the issue you
> mention in your later remarks.
> I suppose there are two queues. The tid_queue is just to pass the
> block requests to the streaming read API. The prefetch distance will
> be the smaller of the two sizes.
> 

FWIW I think the two queues are a nice / elegant approach. In hindsight
my problems with trying to utilize the StreamingRead were due to trying
to use the block-oriented API directly from places that work with TIDs,
and this just makes that go away.

I wonder what the overhead of shuffling stuff between queues will be,
but hopefully not too high (that's my assumption).

>>> The only change to the streaming read API is that now, even if the
>>> callback returns InvalidBlockNumber, we may not be finished, so make
>>> it resumable.
>>
>> Hmm, not sure when can the callback return InvalidBlockNumber before
>> reaching the end. Perhaps for the first index_fetch_heap call? Any
>> reason not to fill the blk_queue before calling index_fetch_heap?
> 
> The callback will return InvalidBlockNumber whenever the queue is
> empty. Let's say your queue size is 5 and your effective prefetch
> distance is 10 (some combination of the PgStreamingReadRange sizes and
> PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
> the callback returns InvalidBlockNumber. Then the tid_queue is filled
> with 5 tids. Then index_fetch_heap() is called.
> pg_streaming_read_look_ahead() will prefetch all 5 of these TID's
> blocks, emptying the queue. Once all 5 have been dequeued, the
> callback will return InvalidBlockNumber.
> pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
> a buffer and save the associated TID in the per_buffer_data. Before
> index_fetch_heap() is called again, we will see that the queue is not
> full and fill it up again with 5 TIDs. So, the callback will return
> InvalidBlockNumber 3 times in this scenario.
> 

Thanks for the explanation. Yes, I didn't realize that the queues may be
of different length, at which point it makes sense to return invalid
block to signal the TID queue is empty.

>>> Structurally, this changes the timing of when the heap blocks are
>>> prefetched. Your code would get a tid from the index and then prefetch
>>> the heap block -- doing this until it filled a queue that had the
>>> actual tids saved in it. With my approach and the streaming read API,
>>> you fetch tids from the index until you've filled up a queue of block
>>> numbers. Then the streaming read API will prefetch those heap blocks.
>>
>> And is that a good/desirable change? I'm not saying it's not, but maybe
>> we should not be filling either queue in one go - we don't want to
>> overload the prefetching.
> 
> We can focus on the prefetch distance algorithm maintained in the
> streaming read API and then make sure that the tid_queue is larger
> than the desired prefetch distance maintained by the streaming read
> API.
> 

Agreed. I think I wasn't quite right when concerned about "overloading"
the prefetch, because that depends entirely on the StreamingRead API
queue. A lage TID queue can't cause overload of anything.

What could happen is a TID queue being too small, so the prefetch can't
hit the target distance. But that can happen already, e.g. indexes that
are correlated and/or index-only scans with all-visible pages.

>>> There are also table AM layering violations in my sketch which would
>>> have to be worked out (not to mention some resource leakage I didn't
>>> bother investigating [which causes it to fail tests]).
>>>
>>> 0001 is all of Thomas' streaming read API code that isn't yet in
>>> master and 0002 is my rough sketch of index prefetching using the
>>> streaming read API
>>>
>>> There are also numerous optimizations that your index prefetching
>>> patch set does that would need to be added in some way. I haven't
>>> thought much about it yet. I wanted to see what you thought of this
>>> approach first. Basically, is it workable?
>>
>> It seems workable, yes. I'm not sure it's much simpler than my patch
>> (considering a lot of the code is in the optimizations, which are
>> missing from this patch).
>>
>> I think the question is where should the optimizations happen. I suppose
>> some of them might/should happen in the StreamingRead API itself - like
>> the detection of sequential patterns, recently prefetched blocks, ...
> 
> So, the streaming read API does detection of sequential patterns and
> not prefetching things that are in shared buffers. It doesn't handle
> avoiding prefetching recently prefetched blocks yet AFAIK. But I
> daresay this would be relevant for other streaming read users and
> could certainly be implemented there.
> 

Yes, the "recently prefetched stuff" cache seems like a fairly natural
complement to the pattern detection and shared-buffers check.

FWIW I wonder if we should make some of this customizable, so that
systems with customized storage (e.g. neon or with direct I/O) can e.g.
disable some of these checks. Or replace them with their version.

>> But I'm not sure what to do about optimizations that are more specific
>> to the access path. Consider for example the index-only scans. We don't
>> want to prefetch all the pages, we need to inspect the VM and prefetch
>> just the not-all-visible ones. And then pass the info to the index scan,
>> so that it does not need to check the VM again. It's not clear to me how
>> to do this with this approach.
> 
> Yea, this is an issue I'll need to think about. To really spell out
> the problem: the callback dequeues a TID from the tid_queue and looks
> up its block in the VM. It's all visible. So, it shouldn't return that
> block to the streaming read API to fetch from the heap because it
> doesn't need to be read. But, where does the callback put the TID so
> that the caller can get it? I'm going to think more about this.
> 

Yes, that's the problem for index-only scans. I'd generalize it so that
it's about the callback being able to (a) decide if it needs to read the
heap page, and (b) store some custom info for the TID.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company