Message-ID: <c332fe83-c187-48ea-8520-458d4738f3a1@enterprisedb.com>
Date: Wed, 14 Feb 2024 14:34:40 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: index prefetching
To: Peter Geoghegan <pg@bowt.ie>
Cc: Melanie Plageman <melanieplageman@gmail.com>,
 Robert Haas <robertmhaas@gmail.com>, Andres Freund <andres@anarazel.de>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>, Thomas Munro <thomas.munro@gmail.com>,
 Konstantin Knizhnik <knizhnik@garret.ru>, Dilip Kumar <dilipbalaut@gmail.com>
References: <8ec36f51-b863-60e3-20e2-b9c981c5ce5e@enterprisedb.com>
 <98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com>
 <20231221134314.wf2rs62d37u62j7t@alap3.anarazel.de>
 <afd17f3c-d496-5583-e93a-5a162d37235f@enterprisedb.com>
 <20231221154352.ijtg6wloa3nowivh@alap3.anarazel.de>
 <e00dac57-0181-4d9c-9eae-71346383e69b@enterprisedb.com>
 <CA+TgmoacmxNQBZsgMPSn9b7oE6QQLOqr_EP=YZTzdfHrqa5dqw@mail.gmail.com>
 <482ec3ff-52ad-415d-96fd-f3832a894023@enterprisedb.com>
 <CAAKRu_ZPDhNwwFxQwS8NdeTFkycM1c=tNLKdU0J-M6KxCjdEmQ@mail.gmail.com>
 <56176b8d-956c-487e-ab09-310db4581c07@enterprisedb.com>
 <CAAKRu_ad8cywU4_X+5e4A9Wy_PZmrCx2aoLpdHPBRvb9inGDgQ@mail.gmail.com>
 <4867452a-b853-4813-a6da-9bb06a336f8b@enterprisedb.com>
 <CAAKRu_Y=go4u99udRaj3k0vuF6EAfWqM86BONhxB4MV9X4FRpQ@mail.gmail.com>
 <CAAKRu_ZPuaO5XwXfM7wowf4sSmPSJ2LT9+zfmD5=LQ=WhV_j=Q@mail.gmail.com>
 <4f5f16ef-df1e-4e09-9b3f-2e0961ab5117@enterprisedb.com>
 <CAH2-WznsJqDgr_0yUwApgYXi3cRZQbimFkiYRqqXhpMcw4s8ZQ@mail.gmail.com>
Content-Language: en-US
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
In-Reply-To: <CAH2-WznsJqDgr_0yUwApgYXi3cRZQbimFkiYRqqXhpMcw4s8ZQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Archived-At: <https://www.postgresql.org/message-id/c332fe83-c187-48ea-8520-458d4738f3a1%40enterprisedb.com>
Precedence: bulk

On 2/13/24 20:54, Peter Geoghegan wrote:
> On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 2/7/24 22:48, Melanie Plageman wrote:
>> I admit I haven't thought about kill_prior_tuple until you pointed out.
>> Yeah, prefetching separates (de-synchronizes) the two scans (index and
>> heap) in a way that prevents this optimization. Or at least makes it
>> much more complex :-(
> 
> Another thing that argues against doing this is that we might not need
> to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> We could end up scanning a whole extra leaf page (including all of its
> tuples) for want of the ability to "push down" a LIMIT to the index AM
> (that's not what happens right now, but it isn't really needed at all
> right now).
> 

I'm not quite sure I understand what is "this" that you argue against.
Are you saying we should not separate the two scans? If yes, is there a
better way to do this?

The LIMIT problem is not very clear to me either. Yes, if we get close
to the end of the leaf page, we may need to visit the next leaf page.
But that's kinda the whole point of prefetching - reading stuff ahead,
and reading too far ahead is an inherent risk. Isn't that a problem we
have even without LIMIT? The prefetch distance ramp up is meant to limit
the impact.

> This property of index scans is fundamental to how index scans work.
> Pinning an index page as an interlock against concurrently TID
> recycling by VACUUM is directly described by the index API docs [1],
> even (the docs actually use terms like "buffer pin" rather than
> something more abstract sounding). I don't think that anything
> affecting that behavior should be considered an implementation detail
> of the nbtree index AM as such (nor any particular index AM).
> 

Good point.

> I think that it makes sense to put the index AM in control here --
> that almost follows from what I said about the index AM API. The index
> AM already needs to be in control, in about the same way, to deal with
> kill_prior_tuple (plus it helps with the  LIMIT issue I described).
> 

In control how? What would be the control flow - what part would be
managed by the index AM?

I initially did the prefetching entirely in each index AM, but it was
suggested doing this in the executor would be better. So I gradually
moved it to executor. But the idea to combine this with the streaming
read API seems as a move from executor back to the lower levels ... and
now you're suggesting to make the index AM responsible for this again.

I'm not saying any of those layering options is wrong, but it's not
clear to me which is the right one.

> There doesn't necessarily need to be much code duplication to make
> that work. Offhand I suspect it would be kind of similar to how
> deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets
> by with generic logic implemented by
> index_compute_xid_horizon_for_tuples -- that's all that we need to
> determine a snapshotConflictHorizon value for recovery conflict
> purposes. Note that index_compute_xid_horizon_for_tuples() reads
> *index* pages, despite not being aware of the caller's index AM and
> index tuple format.
> 
> (The only reason why nbtree needs a custom solution is because it has
> posting list tuples to worry about, unlike GiST and unlike Hash, which
> consistently use unadorned generic IndexTuple structs with heap TID
> represented in the standard/generic way only. While these concepts
> probably all originated in nbtree, they're still not nbtree
> implementation details.)
> 

I haven't looked at the details, but I agree the LP_DEAD deletion seems
like a sensible inspiration.

>>> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
>>> there is an easier way to fix this, as I don't think the mvcc test
>>> failed on Tomas' version.
>>>
>>
>> I kinda doubt it worked correctly, considering I simply ignored the
>> optimization. It's far more likely it just worked by luck.
> 
> The test that did fail will have only revealed that the
> kill_prior_tuple wasn't operating as  expected -- which isn't the same
> thing as giving wrong answers.
> 

Possible. But AFAIK it did fail for Melanie, and I don't have a very
good explanation for the difference in behavior.

> Note that there are various ways that concurrent TID recycling might
> prevent _bt_killitems() from setting LP_DEAD bits. It's totally
> unsurprising that breaking kill_prior_tuple in some way could be
> missed. Andres wrote the MVCC test in question precisely because
> certain aspects of kill_prior_tuple were broken for months without
> anybody noticing.
> 
> [1] https://www.postgresql.org/docs/devel/index-locking.html

Yeah. There's clearly plenty of space for subtle issues.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company