MIME-Version: 1.0
References: <CANtu0oiktqQ2pwExoXqDpByXNCJa-KE5vQRodTRnmFHN_+qwHg@mail.gmail.com>
 <CANtu0ohU2XRV9shtu14CffLPDS1x10q7ebOGf-vX0p+45_L8jw@mail.gmail.com> <CANtu0oh0tspW-xWzDGWP9ehz96KPt9aUP1c9JYhdBYxKsB0jpA@mail.gmail.com>
In-Reply-To: <CANtu0oh0tspW-xWzDGWP9ehz96KPt9aUP1c9JYhdBYxKsB0jpA@mail.gmail.com>
From: Michail Nikolaev <michail.nikolaev@gmail.com>
Date: Wed, 31 Jul 2024 22:57:00 +0200
Message-ID: <CANtu0ohUB9ky45iiMAYN1fGyt82+cg=+UYBom=P7drb+=97G9w@mail.gmail.com>
Subject: Re: [BUG?] check_exclusion_or_unique_constraint false negative
To: PostgreSQL Hackers <pgsql-hackers@postgresql.org>, Andres Freund <andres@anarazel.de>, 
	Amit Kapila <amit.kapila16@gmail.com>
Content-Type: multipart/alternative; boundary="00000000000011037a061e915a68"
Archived-At: <https://www.postgresql.org/message-id/CANtu0ohUB9ky45iiMAYN1fGyt82%2Bcg%3D%2BUYBom%3DP7drb%2B%3D97G9w%40mail.gmail.com>
Precedence: bulk

--00000000000011037a061e915a68
Content-Type: text/plain; charset="UTF-8"

It seems like I've identified the cause of the issue.

Currently, any DirtySnapshot (or SnapshotSelf) scan over a B-tree index may
skip (not find the TID for) some records when they are updated
concurrently.

The following scenario is possible:

* Session 1 reads a B-tree page using SnapshotDirty and copies item X to
its local buffer.
* Session 2 updates item X, inserting a new TID Y into the same page.
* Session 2 commits its transaction.
* Session 1 starts fetching from the heap and tries to fetch X, but X was
already deleted by session 2. So, it goes back to the B-tree for the next
TID.
* The B-tree scan moves on to the next page, skipping Y, which was added
after the page was copied.
* Therefore, the search finds nothing, even though tuple Y is alive.

This behavior is somewhat counterintuitive. One would expect a
DirtySnapshot to show more (or more recent, even uncommitted) data than an
MVCC snapshot, but never less. Yet, as far as I understand, a DirtySnapshot
scan over a B-tree does not guarantee that.
Why does it work for MVCC? Because tuple X remains visible to the scan's
snapshot, so the scan never needs to find Y.
This might be "as designed," but I think it needs to be clearly documented
(I couldn't find any documentation on this particular case, only comments
related to _bt_drop_lock_and_maybe_pin).
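
To make the scenario concrete, here is a toy Python model of it. This is
only a sketch and not PostgreSQL code: the names HeapTuple, visible,
index_scan and run_scan, the transaction IDs 100 and 200, and the
simplified visibility rule (aborts and in-progress deleters are ignored)
are all assumptions made for illustration. The callback stands in for
session 2 acting right after session 1 has copied the page items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeapTuple:
    xmin: int                    # transaction that inserted the tuple
    xmax: Optional[int] = None   # transaction that deleted/updated it, if any


def visible(tup, committed, sees_uncommitted):
    """Simplified visibility check against a set of committed xids."""
    inserted = sees_uncommitted or tup.xmin in committed
    deleted = tup.xmax is not None and tup.xmax in committed
    return inserted and not deleted


def index_scan(pages, heap, is_visible, after_page_read):
    """Return the first visible TID, reading one leaf page at a time."""
    for page_no in range(len(pages)):
        items = list(pages[page_no])   # copy the page's TIDs to a local buffer
        after_page_read(page_no)       # concurrent activity happens here
        for tid in items:              # heap fetches use the copied items only
            if is_visible(heap[tid]):
                return tid
    return None


def run_scan(kind):
    committed = {100}                      # xid 100 inserted X and committed
    heap = {"X": HeapTuple(xmin=100)}
    pages = [["X"], []]                    # two leaf pages for the same key

    def session2_update(page_no):
        if page_no != 0:
            return
        heap["X"].xmax = 200               # old version is deleted by xid 200
        heap["Y"] = HeapTuple(xmin=200)    # new version
        pages[0].append("Y")               # lands on the page already copied
        committed.add(200)                 # session 2 commits

    if kind == "dirty":
        # Dirty-style check: evaluated against the live commit state.
        is_visible = lambda t: visible(t, committed, True)
    else:
        # MVCC-style check: commit state frozen when the scan started.
        snapshot = frozenset(committed)
        is_visible = lambda t: visible(t, snapshot, False)
    return index_scan(pages, heap, is_visible, session2_update)


if __name__ == "__main__":
    print("dirty:", run_scan("dirty"))
    print("mvcc: ", run_scan("mvcc"))
```

In this model the dirty-style scan returns None even though Y is live at
that point, while the MVCC-style scan returns X, because X is still
visible to the snapshot taken before session 2 committed.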

Here are the potential consequences of the issue:

* check_exclusion_or_unique_constraint

It may not find a record in a UNIQUE index during INSERT ... ON CONFLICT
DO UPDATE. However, this is just a minor performance issue.

* Exclusion constraints with B-tree, like ADD CONSTRAINT exclusion_data
EXCLUDE USING btree (data WITH =)

It should work correctly because the first inserter may "skip" the TID from
a concurrent inserter, but the second one should still find the TID from
the first.

* RelationFindReplTupleByIndex

Amit, this is why I've included you in this previously solo thread :)
RelationFindReplTupleByIndex uses DirtySnapshot and may not find some
records if they are updated by a concurrent transaction. This could lead to
lost deletes/updates, especially in the case of streaming=parallel mode.
I'm not familiar with how parallel apply workers apply transactions, so
maybe this isn't possible.

Best regards,
Mikhail

--00000000000011037a061e915a68--