Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sZGOr-00HP62-O1 for pgsql-hackers@arkaria.postgresql.org; Wed, 31 Jul 2024 20:58:21 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sZGOq-00DlAK-9s for pgsql-hackers@arkaria.postgresql.org; Wed, 31 Jul 2024 20:58:20 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sZGOp-00Dl9w-S4 for pgsql-hackers@lists.postgresql.org; Wed, 31 Jul 2024 20:58:20 +0000 Received: from mail-ej1-x635.google.com ([2a00:1450:4864:20::635]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sZGOj-002RBe-3R for pgsql-hackers@postgresql.org; Wed, 31 Jul 2024 20:58:18 +0000 Received: by mail-ej1-x635.google.com with SMTP id a640c23a62f3a-a7aac70e30dso768537266b.1 for ; Wed, 31 Jul 2024 13:58:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1722459491; x=1723064291; darn=postgresql.org; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=mbS01EZJfh7BKoMV2No4HCawhOrkcCgFDdrvnZI1oNQ=; b=enFvMyfVdOim45uPscUBVZdJazxXB6PMkw5FX58SeG+2chFW3yOhlbfyyJttfeqshT NFzJVv17VO8XadRUjMtWm0QFqyw01Tf1b1CzKRdrUf5KdGb/rkrP7o1vjEcW+ToocXCa Z9IJnFtuVxCJmw5HENd1SoG41Hgy39p2IbysPHRQp+lmQ7fEqE4EZnzAz0sIL8u9A2S3 GV7uoDKiWmht2NQRlV/ljbzfLf5LRMXauUUfozTV4ast3wcUE8l3SeFGCaO3OxUU35jQ 5r+viH+bHiKhcBJLqj6AYOD/tSGBOVoIyAIc92aLMa8jyvZbl1YJLQ2jfU9OtkN5A1lG +pRA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722459491; x=1723064291; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=mbS01EZJfh7BKoMV2No4HCawhOrkcCgFDdrvnZI1oNQ=; b=cuMwDBMhmxCw8sjpFVkwGXJKcZkL2fc1Ap43KeXpQyUNbdEsiRocd9DGx50+xEoakD FuHSWnCg0uzjL8G9mA6Y2dCEKmSIQlJA1G22GU+H4SLwCDJg3BvNsXvzqlR2iY2dy3Pr JUYhdKaF6EizWJ7+QLmzwWZULZDC5zqrrl7r9BmlwTnU8MXX0O/EMV5Tb/ujPzjvdqTS FuWA+N/fBd00KLXIhLBj6ju8p8hKXkNz1Lsq3nVJKt65Xq48aMxkMqWXno/xJU/A2NI8 glK5wiBOtawmcze3spyV4PgMlBvN9bFIX2Qxq70VmOwSLPO7wZ6NkULjHXnJFDGorB/i /+aQ== X-Gm-Message-State: AOJu0Yy2j44+CRACvuwTp/6v2LULa3xGOIx60wRaYaEjmMLBNbLIr2aO t6nLiHwgVydPqy9nnn07hAKt7HQddcvPza5C8hOIFnyx0V3AD3nLD8NvyGp8Mhxe7Yjacnz7cXb 6Rw/h2+vdxMOK4fIydJ4GmfgKh94rzA== X-Google-Smtp-Source: AGHT+IEsrdgMV6XJUqfW+WVLLlEDtohIluLR1EklENLrPcXQ1keOew0Rp+QvNjpRcQpMHYf9ef5LlVGRwPyljXWn21g= X-Received: by 2002:a17:907:7da0:b0:a6f:c9c6:7219 with SMTP id a640c23a62f3a-a7daf6564fdmr27578366b.47.1722459490416; Wed, 31 Jul 2024 13:58:10 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Michail Nikolaev Date: Wed, 31 Jul 2024 22:57:00 +0200 Message-ID: Subject: Re: [BUG?] check_exclusion_or_unique_constraint false negative To: PostgreSQL Hackers , Andres Freund , Amit Kapila Content-Type: multipart/alternative; boundary="00000000000011037a061e915a68" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000011037a061e915a68 Content-Type: text/plain; charset="UTF-8" It seems like I've identified the cause of the issue. Currently, any DirtySnapshot (or SnapshotSelf) scan over a B-tree index may skip (not find the TID for) some records in the case of parallel updates. The following scenario is possible: * Session 1 reads a B-tree page using SnapshotDirty and copies item X to the buffer. * Session 2 updates item X, inserting a new TID Y into the same page. * Session 2 commits its transaction. * Session 1 starts to fetch from the heap and tries to fetch X, but it was already deleted by session 2. So, it goes to the B-tree for the next TID. * The B-tree goes to the next page, skipping Y. * Therefore, the search finds nothing, but tuple Y is still alive. This situation is somewhat controversial. DirtySnapshot might seem to show more (or more recent, even uncommitted) data than MVCC, but not less. So, DirtySnapshot scan over a B-tree does not provide any guarantees, as far as I understand. Why does it work for MVCC? Because tuple X will be visible due to the snapshot, making Y unnecessary. This might be "as designed," but I think it needs to be clearly documented (I couldn't find any documentation on this particular case, only _bt_drop_lock_and_maybe_pin - related). Here are the potential consequences of the issue: * check_exclusion_or_unique_constraint It may not find a record in a UNIQUE index during INSERT ON CONFLICT UPDATE. However, this is just a minor performance issue. * Exclusion constraints with B-tree, like ADD CONSTRAINT exclusion_data EXCLUDE USING btree (data WITH =) It should work correctly because the first inserter may "skip" the TID from a concurrent inserter, but the second one should still find the TID from the first. * RelationFindReplTupleByIndex Amit, this is why I've included you in this previously solo thread :) RelationFindReplTupleByIndex uses DirtySnapshot and may not find some records if they are updated by a parallel transaction. This could lead to lost deletes/updates, especially in the case of streaming=parallel mode. I'm not familiar with how parallel workers apply transactions, so maybe this isn't possible. Best regards, Mikhail > --00000000000011037a061e915a68 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
It seems like I've identified the cau= se of the issue.

Currently, any DirtySnapshot (or SnapshotSelf) scan= over a B-tree index may skip (not find the TID for) some records in the ca= se of parallel updates.

The following scenario is possible:

*= Session 1 reads a B-tree page using SnapshotDirty and copies item X to the= buffer.
* Session 2 updates item X, inserting a new TID Y into the same= page.
* Session 2 commits its transaction.
* Session 1 starts to fet= ch from the heap and tries to fetch X, but it was already deleted by sessio= n 2. So, it goes to the B-tree for the next TID.
* The B-tree goes to th= e next page, skipping Y.
* Therefore, the search finds nothing, but tupl= e Y is still alive.

This situation is somewhat controversial. DirtyS= napshot might seem to show more (or more recent, even uncommitted) data tha= n MVCC, but not less. So, DirtySnapshot scan over a B-tree does not provide= any guarantees, as far as I understand.
Why does it work for MVCC? Beca= use tuple X will be visible due to the snapshot, making Y unnecessary.
T= his might be "as designed," but I think it needs to be clearly do= cumented (I couldn't find any documentation on this particular case, on= ly _bt_drop_lock_and_maybe_pin - related).

Here are the potential co= nsequences of the issue:

* check_exclusion_or_unique_constraint
<= br>It may not find a record in a UNIQUE index during INSERT ON CONFLICT UPD= ATE. However, this is just a minor performance issue.

* Exclusion co= nstraints with B-tree, like ADD CONSTRAINT exclusion_data EXCLUDE USING btr= ee (data WITH =3D)

It should work correctly because the first insert= er may "skip" the TID from a concurrent inserter, but the second = one should still find the TID from the first.

* RelationFindReplTupl= eByIndex

Amit, this is why I've included you in this previously = solo thread :)
RelationFindReplTupleByIndex uses DirtySnapshot and may n= ot find some records if they are updated by a parallel transaction. This co= uld lead to lost deletes/updates, especially in the case of streaming=3Dpar= allel mode.
I'm not familiar with how parallel workers apply transa= ctions, so maybe this isn't possible.

Best regards,
Mikhail
--00000000000011037a061e915a68--