Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uZYBa-0054db-3b for pgsql-general@arkaria.postgresql.org; Wed, 09 Jul 2025 17:02:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uZYBY-001fAx-60 for pgsql-general@arkaria.postgresql.org; Wed, 09 Jul 2025 17:02:20 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uZYBX-001fAQ-Ht for pgsql-general@lists.postgresql.org; Wed, 09 Jul 2025 17:02:20 +0000 Received: from mail-wm1-x331.google.com ([2a00:1450:4864:20::331]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uZYBU-006Qrt-1J for pgsql-general@lists.postgresql.org; Wed, 09 Jul 2025 17:02:19 +0000 Received: by mail-wm1-x331.google.com with SMTP id 5b1f17b1804b1-451d7b50815so674285e9.2 for ; Wed, 09 Jul 2025 10:02:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=element.io; s=google; t=1752080534; x=1752685334; darn=lists.postgresql.org; h=content-language:in-reply-to:references:to:from:subject:user-agent :mime-version:date:message-id:from:to:cc:subject:date:message-id :reply-to; bh=FGuOhzPBgnkdcT7YPTwcJB6kZOhOiuQD33R41VX4SAk=; b=pDOWh0hyXA4MjMermnQOPfQY5CVlOImdf2hfepKJTH8qspJ6uVanqp1V6tWwxgEPYj IsDs0p+v+BjwP9SX2DBAY46D5NvnwEyRul5lY34iNwvRqiWrjgZYeE0SM18dNd9q8tUR 9XFOsa1sSHJN7WM0J/fY8r4k0JRjsTHLd9sbSAOOmb0NnNE46dRVk2aTXs3p+kiqnT19 OYzI1mtS0JpHARaaRdmwJWLrUHnvdTw9CZhOok+ajLMwIMCjJQ+e/7rFKwk4idmB2Kmt pCD8y++MgpjBRKHb7Qku4+5qy+R/TuoUo7epqdtJsjUME8Mr+5iZgtvJ1U9tCXf0a7J9 KEpg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1752080534; x=1752685334; h=content-language:in-reply-to:references:to:from:subject:user-agent :mime-version:date:message-id:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=FGuOhzPBgnkdcT7YPTwcJB6kZOhOiuQD33R41VX4SAk=; b=ddtmisgKpeIgETRRXVLBYKXen7LHs3t8EuypC1o0Pd/jj8C01WgNJEjE6sa+hRZqmw k+QzYHgGwE7/RjGw3XZupsOwUXPEjTIvh6MJmtZHy/tWudA2fkWJNp44dHqFDQIisQrn RGkRCd9jd9EqkyLOizqaaGlAxX+X2sMpZ3dKgP6jwhKSjimn1zw1oTE8g0ZCS44epof5 3uDs4VSu0/guWQuDvo0+HHjyGePjHDlmAgUwEE3Eec/lJo5P+79c6WlymRjN9qEAhyQg nO9mb9KBPkkpBztjINfdLoTBkR3xzCSjDRTbf6SnaPW3r9l/dBrI6j9BWh6hKXbXgdJ7 YFDw== X-Gm-Message-State: AOJu0Yz65KRMTWFY51bHb0sJXL8R8lHTiJcSgIxZp6HinptFTRaLPqU+ zjGgJ9jrtN1MqZGBoTN0OvmTYfmqOaJEg3GBFrojAAnM4pw/7/KIVZQjKcXFLhRV3MBGtup20Yy mqhpqMojaXLCI0+gQCYQjOZyyIi0TvKPXYvBHmC0/I/ioD4fwEOR56P1phcvcygENKRt11/lnIX 4ZldMqDw== X-Gm-Gg: ASbGnctDSl2m1EpkTRwQ8KxwitKNTULqjkSUiO31uO4ctqwKRxJ7suzF3oJOety3K+9 r1Hd82Y8T5uveThfbzwQ2N/kQK1qGS4lj3O9bf2+sWXDmv5/PUXTERzoGxetp2D5fgWpBa8wUfG yCxQF1NvAi6lzbyTvFmfz8e+Po1iUnnasphqXqcbgDbAzsQRU5yvuO/0GbvNyymjckvCYaKachu UV2KdrPqPotR5rOWTj3SLyFXVd42+K/x7qXSdCgqNdp1TKnjCtcZC2wVfp5AH228W8pqwrKKu7r MO110/AtA+j+F66Eq7/2zNiyYExQpBSRMEqHnpoobFtMvjlJg7U+9fV8aZZx X-Google-Smtp-Source: AGHT+IELgr9arwdGZ1FrgYrpRj6eiFLrWJGj9cMub4H5wLA+epRZ7mPYHz145zUGJaRivshjSglivw== X-Received: by 2002:a05:600c:3585:b0:440:54ef:dfdc with SMTP id 5b1f17b1804b1-454db7f4ec6mr3408185e9.8.1752080533305; Wed, 09 Jul 2025 10:02:13 -0700 (PDT) Received: from [192.168.1.27] ([77.108.142.78]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-454d508e9d7sm31127415e9.36.2025.07.09.10.02.12 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 09 Jul 2025 10:02:12 -0700 (PDT) Message-ID: Date: Wed, 9 Jul 2025 18:02:12 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Corrupt btree index includes rows that don't match From: Erik Johnston To: pgsql-general@lists.postgresql.org References: <01df0ca4-cf32-448f-a2cb-e44a8e9f7fd6@element.io> In-Reply-To: Content-Type: multipart/alternative; boundary="------------HDPHShn0i7UOLnVkdHzELMet" Content-Language: en-GB List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------HDPHShn0i7UOLnVkdHzELMet Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: quoted-printable Hi again, Thanks very much for the replies last week. We=E2=80=99ve been continuing t= o=20 investigate this problem, and I thought I=E2=80=99d share an update on wher= e we are. To recap: the situation is that, looking at our backup from 2025-06-26=20 via pageinspect, we have btree index rows which point to either=20 non-existent heap TIDs, or to heap TIDs with data which does not=20 correspond to the index row. In fact it looks like we have entire index=20 pages which point only to non-existent heap TIDs. (I previously said that these index rows were marked as =E2=80=98dead=E2=80= =99 in the=20 backup. We now suspect this is an artifact of the restore process: we=20 believe they are live in the backup, but were marked as dead during the=20 restore.) Empirically, and surprisingly to us, when one does a SELECT from an=20 index entry that points to a non-existent TID, the index entry is=20 quietly ignored. We therefore suspect that this index corruption has been present for=20 some time (possibly years); more recently those non-existent heap TIDs=20 have been recycled, and that is when we have noticed the effects of the=20 problem. As far as we can tell, the corruption only affects one index on one=20 table, and only a specific region of that index/table. Specifically, it=20 only appears to affect rows which would have been inserted between 2018=20 and January 2021. At least 1B rows appear to be affected (the table as a=20 whole has 29B rows). One thing that surprised us is that =E2=80=98amcheck=E2=80=99 didn=E2=80=99= t find any sign of=20 the corruption. We=E2=80=99re not completely sure if this is because we are= =20 holding it wrong, or because it=E2=80=99s simply out of scope or unsupporte= d for=20 amcheck. Any advice on this, or suggestions for other tooling we could=20 use to check the consistency of our other indexes, would be much=20 appreciated. We=E2=80=99re still very interested in trying to understand the root cause = of=20 the corruption, mostly to confirm that it=E2=80=99s not an ongoing problem.= =20 Thanks Tom for the suggestion of=20 https://git.postgresql.org/gitweb/?p=3Dpostgresql.git&a=3Dcommitdiff&h=3D49= 34d3875.=20 We agree with your assessment that this is unlikely. For one thing, it=20 looks like that bug could only conceivably cause this corruption if it=20 affected an UPDATE query, and we=E2=80=99re reasonably sure we never do any= =20 UPDATE queries on that table. (The table is mostly append-only. We do=20 sometimes run cleanup/compression jobs which amount to large amounts of=20 interleaved DELETEs and INSERTs, but no UPDATEs.) Back in 2021, we were running Postgres 10.11. We=E2=80=99ve taken a pass th= rough=20 the release notes since then to see if we can find any likely-looking=20 bugs. We found the one that causes BRIN index corruption (this is not a=20 BRIN index), and the one that causes CREATE INDEX CONCURRENTLY to end up=20 with too *few* entries (this one has the opposite problem), but no=20 particularly likely candidate. Any other suggestions would be welcome here. At the moment, a historical hardware-level problem seems like it might=20 be the most likely culprit, though we are a bit mystified about how any=20 hardware failure could have caused such widespread damage to a single=20 index, whilst apparently leaving the rest of the database intact. Any thoughts or suggestions are very much appreciated. Thanks, Erik On 04/07/2025 15:59, Erik Johnston wrote: > > > On Fri, 4 Jul 2025, 15:38 Ron Johnson, wrote: > > On Fri, Jul 4, 2025 at 9:49=E2=80=AFAM Erik Johnston wrote: > > Hi, a quick update: > > - We have discovered that the corruption was present from > before libicu update. > - We ran `pg_amcheck --index state_groups_state_type_idx > --heapallindexed matrix`, which returned nothing > - We believe that means that (and matches what we see > sampling) the index has gained extra entries, i.e. that for a > given state group it does return all the relevant rows in the > table /plus/ extra rows. > > We are also seeing old state groups starting to point at rows > that have only just been inserted. For example, querying for > 353864583 on the primary it returns that row plus four rows > that have been inserted today, but on the backup from last > week an index only scan for 353864583 only returns one row. > This makes it feel like the corruption is ongoing? Nothing > should have modified that state group in the interim (they are > generally immutable). > > This naively feels like when inserting a new row we sometimes > add the row to the index twice: once pointing from the correct > state group to the new row, and once from an old state group > to the new row? > > > Are checksums enabled in=C2=A0the instance? > > > Alas not. > > We've also now found that the index on the backup does in fact point=20 > to those ctids after all, but they are marked as dead. So at some=20 > point between then and when we inserted the new row at that ctid today=20 > those entries were marked undead. --=20 Element Logo _Copyright =C2=A9 2023 Element - All rights reserved. The Element name, log= o=20 and device are registered trademarks of New Vector Ltd. Registered=20 number: 10873661. Registered in England and Wales. Registered address:=20 10 Queen Street Place, London, United Kingdom, EC4R 1AG. This message is intended for the addressee only and may contain private=20 and confidential information or material which may be privileged. If=20 this message has come to you in error please delete it immediately and=20 do not copy it or show it to any other person. --=20 Copyright =C2=A9 2025 Element - All rights reserved. The Element name, logo= =20 and device are registered trademarks of New Vector Ltd. Registered number:= =20 10873661. Registered in England and Wales. Registered address: 10 Queen=20 Street Place, London, United Kingdom, EC4R 1AG. This message is intended=20 for the addressee only and may contain private and confidential information= =20 or material which may be privileged. If this message has come to you in=20 error please delete it immediately and do not copy it or show it to any=20 other person. --------------HDPHShn0i7UOLnVkdHzELMet Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hi again,

Thanks very much for the replies last week. We=E2=80=99ve been contin= uing to investigate this problem, and I thought I=E2=80=99d share an updat= e on where we are.

To recap: the situation is that, looking at our backup from 2025-06-26 via pageinspect, we have btree index rows which point to either non-existent heap TIDs, or to heap TIDs with data which does not correspond to the index row. In fact it looks like we have entire index pages which point only to non-existent heap TIDs.

(I previously said that these index rows were marked as =E2=80=98dead= =E2=80=99 in the backup. We now suspect this is an artifact of the restore process: we believe they are live in the backup, but were marked as dead during the restore.)

Empirically, and surprisingly to us, when one does a SELECT from an index entry that points to a non-existent TID, the index entry is quietly ignored.

We therefore suspect that this index corruption has been present for some time (possibly years); more recently those non-existent heap TIDs have been recycled, and that is when we have noticed the effects of the problem.


As far as we can tell, the corruption only affects one index on one table, and only a specific region of that index/table. Specifically, it only appears to affect rows which would have been inserted between 2018 and January 2021. At least 1B rows appear to be affected (the table as a whole has 29B rows).

One thing that surprised us is that =E2=80=98amcheck=E2=80=99 didn=E2= =80=99t find any sign of the corruption. We=E2=80=99re not completely sure if this is becau= se we are holding it wrong, or because it=E2=80=99s simply out of scope or unsupported for amcheck. Any advice on this, or suggestions for other tooling we could use to check the consistency of our other indexes, would be much appreciated.

We=E2=80=99re still very interested in trying to understand the root = cause of the corruption, mostly to confirm that it=E2=80=99s not an ongoing problem. Thanks Tom for the suggestion of https://git.pos= tgresql.org/gitweb/?p=3Dpostgresql.git&a=3Dcommitdiff&h=3D4934d3875= . We agree with your assessment that this is unlikely. For one thing, it looks like that bug could only conceivably cause this corruption if it affected an UPDATE query, and we=E2=80=99re reasonab= ly sure we never do any UPDATE queries on that table. (The table is mostly append-only. We do sometimes run cleanup/compression jobs which amount to large amounts of interleaved DELETEs and INSERTs, but no UPDATEs.)

Back in 2021, we were running Postgres 10.11. We=E2=80=99ve taken a p= ass through the release notes since then to see if we can find any likely-looking bugs. We found the one that causes BRIN index corruption (this is not a BRIN index), and the one that causes CREATE INDEX CONCURRENTLY to end up with too *few* entries (this one has the opposite problem), but no particularly likely candidate. Any other suggestions would be welcome here.

At the moment, a historical hardware-level problem seems like it might be the most likely culprit, though we are a bit mystified about how any hardware failure could have caused such widespread damage to a single index, whilst apparently leaving the rest of the database intact.


Any thoughts or suggestions are very much appreciated.

Thanks,
Erik


On 04/07/2025 15:59, Erik Johnston wrote:


On Fri, 4 Jul 2025, 15:38 Ron Johnson, <ronl= johnsonjr@gmail.com> wrote:
On Fri, Jul 4, 2025 at 9:49=E2=80=AFAM Erik Johnston <= erikj@element.io> wrote:

Hi, a quick update:

- We have discovered that the corruption was present from before libicu update.
- We ran `pg_amcheck --index state_groups_state_type_idx --heapallindexed matrix`, which returned nothing
- We believe that means that (and matches what we see sampling) the index has gained extra entries, i.e. that for a given state group it does return all the relevant rows in the table plus extra rows.

We are also seeing old state groups starting to point at rows that have only just been inserted. For example, querying for 353864583 on the primary it returns that row plus four rows that have been inserted today, but on the backup from last week an index only scan for 353864583 only returns one row. This makes it feel like the corruption is ongoing? Nothing should have modified that state group in the interim (they are generally immutable).

This naively feels like when inserting a new row we sometimes add the row to the index twice: once pointing from the correct state group to the new row, and once from an old state group to the new row?


Are checksums enabled in=C2=A0the instance?

Alas not.=C2=A0

We've also now found that the index on the backup does in fact point to those ctids after all, but they are marked as dead. So at some point between then and when we inserted the new row at that ctid today those entries were marked undead.=C2=A0
--
3D"Element

Copyright =C2=A9 2023 Elem= ent - All rights reserved. The Element name, logo and device are registered trademarks of New Vector Ltd. Registered number: 10873661. Registered in England and Wales. Registered address: 10 Queen Street Place, London, United Kingdom, EC4R 1AG.

This message is intended for the addressee only and may contain private and confidential information or material which may be privileged. If this message has come to you in error please delete it immediately and do not copy it or show it to any other person.

Copyright =C2=A9 2025 Element - All rights reserved. The El= ement name, logo and device are registered trademarks of New Vector Ltd. Re= gistered number: 10873661. Registered in England and Wales. Registered addr= ess: 10 Queen Street Place, London, United Kingdom, EC4R 1AG.=

This message is intended for the addresse= e only and may contain private and confidential information or material whi= ch may be privileged. If this message has come to you in error please delet= e it immediately and do not copy it or show it to any other person.<= /span>

--------------HDPHShn0i7UOLnVkdHzELMet--