Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uXbsP-00C30C-VO for pgsql-general@arkaria.postgresql.org; Fri, 04 Jul 2025 08:34:34 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uXbsO-00HKGo-1R for pgsql-general@arkaria.postgresql.org; Fri, 04 Jul 2025 08:34:32 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uXNPj-00Emz8-4O for pgsql-general@lists.postgresql.org; Thu, 03 Jul 2025 17:07:59 +0000 Received: from mail-lf1-x135.google.com ([2a00:1450:4864:20::135]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uXNPh-005Xd9-0r for pgsql-general@lists.postgresql.org; Thu, 03 Jul 2025 17:07:59 +0000 Received: by mail-lf1-x135.google.com with SMTP id 2adb3069b0e04-553b60de463so148161e87.3 for ; Thu, 03 Jul 2025 10:07:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=element.io; s=google; t=1751562474; x=1752167274; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=2FwaFGl7Z2rgkbQtKmQNNz/KXnr628OTwcJ6wjJFRq0=; b=JBEBw7Lf6g8bcFvKaXeujDDvXBhPg8Bl0acmt1qRHv6h2LeIUemZJv/kfUo4vR4hDa zYIemkMuCmytgctyW1v7oXOnQe8hA4+6Wc+WZmrHNKagYm0eEbAmVZ6Hg0KA7vx8pU7b RSMrxNv3sCWxU87rcjEXZ8AFfBrYHkwAEDT3sNELt0W5vcFkRwBSS5oACnHBJEwywk+c jcWpyc4lI0YqSeniuQ+FCRka/wafc0siGDr8br/QfU8sf642BG6ANO6ZKeg/4sHOTJAR 4+ECX/3pnwIyfQnWw2IB1jblCQOwShHJ5zj/yUZcdZsE1U64FK0e9C6fc7rXODzz44/E irIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1751562474; x=1752167274; h=cc:to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=2FwaFGl7Z2rgkbQtKmQNNz/KXnr628OTwcJ6wjJFRq0=; b=vIV4sFdWgV/ZDTufm+fFBequeSz7ci5LLauiezWPrc8Sg4f9Rw/bojkdxeN+PxZg0o M8qwrvvcUN1VcDim46FuvSIS3D5EKkhY9SqKQOf1QW42xrbKBgaN69mRM9GlPYTlrAj3 vdR9wGyCFcIaYnK+H94B4sjqginaC9+LTwPKkAoDSnukWpbmYOtXJda9+9TmdNlaexQS ajCPiF+wPTlf8iPinr05kuUPHpWl0NgjOECcquGyBeqKr6595OZAb2Vb5evfqh4V6+zm WllxiNZTkRbhMEoQ1QwbcQZhZjChs7EU78BUvRVnJzBNPV+hr51nz7J85P0R9Cl9L38K 2ang== X-Gm-Message-State: AOJu0YyYgGt5xfn8bhEftN4u5iISFW60w8bi+Cntrowp7SoUrTpVL4NK ouIBaE2G7Rv/La5FMK2qcAykU4M+7CcgpK08xHFX7qvA9HEPXngLk7gqR0NI2wyDX3xNG7L5gy/ IJNFaTDKbasjcR0ISFue/IjuwKKnsqgOPFpUbcroIXFAPEXFn01Sg9gIwNmm3ZXHODp7DA0yO5D mbno4EdfEkjOKMRQVGZ0PY6D97oGSwQoTE2VRh+Ajp/Dn3jLnWz+ffxxtn X-Gm-Gg: ASbGnctE5LhOiM4uoc6uYJdwZ8k4dOlgLegQ+bPG+3qoxnqrZ/yrCH8YX8O6FwuY2ih Kbl1kv5q3Jw/qSmbFnkh2Fjl3UFSxFTfo7tO2Ieu2YF0IJfrAcAat/ZlrANeuG8/Lg8iFbt4fPR WyUrC8Nd+gAfa8teH3GhjiuiDff9eEswvtrbsR3AIOS2s6qw== X-Google-Smtp-Source: AGHT+IEau+dbo2OOrdVQNtE5isYjzSMC6KVwDtIkhYtGoSlX0R96jqYK7I3JvcvLqZ7nVzgIED5NDFHlf5tYZYTlOJA= X-Received: by 2002:a05:6512:b89:b0:554:f74b:78ae with SMTP id 2adb3069b0e04-5562eed59femr1597392e87.31.1751562474092; Thu, 03 Jul 2025 10:07:54 -0700 (PDT) MIME-Version: 1.0 From: Erik Johnston Date: Thu, 3 Jul 2025 18:07:43 +0100 X-Gm-Features: Ac12FXzjsz-BD1NOcFYycNfSoEl039iLXfVKL_jjULX6pr-BnP-HeZu7YmTI_z4 Message-ID: Subject: Corrupt btree index includes rows that don't match To: pgsql-general@lists.postgresql.org Cc: Oliver Wilkinson Content-Type: multipart/alternative; boundary="0000000000001216200639096b38" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000001216200639096b38 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello, We're looking into a problem with our application and have tracked it down to index corruption, whereby we have many index rows pointing to the wrong tuples in the heap. Our table looks like: ``` Table "matrix.state_groups_state" Column | Type | Collation | Nullable | Default -------------+--------+-----------+----------+--------- state_group | bigint | | | room_id | text | | | type | text | | | state_key | text | | | event_id | text | | | Indexes: "state_groups_state_room_id_idx" brin (room_id) WITH (pages_per_range=3D'1') "state_groups_state_type_idx" btree (state_group, type, state_key), tablespace "postgres_second" Triggers: check_state_groups_state_deletion_trigger AFTER DELETE ON state_groups_state DEFERRABLE INITIALLY DEFERRED FOR EACH ROW EXECUTE FUNCTION check_state_groups_state_deletion() ``` The symptoms we are noticing are that a DELETE or SELECT query includes rows that don't match the condition, as long as we issue a query that results in an Index Scan (not Index Only Scan): For example, including `ctid` in the query is enough to make the planner use an Index Scan: ``` SELECT ctid, state_group FROM state_groups_state WHERE state_group =3D 483128098; ctid | state_group ----------------+------------- (16669607,1) | 483128098 (424940858,20) | 963361875 (16669606,53) | 483128098 (3 rows) ``` But with an Index Only Scan: ``` SELECT state_group FROM state_groups_state WHERE state_group =3D 483128098; state_group ------------- 483128098 483128098 483128098 (3 rows) ``` Since including `ctid` in the SELECT columns causes the query to use an Index Scan (fetching tuples from the heap), this inconsistency leads us to believe that our index and heap disagree. Forcing a sequential scan with that same query only returns two rows matching that state group, which suggests that the index thinks there are more rows in the table than there actually are. (We do not believe anything can have deleted a row with state group 483128098). Also interestingly, querying (with the index re-enabled) for 963361875 returns the same row as returned above, so the row is in the index twice. Another example state group (147961623) should only have a single row associated with it, and yet the index returns nearly 7000 rows (including the one we expect). The unexpected state groups are all in the range 794390760=E2=80=93794393085 (except one in 794411694), and also have ctids = in range (93454823,48) =E2=80=93 (93455621,49). The fact that these are reasonably t= ight ranges feels suspicious. Note that the state group is a simple incrementing ID here. This table is quite large (about 6 TB) but we have sampled a few small ranges of it and found many instances of this type of corruption, in the first (approximate) half of the key range (0..561M out of 0..1034M). For historical reasons, the table and the index are on different tablespaces, but the same filesystem. We have sampled the table on our secondary server, and we see the same sort of corruption going on (though given the size of the data we don=E2=80=99t = actually know if it's exactly the same). One coincidence is that we started seeing the first symptoms of this around the same time as libicu was updated with a security patch. However, postgres hasn=E2=80=99t been restarted and doesn=E2=80=99t reference the ne= w version in its process maps. Plus state groups are integers anyway. We also use the C locale, not ICU. We=E2=80=99re currently running =E2=80=9Cpg_amcheck --index state_groups_st= ate_type_idx --heapallindexed=E2=80=9D on our secondary to see what it says, but we expe= ct that to take a long time to complete. Thankfully, we have database backups so hopefully we should be able to restore the data. However, any thoughts on how this happened or where to look next would be greatly appreciated. Thoughts on how to check our other indexes for corruption would also be very welcome. Thanks, Erik Further details of our setup: - 2 servers in physical replication (one primary, one secondary as a hot standby) - both servers display the corruption - ECC RAM - 8 NVME SSD, raid10 (mdraid), LVM, ext4 filesystem. - smartctl and mdadm report healthy disks - Debian, postgres installed via apt. - Postgres version: PostgreSQL 14.11 (Debian 14.11-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit - Kernel version 6.1.0-22-amd64, GLIBC 2.36-9+deb12u10 --=20 Copyright =C2=A9 2025 Element - All rights reserved. The Element name, logo= =20 and device are registered trademarks of New Vector Ltd. Registered number:= =20 10873661. Registered in England and Wales. Registered address: 10 Queen=20 Street Place, London, United Kingdom, EC4R 1AG. This message is intended=20 for the addressee only and may contain private and confidential information= =20 or material which may be privileged. If this message has come to you in=20 error please delete it immediately and do not copy it or show it to any=20 other person. --0000000000001216200639096b38 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hello,


We're= looking into a problem with our application and have tracked it down to in= dex corruption, whereby we have many index rows pointing to the wrong tuple= s in the heap.


Our table looks like:


```

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Table "mat= rix.state_groups_state"
=C2=A0 =C2=A0Column =C2=A0 =C2=A0| =C2=A0Ty= pe =C2=A0| Collation | Nullable | Default
-------------+--------+------= -----+----------+---------
=C2=A0state_group | bigint | =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|
=C2=A0room_i= d =C2=A0 =C2=A0 | text =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0|
=C2=A0type =C2=A0 =C2=A0 =C2=A0 =C2=A0| t= ext =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0|
=C2=A0state_key =C2=A0 | text =C2=A0 | =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|
=C2=A0event_id =C2= =A0 =C2=A0| text =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0|
Indexes:
=C2=A0 =C2=A0 "state_groups_= state_room_id_idx" brin (room_id) WITH (pages_per_range=3D'1')=
=C2=A0 =C2=A0 "state_groups_state_type_idx" btree (state_grou= p, type, state_key), tablespace "postgres_second"
Triggers:=C2=A0 =C2=A0 check_state_groups_state_deletion_trigger AFTER DELETE ON st= ate_groups_state DEFERRABLE INITIALLY DEFERRED FOR EACH ROW EXECUTE FUNCTIO= N check_state_groups_state_deletion()

```


The symptoms we= are noticing are that a DELETE or SELECT query includes rows that don'= t match the condition, as long as we issue a query that results in an Index= Scan (not Index Only Scan):


For example, including `ctid` in the = query is enough to make the planner use an Index Scan:


```<= /p>

SELECT ctid, state_group FROM state_groups_stat= e WHERE state_group =3D 483128098;

=C2=A0 =C2=A0 =C2=A0 ctid =C2=A0 = =C2=A0 =C2=A0| state_group
----------------+-------------
=C2=A0(166= 69607,1) =C2=A0 | =C2=A0 483128098
=C2=A0(424940858,20) | =C2=A0 9633618= 75
=C2=A0(16669606,53) =C2=A0| =C2=A0 483128098
(3 rows)

```


But with an Index Only Scan:


```

SELECT state_group FROM state_groups_state WHERE state_group =3D 48= 3128098;
=C2=A0state_group
-------------
=C2=A0 =C2=A0483128098=C2=A0 =C2=A0483128098
=C2=A0 =C2=A0483128098
(3 rows)

```=


Since including `ctid` in the SELECT columns causes the query to = use an Index Scan (fetching tuples from the heap), this inconsistency leads= us to believe that our index and heap disagree.


Forcing a sequen= tial scan with that same query only returns two rows matching that state gr= oup, which suggests that the index thinks there are more rows in the table = than there actually are. (We do not believe anything can have deleted a row= with state group 483128098). Also interestingly, querying (with the index = re-enabled) for 963361875 returns the same row as returned above, so the ro= w is in the index twice.


Another example state group (147961623) s= hould only have a single row associated with it, and yet the index returns = nearly 7000 rows (including the one we expect). The unexpected state groups= are all in the range 794390760=E2=80=93794393085 (except one in 794411694)= , and also have ctids in range (93454823,48) =E2=80=93 (93455621,49). The f= act that these are reasonably tight ranges feels suspicious. Note that the = state group is a simple incrementing ID here.


This table is quite = large (about 6 TB) but we have sampled a few small ranges of it and found m= any instances of this type of corruption, in the first (approximate) half o= f the key range (0..561M out of 0..1034M).


For historical reasons,= the table and the index are on different tablespaces, but the same filesys= tem.


We have sampled the table on our secondary server, and we see= the same sort of corruption going on (though given the size of the data we= don=E2=80=99t actually know if it's exactly the same).


<= p dir=3D"ltr" style=3D"line-height:1.38;margin-top:0pt;margin-bottom:0pt"><= span style=3D"font-size:11pt;font-family:Arial,sans-serif;color:rgb(0,0,0);= background-color:transparent;font-variant-numeric:normal;font-variant-east-= asian:normal;font-variant-alternates:normal;vertical-align:baseline">One co= incidence is that we started seeing the first symptoms of this around the s= ame time as libicu was updated with a security patch. However, postgres has= n=E2=80=99t been restarted and doesn=E2=80=99t reference the new version in= its process maps. Plus state groups are integers anyway. We also use the C= locale, not ICU.


We=E2=80=99re currently running =E2=80=9Cpg_amch= eck --index state_groups_state_type_idx --heapallindexed=E2=80=9D on our se= condary to see what it says, but we expect that to take a long time to comp= lete.


Thankfully, we have database backups so hopefully we should be able to restore the data. However, any thoughts = on how this happened or where to look next would be greatly appreciated. Th= oughts on how to check our other indexes for corruption would also be very = welcome.


Thanks,

Erik



Further details of our se= tup:

  • 2 servers in physical replication (one primary, one s= econdary as a hot standby)
    • both servers display the corruptio= n
  • ECC RAM
  • 8 NVME SSD, raid10 (mdraid), LVM, ext4 f= ilesystem.
    • smartctl and mdadm report healthy disks
  • Debian, postgres installed via apt.
  • Postgres version: Postgr= eSQL 14.11 (Debian 14.11-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by g= cc (Debian 12.2.0-14) 12.2.0, 64-bit
  • Kernel version 6.1.0-22-amd6= 4, GLIBC 2.36-9+deb12u10



Copyright =C2=A9 2025 Element - All rights reserved. The El= ement name, logo and device are registered trademarks of New Vector Ltd. Re= gistered number: 10873661. Registered in England and Wales. Registered addr= ess: 10 Queen Street Place, London, United Kingdom, EC4R 1AG.=

This message is intended for the addresse= e only and may contain private and confidential information or material whi= ch may be privileged. If this message has come to you in error please delet= e it immediately and do not copy it or show it to any other person.<= /span>

--0000000000001216200639096b38--