public inbox for [email protected]
help / color / mirror / Atom feedCorrupt btree index includes rows that don't match
6+ messages / 3 participants
[nested] [flat]
* Corrupt btree index includes rows that don't match
@ 2025-07-03 17:07 Erik Johnston <[email protected]>
2025-07-04 13:49 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
0 siblings, 1 reply; 6+ messages in thread
From: Erik Johnston @ 2025-07-03 17:07 UTC (permalink / raw)
To: [email protected]; +Cc: Oliver Wilkinson <[email protected]>
Hello,
We're looking into a problem with our application and have tracked it down
to index corruption, whereby we have many index rows pointing to the wrong
tuples in the heap.
Our table looks like:
```
Table "matrix.state_groups_state"
Column | Type | Collation | Nullable | Default
-------------+--------+-----------+----------+---------
state_group | bigint | | |
room_id | text | | |
type | text | | |
state_key | text | | |
event_id | text | | |
Indexes:
"state_groups_state_room_id_idx" brin (room_id) WITH
(pages_per_range='1')
"state_groups_state_type_idx" btree (state_group, type, state_key),
tablespace "postgres_second"
Triggers:
check_state_groups_state_deletion_trigger AFTER DELETE ON
state_groups_state DEFERRABLE INITIALLY DEFERRED FOR EACH ROW EXECUTE
FUNCTION check_state_groups_state_deletion()
```
The symptoms we are noticing are that a DELETE or SELECT query includes
rows that don't match the condition, as long as we issue a query that
results in an Index Scan (not Index Only Scan):
For example, including `ctid` in the query is enough to make the planner
use an Index Scan:
```
SELECT ctid, state_group FROM state_groups_state WHERE state_group =
483128098;
ctid | state_group
----------------+-------------
(16669607,1) | 483128098
(424940858,20) | 963361875
(16669606,53) | 483128098
(3 rows)
```
But with an Index Only Scan:
```
SELECT state_group FROM state_groups_state WHERE state_group = 483128098;
state_group
-------------
483128098
483128098
483128098
(3 rows)
```
Since including `ctid` in the SELECT columns causes the query to use an
Index Scan (fetching tuples from the heap), this inconsistency leads us to
believe that our index and heap disagree.
Forcing a sequential scan with that same query only returns two rows
matching that state group, which suggests that the index thinks there are
more rows in the table than there actually are. (We do not believe anything
can have deleted a row with state group 483128098). Also interestingly,
querying (with the index re-enabled) for 963361875 returns the same row as
returned above, so the row is in the index twice.
Another example state group (147961623) should only have a single row
associated with it, and yet the index returns nearly 7000 rows (including
the one we expect). The unexpected state groups are all in the range
794390760–794393085 (except one in 794411694), and also have ctids in range
(93454823,48) – (93455621,49). The fact that these are reasonably tight
ranges feels suspicious. Note that the state group is a simple incrementing
ID here.
This table is quite large (about 6 TB) but we have sampled a few small
ranges of it and found many instances of this type of corruption, in the
first (approximate) half of the key range (0..561M out of 0..1034M).
For historical reasons, the table and the index are on different
tablespaces, but the same filesystem.
We have sampled the table on our secondary server, and we see the same sort
of corruption going on (though given the size of the data we don’t actually
know if it's exactly the same).
One coincidence is that we started seeing the first symptoms of this around
the same time as libicu was updated with a security patch. However,
postgres hasn’t been restarted and doesn’t reference the new version in its
process maps. Plus state groups are integers anyway. We also use the C
locale, not ICU.
We’re currently running “pg_amcheck --index state_groups_state_type_idx
--heapallindexed” on our secondary to see what it says, but we expect that
to take a long time to complete.
Thankfully, we have database backups so hopefully we should be able to
restore the data. However, any thoughts on how this happened or where to
look next would be greatly appreciated. Thoughts on how to check our other
indexes for corruption would also be very welcome.
Thanks,
Erik
Further details of our setup:
- 2 servers in physical replication (one primary, one secondary as a hot
standby)
- both servers display the corruption
- ECC RAM
- 8 NVME SSD, raid10 (mdraid), LVM, ext4 filesystem.
- smartctl and mdadm report healthy disks
- Debian, postgres installed via apt.
- Postgres version: PostgreSQL 14.11 (Debian 14.11-1.pgdg120+1) on
x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
- Kernel version 6.1.0-22-amd64, GLIBC 2.36-9+deb12u10
--
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered number:
10873661. Registered in England and Wales. Registered address: 10 Queen
Street Place, London, United Kingdom, EC4R 1AG.
This message is intended
for the addressee only and may contain private and confidential information
or material which may be privileged. If this message has come to you in
error please delete it immediately and do not copy it or show it to any
other person.
^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Corrupt btree index includes rows that don't match
2025-07-03 17:07 Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
@ 2025-07-04 13:49 ` Erik Johnston <[email protected]>
2025-07-04 14:38 ` Re: Corrupt btree index includes rows that don't match Ron Johnson <[email protected]>
0 siblings, 1 reply; 6+ messages in thread
From: Erik Johnston @ 2025-07-04 13:49 UTC (permalink / raw)
To: [email protected]
Hi, a quick update:
- We have discovered that the corruption was present from before libicu
update.
- We ran `pg_amcheck --index state_groups_state_type_idx
--heapallindexed matrix`, which returned nothing
- We believe that means that (and matches what we see sampling) the
index has gained extra entries, i.e. that for a given state group it
does return all the relevant rows in the table /plus/ extra rows.
We are also seeing old state groups starting to point at rows that have
only just been inserted. For example, querying for 353864583 on the
primary it returns that row plus four rows that have been inserted
today, but on the backup from last week an index only scan for 353864583
only returns one row. This makes it feel like the corruption is ongoing?
Nothing should have modified that state group in the interim (they are
generally immutable).
This naively feels like when inserting a new row we sometimes add the
row to the index twice: once pointing from the correct state group to
the new row, and once from an old state group to the new row?
Thanks,
Erik
On 03/07/2025 18:07, Erik Johnston wrote:
>
> Hello,
>
>
> We're looking into a problem with our application and have tracked it
> down to index corruption, whereby we have many index rows pointing to
> the wrong tuples in the heap.
>
>
> Our table looks like:
>
>
> ```
>
> Table "matrix.state_groups_state"
> Column | Type | Collation | Nullable | Default
> -------------+--------+-----------+----------+---------
> state_group | bigint | | |
> room_id | text | | |
> type | text | | |
> state_key | text | | |
> event_id | text | | |
> Indexes:
> "state_groups_state_room_id_idx" brin (room_id) WITH
> (pages_per_range='1')
> "state_groups_state_type_idx" btree (state_group, type,
> state_key), tablespace "postgres_second"
> Triggers:
> check_state_groups_state_deletion_trigger AFTER DELETE ON
> state_groups_state DEFERRABLE INITIALLY DEFERRED FOR EACH ROW EXECUTE
> FUNCTION check_state_groups_state_deletion()
>
> ```
>
>
> The symptoms we are noticing are that a DELETE or SELECT query
> includes rows that don't match the condition, as long as we issue a
> query that results in an Index Scan (not Index Only Scan):
>
>
> For example, including `ctid` in the query is enough to make the
> planner use an Index Scan:
>
>
> ```
>
> SELECT ctid, state_group FROM state_groups_state WHERE state_group =
> 483128098;
>
> ctid | state_group
> ----------------+-------------
> (16669607,1) | 483128098
> (424940858,20) | 963361875
> (16669606,53) | 483128098
> (3 rows)
>
> ```
>
>
> But with an Index Only Scan:
>
>
> ```
>
> SELECT state_group FROM state_groups_state WHERE state_group = 483128098;
> state_group
> -------------
> 483128098
> 483128098
> 483128098
> (3 rows)
>
> ```
>
>
> Since including `ctid` in the SELECT columns causes the query to use
> an Index Scan (fetching tuples from the heap), this inconsistency
> leads us to believe that our index and heap disagree.
>
>
> Forcing a sequential scan with that same query only returns two rows
> matching that state group, which suggests that the index thinks there
> are more rows in the table than there actually are. (We do not believe
> anything can have deleted a row with state group 483128098). Also
> interestingly, querying (with the index re-enabled) for 963361875
> returns the same row as returned above, so the row is in the index twice.
>
>
> Another example state group (147961623) should only have a single row
> associated with it, and yet the index returns nearly 7000 rows
> (including the one we expect). The unexpected state groups are all in
> the range 794390760–794393085 (except one in 794411694), and also have
> ctids in range (93454823,48) – (93455621,49). The fact that these are
> reasonably tight ranges feels suspicious. Note that the state group is
> a simple incrementing ID here.
>
>
> This table is quite large (about 6 TB) but we have sampled a few small
> ranges of it and found many instances of this type of corruption, in
> the first (approximate) half of the key range (0..561M out of 0..1034M).
>
>
> For historical reasons, the table and the index are on different
> tablespaces, but the same filesystem.
>
>
> We have sampled the table on our secondary server, and we see the same
> sort of corruption going on (though given the size of the data we
> don’t actually know if it's exactly the same).
>
>
> One coincidence is that we started seeing the first symptoms of this
> around the same time as libicu was updated with a security patch.
> However, postgres hasn’t been restarted and doesn’t reference the new
> version in its process maps. Plus state groups are integers anyway. We
> also use the C locale, not ICU.
>
>
> We’re currently running “pg_amcheck --index
> state_groups_state_type_idx --heapallindexed” on our secondary to see
> what it says, but we expect that to take a long time to complete.
>
>
> Thankfully, we have database backups so hopefullywe should be able to
> restore the data. However, any thoughts on how this happened or where
> to look next would be greatly appreciated. Thoughts on how to check
> our other indexes for corruption would also be very welcome.
>
>
> Thanks,
>
> Erik
>
>
>
> Further details of our setup:
>
> * 2 servers in physical replication (one primary, one secondary as a
> hot standby)
> o both servers display the corruption
> * ECC RAM
> * 8 NVME SSD, raid10 (mdraid), LVM, ext4 filesystem.
> o smartctl and mdadm report healthy disks
> * Debian, postgres installed via apt.
> * Postgres version: PostgreSQL 14.11 (Debian 14.11-1.pgdg120+1) on
> x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
> * Kernel version 6.1.0-22-amd64, GLIBC 2.36-9+deb12u10
>
>
>
> Copyright © 2025 Element - All rights reserved. The Element name, logo
> and device are registered trademarks of New Vector Ltd. Registered
> number: 10873661. Registered in England and Wales. Registered address:
> 10 Queen Street Place, London, United Kingdom, EC4R 1AG.
>
> This message is intended for the addressee only and may contain
> private and confidential information or material which may be
> privileged. If this message has come to you in error please delete it
> immediately and do not copy it or show it to any other person.
>
--
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered number:
10873661. Registered in England and Wales. Registered address: 10 Queen
Street Place, London, United Kingdom, EC4R 1AG.
This message is intended
for the addressee only and may contain private and confidential information
or material which may be privileged. If this message has come to you in
error please delete it immediately and do not copy it or show it to any
other person.
^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Corrupt btree index includes rows that don't match
2025-07-03 17:07 Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 13:49 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
@ 2025-07-04 14:38 ` Ron Johnson <[email protected]>
2025-07-04 14:59 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
0 siblings, 1 reply; 6+ messages in thread
From: Ron Johnson @ 2025-07-04 14:38 UTC (permalink / raw)
To: pgsql-generallists.postgresql.org <[email protected]>
On Fri, Jul 4, 2025 at 9:49 AM Erik Johnston <[email protected]> wrote:
> Hi, a quick update:
>
> - We have discovered that the corruption was present from before libicu
> update.
> - We ran `pg_amcheck --index state_groups_state_type_idx --heapallindexed
> matrix`, which returned nothing
> - We believe that means that (and matches what we see sampling) the index
> has gained extra entries, i.e. that for a given state group it does return
> all the relevant rows in the table *plus* extra rows.
>
> We are also seeing old state groups starting to point at rows that have
> only just been inserted. For example, querying for 353864583 on the primary
> it returns that row plus four rows that have been inserted today, but on
> the backup from last week an index only scan for 353864583 only returns one
> row. This makes it feel like the corruption is ongoing? Nothing should have
> modified that state group in the interim (they are generally immutable).
>
> This naively feels like when inserting a new row we sometimes add the row
> to the index twice: once pointing from the correct state group to the new
> row, and once from an old state group to the new row?
>
>
Are checksums enabled in the instance?
--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Corrupt btree index includes rows that don't match
2025-07-03 17:07 Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 13:49 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 14:38 ` Re: Corrupt btree index includes rows that don't match Ron Johnson <[email protected]>
@ 2025-07-04 14:59 ` Erik Johnston <[email protected]>
2025-07-04 15:41 ` Re: Corrupt btree index includes rows that don't match Tom Lane <[email protected]>
2025-07-09 17:02 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
0 siblings, 2 replies; 6+ messages in thread
From: Erik Johnston @ 2025-07-04 14:59 UTC (permalink / raw)
To: [email protected]
On Fri, 4 Jul 2025, 15:38 Ron Johnson, <[email protected]> wrote:
> On Fri, Jul 4, 2025 at 9:49 AM Erik Johnston <[email protected]> wrote:
>
>> Hi, a quick update:
>>
>> - We have discovered that the corruption was present from before libicu
>> update.
>> - We ran `pg_amcheck --index state_groups_state_type_idx --heapallindexed
>> matrix`, which returned nothing
>> - We believe that means that (and matches what we see sampling) the index
>> has gained extra entries, i.e. that for a given state group it does return
>> all the relevant rows in the table *plus* extra rows.
>>
>> We are also seeing old state groups starting to point at rows that have
>> only just been inserted. For example, querying for 353864583 on the primary
>> it returns that row plus four rows that have been inserted today, but on
>> the backup from last week an index only scan for 353864583 only returns one
>> row. This makes it feel like the corruption is ongoing? Nothing should have
>> modified that state group in the interim (they are generally immutable).
>>
>> This naively feels like when inserting a new row we sometimes add the row
>> to the index twice: once pointing from the correct state group to the new
>> row, and once from an old state group to the new row?
>>
>>
> Are checksums enabled in the instance?
>
Alas not.
We've also now found that the index on the backup does in fact point to
those ctids after all, but they are marked as dead. So at some point
between then and when we inserted the new row at that ctid today those
entries were marked undead.
--
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered number:
10873661. Registered in England and Wales. Registered address: 10 Queen
Street Place, London, United Kingdom, EC4R 1AG.
This message is intended
for the addressee only and may contain private and confidential information
or material which may be privileged. If this message has come to you in
error please delete it immediately and do not copy it or show it to any
other person.
^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Corrupt btree index includes rows that don't match
2025-07-03 17:07 Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 13:49 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 14:38 ` Re: Corrupt btree index includes rows that don't match Ron Johnson <[email protected]>
2025-07-04 14:59 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
@ 2025-07-04 15:41 ` Tom Lane <[email protected]>
1 sibling, 0 replies; 6+ messages in thread
From: Tom Lane @ 2025-07-04 15:41 UTC (permalink / raw)
To: Erik Johnston <[email protected]>; +Cc: [email protected]
Erik Johnston <[email protected]> writes:
> We've also now found that the index on the backup does in fact point to
> those ctids after all, but they are marked as dead. So at some point
> between then and when we inserted the new row at that ctid today those
> entries were marked undead.
I wonder if this behavior could be related to this 14.18 fix:
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4934d3875
It probably isn't, because AFAICS that would only lead to transiently
wrong answers --- but maybe you have a workload that occasionally hits
that bug in the context of a query that will update and re-insert the
recently-dead tuples? Still a bit far-fetched though, and if the
index is actually corrupt this doesn't explain how it got that way.
I'm more inclined to just say "once a btree index is out of order it
can do some very strange things, both during inserts and searches".
If you had some evidence about when and how the index became corrupt,
it'd be worth studying that, but it sounds like you don't.
regards, tom lane
^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Corrupt btree index includes rows that don't match
2025-07-03 17:07 Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 13:49 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 14:38 ` Re: Corrupt btree index includes rows that don't match Ron Johnson <[email protected]>
2025-07-04 14:59 ` Re: Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
@ 2025-07-09 17:02 ` Erik Johnston <[email protected]>
1 sibling, 0 replies; 6+ messages in thread
From: Erik Johnston @ 2025-07-09 17:02 UTC (permalink / raw)
To: [email protected]
Hi again,
Thanks very much for the replies last week. We’ve been continuing to
investigate this problem, and I thought I’d share an update on where we are.
To recap: the situation is that, looking at our backup from 2025-06-26
via pageinspect, we have btree index rows which point to either
non-existent heap TIDs, or to heap TIDs with data which does not
correspond to the index row. In fact it looks like we have entire index
pages which point only to non-existent heap TIDs.
(I previously said that these index rows were marked as ‘dead’ in the
backup. We now suspect this is an artifact of the restore process: we
believe they are live in the backup, but were marked as dead during the
restore.)
Empirically, and surprisingly to us, when one does a SELECT from an
index entry that points to a non-existent TID, the index entry is
quietly ignored.
We therefore suspect that this index corruption has been present for
some time (possibly years); more recently those non-existent heap TIDs
have been recycled, and that is when we have noticed the effects of the
problem.
As far as we can tell, the corruption only affects one index on one
table, and only a specific region of that index/table. Specifically, it
only appears to affect rows which would have been inserted between 2018
and January 2021. At least 1B rows appear to be affected (the table as a
whole has 29B rows).
One thing that surprised us is that ‘amcheck’ didn’t find any sign of
the corruption. We’re not completely sure if this is because we are
holding it wrong, or because it’s simply out of scope or unsupported for
amcheck. Any advice on this, or suggestions for other tooling we could
use to check the consistency of our other indexes, would be much
appreciated.
We’re still very interested in trying to understand the root cause of
the corruption, mostly to confirm that it’s not an ongoing problem.
Thanks Tom for the suggestion of
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4934d3875.
We agree with your assessment that this is unlikely. For one thing, it
looks like that bug could only conceivably cause this corruption if it
affected an UPDATE query, and we’re reasonably sure we never do any
UPDATE queries on that table. (The table is mostly append-only. We do
sometimes run cleanup/compression jobs which amount to large amounts of
interleaved DELETEs and INSERTs, but no UPDATEs.)
Back in 2021, we were running Postgres 10.11. We’ve taken a pass through
the release notes since then to see if we can find any likely-looking
bugs. We found the one that causes BRIN index corruption (this is not a
BRIN index), and the one that causes CREATE INDEX CONCURRENTLY to end up
with too *few* entries (this one has the opposite problem), but no
particularly likely candidate. Any other suggestions would be welcome here.
At the moment, a historical hardware-level problem seems like it might
be the most likely culprit, though we are a bit mystified about how any
hardware failure could have caused such widespread damage to a single
index, whilst apparently leaving the rest of the database intact.
Any thoughts or suggestions are very much appreciated.
Thanks,
Erik
On 04/07/2025 15:59, Erik Johnston wrote:
>
>
> On Fri, 4 Jul 2025, 15:38 Ron Johnson, <[email protected]> wrote:
>
> On Fri, Jul 4, 2025 at 9:49 AM Erik Johnston <[email protected]> wrote:
>
> Hi, a quick update:
>
> - We have discovered that the corruption was present from
> before libicu update.
> - We ran `pg_amcheck --index state_groups_state_type_idx
> --heapallindexed matrix`, which returned nothing
> - We believe that means that (and matches what we see
> sampling) the index has gained extra entries, i.e. that for a
> given state group it does return all the relevant rows in the
> table /plus/ extra rows.
>
> We are also seeing old state groups starting to point at rows
> that have only just been inserted. For example, querying for
> 353864583 on the primary it returns that row plus four rows
> that have been inserted today, but on the backup from last
> week an index only scan for 353864583 only returns one row.
> This makes it feel like the corruption is ongoing? Nothing
> should have modified that state group in the interim (they are
> generally immutable).
>
> This naively feels like when inserting a new row we sometimes
> add the row to the index twice: once pointing from the correct
> state group to the new row, and once from an old state group
> to the new row?
>
>
> Are checksums enabled in the instance?
>
>
> Alas not.
>
> We've also now found that the index on the backup does in fact point
> to those ctids after all, but they are marked as dead. So at some
> point between then and when we inserted the new row at that ctid today
> those entries were marked undead.
--
Element Logo
_Copyright © 2023 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered
number: 10873661. Registered in England and Wales. Registered address:
10 Queen Street Place, London, United Kingdom, EC4R 1AG.
This message is intended for the addressee only and may contain private
and confidential information or material which may be privileged. If
this message has come to you in error please delete it immediately and
do not copy it or show it to any other person.
--
Copyright © 2025 Element - All rights reserved. The Element name, logo
and device are registered trademarks of New Vector Ltd. Registered number:
10873661. Registered in England and Wales. Registered address: 10 Queen
Street Place, London, United Kingdom, EC4R 1AG.
This message is intended
for the addressee only and may contain private and confidential information
or material which may be privileged. If this message has come to you in
error please delete it immediately and do not copy it or show it to any
other person.
^ permalink raw reply [nested|flat] 6+ messages in thread
end of thread, other threads:[~2025-07-09 17:02 UTC | newest]
Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-07-03 17:07 Corrupt btree index includes rows that don't match Erik Johnston <[email protected]>
2025-07-04 13:49 ` Erik Johnston <[email protected]>
2025-07-04 14:38 ` Ron Johnson <[email protected]>
2025-07-04 14:59 ` Erik Johnston <[email protected]>
2025-07-04 15:41 ` Tom Lane <[email protected]>
2025-07-09 17:02 ` Erik Johnston <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox