Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
4+ messages / 4 participants
[nested] [flat]

* Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
@ 2025-09-14 12:47 Pawel Kudzia <[email protected]>
  2025-09-14 13:40 ` Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Laurenz Albe <[email protected]>
  2025-09-16 13:25 ` Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Vladlen Popolitov <[email protected]>
  0 siblings, 2 replies; 4+ messages in thread

From: Pawel Kudzia @ 2025-09-14 12:47 UTC (permalink / raw)
  To: [email protected]

On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe <[email protected]> wrote:
>
> On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > I've hit a silent data corruption for TOAST data - leading to some
> > infinite loop when accessing bytea column for very particular row. I
> > did not suffer data loss - data from streaming replica was fine, I've
> > used it to rebuild the main server.
> >
> > I'm wondering if there's any proactive way of detecting that type of
> > an issue rather than discovering pile-up of SELECT queries leading to
> > CPU starvation or finding hanged backup jobs.
> >
> > Thanks in advance for your suggestions!
> >
> > I was originally running PostgreSQL 17.2 installed from project's deb
> > packages, under Debian 12 on amd64. Environment is - without any
> > recent crashes, with ECC memory and server-grade hardware. I run
> > pg_dumpall every 24h, on a certain day it hanged. On re-run it hanged
> > again, each time leaving single PosgreSQL process using 100% of single
> > CPU core.
> >
> > [...]
> >
> > PostgreSQL's log does not have any error messages indicating data corruption.
> >
> > This server does not have checksums enabled, but - as I understand -
> > such checksums cannot be checked online anyway. So - how can I detect
> > similar corruption?
>
> A proactive way of detecting TOAST corruption... how about
>
>   pg_dump -f /dev/null yourdatabase
>
> If there is TOAST corruption, that should give you an error.
>
> If your standby does not have the same problem, that increases the
> likelihood that the cause is a hardware problem.  Of course, it could
> still be a software bug.
>
> Yours,
> Laurenz Albe

Thanks for your answer! pg_dump mydatabase ends with ~the same as the
SELECT statement.
It hangs leaving 'postgres: 17/main: postgres application_cache
[local] COPY' using 100% of single CPU core and no messages in PG's
logs.

gdb stack trace for that process:

#0  0x000055cb571ef444 in hash_search_with_hash_value ()
#1  0x000055cb5706217a in BufTableLookup ()
#2  0x000055cb5706715f in StartReadBuffer ()
#3  0x000055cb57068671 in ReadBufferExtended ()
#4  0x000055cb56d91573 in _bt_relandgetbuf ()
#5  0x000055cb56d96b9b in _bt_moveright ()
#6  0x000055cb56d96d69 in _bt_search ()
#7  0x000055cb56d97b83 in _bt_first ()
#8  0x000055cb56d93bbf in btgettuple ()
#9  0x000055cb56d8b289 in index_getnext_tid ()
#10 0x000055cb56d8b3ed in index_getnext_slot ()
#11 0x000055cb56d8a957 in systable_getnext_ordered ()
#12 0x000055cb56d812bd in heap_fetch_toast_slice ()
#13 0x000055cb56d3abae in ?? ()
#14 0x000055cb56d3afc5 in detoast_attr ()
#15 0x000055cb571b7bd6 in byteaout ()
#16 0x000055cb571eb9a1 in OutputFunctionCall ()
#17 0x000055cb56e711c7 in ?? ()
#18 0x000055cb56e71b2f in DoCopyTo ()
#19 0x000055cb56e6c06f in DoCopy ()
#20 0x000055cb570a2726 in standard_ProcessUtility ()
#21 0x00007fd02f890a81 in ?? () from
/usr/lib/postgresql/17/lib/pg_stat_statements.so

I was curious if there's any other tool that can spot that type of
corruption and state it explicitly rather than guess it by monitoring
execution time.
#22 0x000055cb570a0c01 in ?? ()
#23 0x000055cb570a0d18 in ?? ()
#24 0x000055cb570a1296 in PortalRun ()
#25 0x000055cb5709d4a5 in ?? ()
#26 0x000055cb5709e82d in PostgresMain ()
#27 0x000055cb57099a3f in BackendMain ()
#28 0x000055cb570072aa in postmaster_child_launch ()
#29 0x000055cb5700adb9 in ?? ()
#30 0x000055cb5700ccc8 in PostmasterMain ()
#31 0x000055cb56d2ea11 in main ()





--
regards,
Pawel Kudzia






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
  2025-09-14 12:47 Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Pawel Kudzia <[email protected]>
@ 2025-09-14 13:40 ` Laurenz Albe <[email protected]>
  1 sibling, 0 replies; 4+ messages in thread

From: Laurenz Albe @ 2025-09-14 13:40 UTC (permalink / raw)
  To: Pawel Kudzia <[email protected]>; [email protected]

On Sun, 2025-09-14 at 14:47 +0200, Pawel Kudzia wrote:
> On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe <[email protected]> wrote:
> > On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > > I've hit a silent data corruption for TOAST data - leading to some
> > > infinite loop when accessing bytea column for very particular row. I
> > > did not suffer data loss - data from streaming replica was fine, I've
> > > used it to rebuild the main server.
> > > 
> > > I'm wondering if there's any proactive way of detecting that type of
> > > an issue rather than discovering pile-up of SELECT queries leading to
> > > CPU starvation or finding hanged backup jobs.
> > > 
> > > PostgreSQL's log does not have any error messages indicating data corruption.
> > > 
> > > This server does not have checksums enabled, but - as I understand -
> > > such checksums cannot be checked online anyway. So - how can I detect
> > > similar corruption?
> > 
> > A proactive way of detecting TOAST corruption... how about
> > 
> >   pg_dump -f /dev/null yourdatabase
> > 
> > If there is TOAST corruption, that should give you an error.
> 
> Thanks for your answer! pg_dump mydatabase ends with ~the same as the
> SELECT statement.
> It hangs leaving 'postgres: 17/main: postgres application_cache
> [local] COPY' using 100% of single CPU core and no messages in PG's
> logs.
> 
> I was curious if there's any other tool that can spot that type of
> corruption and state it explicitly rather than guess it by monitoring
> execution time.

Data checksums would certainly be a step in the right direction.
They are checked whenever the data are read from storage.

There is no tool that will reliably detect all kinds of data corruption.

Another tool ar your disposal are the functions from the amcheck
contrib module (or the pg_amcheck executable that calls them).

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
  2025-09-14 12:47 Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Pawel Kudzia <[email protected]>
@ 2025-09-16 13:25 ` Vladlen Popolitov <[email protected]>
  2025-09-16 16:41   ` Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Merlin Moncure <[email protected]>
  1 sibling, 1 reply; 4+ messages in thread

From: Vladlen Popolitov @ 2025-09-16 13:25 UTC (permalink / raw)
  To: Pawel Kudzia <[email protected]>; +Cc: [email protected]

Pawel Kudzia писал(а) 2025-09-14 15:47:
> On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe 
> <[email protected]> wrote:

> 
> gdb stack trace for that process:
> 
> #0  0x000055cb571ef444 in hash_search_with_hash_value ()
> #1  0x000055cb5706217a in BufTableLookup ()

Hi,

  Probably, it does not hang in function hash_search_with_hash_value(),
probably you caught it in this function in that moment of the time.
This function itself do finite work and returns, it is hard to
harm it by wrong data.
  This function is called by code, that goes to leafs of the B+tree.
If this code enters destroyed block, it goes to wrong blocks and
behave unexpectedly. For example, it goes to block zero, that is not
leaf block (meta-page of the index) and probably has a lot of zero.
Btree code get new block address -  0 again,  and again goes to block 0.

  You have two options:
1) enable checksums (that is highly recommended)
and get error message immediately in case of failure, and restore 
database
from backup (and probably consider the change of the provider)
2) continue with disabled checksums, get programs crashes and finally
restore from backup.

Checksum calculation takes ~0.5% of query time, it is not bottleneck
in PostgreSQL.

P.S. Databases have a lot of code, that rely on correctness of original
data on the disk. It is impossible to to check every byte, is it correct
or not.

-- 
Best regards,

Vladlen Popolitov.

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
  2025-09-14 12:47 Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Pawel Kudzia <[email protected]>
  2025-09-16 13:25 ` Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Vladlen Popolitov <[email protected]>
@ 2025-09-16 16:41   ` Merlin Moncure <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Merlin Moncure @ 2025-09-16 16:41 UTC (permalink / raw)
  To: Vladlen Popolitov <[email protected]>; +Cc: Pawel Kudzia <[email protected]>; [email protected]

On Tue, Sep 16, 2025 at 7:25 AM Vladlen Popolitov <
[email protected]> wrote:

> Checksum calculation takes ~0.5% of query time, it is not bottleneck
> in PostgreSQL.

I consider checksums=on to be a mandatory setting.  Often, these types of
things are not bugs in postgres itself, but bugs in storage, the underlying
operating system, or extensions.  Checksums can and will protect you, and
may even bring you close to the thing causing the corruption.  Given that
your replica is ok, this very much smells like a similar type of issue.

In a prior case, I was using pl/sh to load data to the database with
'copy', and for what I believe to be o/s issues, was getting corruption.
Enabling checksums completely addressed the source of the damage.  Turn
them on!

merlin

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2025-09-16 16:41 UTC | newest]

Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-09-14 12:47 Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively? Pawel Kudzia <[email protected]>
2025-09-14 13:40 ` Laurenz Albe <[email protected]>
2025-09-16 13:25 ` Vladlen Popolitov <[email protected]>
2025-09-16 16:41   ` Merlin Moncure <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox