public inbox for [email protected]
help / color / mirror / Atom feedFrom: Pawel Kudzia <[email protected]>
To: [email protected]
Subject: Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
Date: Sun, 14 Sep 2025 14:47:19 +0200
Message-ID: <CAJYBUS9pUjZbqsndbjdHUNugonfQ4DRr_HEn0QkyTXsds0VLMQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <CAJYBUS9T=JCVGkb9yXn5GVYC_g-e5mRi4MFoNHKEZV0keuHHwQ@mail.gmail.com>
<[email protected]>
On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe <[email protected]> wrote:
>
> On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > I've hit a silent data corruption for TOAST data - leading to some
> > infinite loop when accessing bytea column for very particular row. I
> > did not suffer data loss - data from streaming replica was fine, I've
> > used it to rebuild the main server.
> >
> > I'm wondering if there's any proactive way of detecting that type of
> > an issue rather than discovering pile-up of SELECT queries leading to
> > CPU starvation or finding hanged backup jobs.
> >
> > Thanks in advance for your suggestions!
> >
> > I was originally running PostgreSQL 17.2 installed from project's deb
> > packages, under Debian 12 on amd64. Environment is - without any
> > recent crashes, with ECC memory and server-grade hardware. I run
> > pg_dumpall every 24h, on a certain day it hanged. On re-run it hanged
> > again, each time leaving single PosgreSQL process using 100% of single
> > CPU core.
> >
> > [...]
> >
> > PostgreSQL's log does not have any error messages indicating data corruption.
> >
> > This server does not have checksums enabled, but - as I understand -
> > such checksums cannot be checked online anyway. So - how can I detect
> > similar corruption?
>
> A proactive way of detecting TOAST corruption... how about
>
> pg_dump -f /dev/null yourdatabase
>
> If there is TOAST corruption, that should give you an error.
>
> If your standby does not have the same problem, that increases the
> likelihood that the cause is a hardware problem. Of course, it could
> still be a software bug.
>
> Yours,
> Laurenz Albe
Thanks for your answer! pg_dump mydatabase ends with ~the same as the
SELECT statement.
It hangs leaving 'postgres: 17/main: postgres application_cache
[local] COPY' using 100% of single CPU core and no messages in PG's
logs.
gdb stack trace for that process:
#0 0x000055cb571ef444 in hash_search_with_hash_value ()
#1 0x000055cb5706217a in BufTableLookup ()
#2 0x000055cb5706715f in StartReadBuffer ()
#3 0x000055cb57068671 in ReadBufferExtended ()
#4 0x000055cb56d91573 in _bt_relandgetbuf ()
#5 0x000055cb56d96b9b in _bt_moveright ()
#6 0x000055cb56d96d69 in _bt_search ()
#7 0x000055cb56d97b83 in _bt_first ()
#8 0x000055cb56d93bbf in btgettuple ()
#9 0x000055cb56d8b289 in index_getnext_tid ()
#10 0x000055cb56d8b3ed in index_getnext_slot ()
#11 0x000055cb56d8a957 in systable_getnext_ordered ()
#12 0x000055cb56d812bd in heap_fetch_toast_slice ()
#13 0x000055cb56d3abae in ?? ()
#14 0x000055cb56d3afc5 in detoast_attr ()
#15 0x000055cb571b7bd6 in byteaout ()
#16 0x000055cb571eb9a1 in OutputFunctionCall ()
#17 0x000055cb56e711c7 in ?? ()
#18 0x000055cb56e71b2f in DoCopyTo ()
#19 0x000055cb56e6c06f in DoCopy ()
#20 0x000055cb570a2726 in standard_ProcessUtility ()
#21 0x00007fd02f890a81 in ?? () from
/usr/lib/postgresql/17/lib/pg_stat_statements.so
I was curious if there's any other tool that can spot that type of
corruption and state it explicitly rather than guess it by monitoring
execution time.
#22 0x000055cb570a0c01 in ?? ()
#23 0x000055cb570a0d18 in ?? ()
#24 0x000055cb570a1296 in PortalRun ()
#25 0x000055cb5709d4a5 in ?? ()
#26 0x000055cb5709e82d in PostgresMain ()
#27 0x000055cb57099a3f in BackendMain ()
#28 0x000055cb570072aa in postmaster_child_launch ()
#29 0x000055cb5700adb9 in ?? ()
#30 0x000055cb5700ccc8 in PostmasterMain ()
#31 0x000055cb56d2ea11 in main ()
--
regards,
Pawel Kudzia
view thread (4+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected]
Subject: Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?
In-Reply-To: <CAJYBUS9pUjZbqsndbjdHUNugonfQ4DRr_HEn0QkyTXsds0VLMQ@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox