Subject: Re: Replication to standby broke with WAL file corruption
From: Tomas Vondra
To: Ishan joshi, "pgsql-general@lists.postgresql.org"
Date: Mon, 16 Mar 2026 00:09:05 +0100

On 3/13/26 11:41, Ishan joshi wrote:
> Hi Team,
>
> I found an issue with PG v16.9 patroni setup where our standby node
> replication and disaster replication site replication broken with below
> error. It looks like WAL corruption which later part of archive file.
>
>
> CONTEXT:  WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> PANIC:  WAL contains references to invalid pages"
> CONTEXT:  WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0:
> rel1663/33195/410203483, blk 25329"
> WARNING:  page 25329 of relation base/33195/410203483 does not exist"
> INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a
> leader (pg-patroni-node2-0)"
> [61]LOG:  terminating any other active server processes"
> [61]LOG:  startup process (PID 72) was terminated by signal 6: Aborted"
> [61]LOG:  shutting down due to startup process failure"
> [61]LOG:  database system is shut down"
> INFO: establishing a new patroni heartbeat connection to postgres"
> INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0"
> WARNING: Retry got exception: connection problems"
> WARNING: Failed to determine PostgreSQL state from the connection,
> fallingback to cached role"
> INFO: Error communicating with PostgreSQL. Will try again later"
> WARNING: Postgresql is not running."
>
>
> Primary db was not impacted, however standby node and DR site
> replication broken, I tried to reinit with latest backup + archive
> loading from pgbackrest backup but it fails with same error once the
> corrupt wal/archive file applying the changes. I had to reinit with
> pgbasebackup with 40TB database which took about 45 hrs of time.
>
> As I understand the transcation create table ->performed DML and then
> drop the table or transaction could be rollback that makes RACE
> condition in WAL file creation and got failed while applying the same in
> standby/DR site.
>

It's hard to say what caused this, but it might be interesting to look at
the WAL using pg_waldump.
Start with the WAL segment containing the record that triggered the
failure, and then also look at the WAL segments before it that contain
references to relation 1663/33195/410203483 (and especially page 25329).

It is interesting that this succeeded on the primary but failed on the
standby. Is there anything special about the relation
1663/33195/410203483? Do you know whether it's a regular table, a
temporary table, etc.?


regards

--
Tomas Vondra
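
To make the pg_waldump suggestion above concrete, here is a minimal sketch of how one might work out which segment file holds the failing record and build a filtered pg_waldump call. It is only an illustration under stated assumptions: the timeline (1), the default 16 MB segment size, the directory /path/to/wal and the "16 segments back" window are all made up for the example and were not given in the thread. The helper names are hypothetical too. The --relation and --block filters exist in pg_waldump from PostgreSQL 15 on.

```python
SEG_SIZE = 16 * 1024 * 1024  # ASSUMPTION: default wal_segment_size (16 MB)


def lsn_to_segno(lsn: str, seg_size: int = SEG_SIZE) -> int:
    """Convert an LSN such as '184F3/F248B6F0' to a WAL segment number."""
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)
    return position // seg_size


def segment_name(segno: int, timeline: int, seg_size: int = SEG_SIZE) -> str:
    """Format a segment number as a 24-hex-digit WAL file name."""
    segs_per_id = 0x100000000 // seg_size  # segments per 4 GB "log id"
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


def waldump_command(wal_dir, relation, block, start_seg, end_seg):
    """Build a pg_waldump argument list restricted to one relation block."""
    return [
        "pg_waldump",
        "--path", wal_dir,
        "--relation", relation,   # tablespace/database/relfilenode
        "--block", str(block),    # only valid together with --relation
        start_seg,
        end_seg,
    ]


if __name__ == "__main__":
    timeline = 1                  # ASSUMPTION: not stated in the thread
    failing = lsn_to_segno("184F3/F248B6F0")
    end_seg = segment_name(failing, timeline)
    start_seg = segment_name(failing - 16, timeline)  # arbitrary look-back
    cmd = waldump_command("/path/to/wal", "1663/33195/410203483",
                          25329, start_seg, end_seg)
    print(" ".join(cmd))
```

Under those assumptions the failing record would sit in segment 00000001000184F3000000F2. The segments have to be present, uncompressed, in the directory passed to --path, so files restored from a pgbackrest archive would need to be fetched first.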