Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rwNHh-001EdG-SZ for pgsql-general@arkaria.postgresql.org; Mon, 15 Apr 2024 14:26:13 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rwNHf-00DLAu-Ax for pgsql-general@arkaria.postgresql.org; Mon, 15 Apr 2024 14:26:11 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rwNHe-00DLAl-VF for pgsql-general@lists.postgresql.org; Mon, 15 Apr 2024 14:26:10 +0000 Received: from mail-4323.proton.ch ([185.70.43.23]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rwNHY-002yU5-44 for pgsql-general@lists.postgresql.org; Mon, 15 Apr 2024 14:26:08 +0000 Date: Mon, 15 Apr 2024 14:25:47 +0000 To: "pgsql-general@lists.postgresql.org" From: Nicolas Seinlet Subject: Re: Failing streaming replication on PostgreSQL 14 Message-ID: In-Reply-To: References: <2UjR_dq61Dzm98Vci6oYDLK4PlmvAzhbVg0uYCunualePnYCN8sYSXVuJW2N4D1kHqiWPw7VvEZMYS0Hy4UHfRI6lVApRrs_1-LZ3E0jzNE=@seinlet.com> Feedback-ID: 32582315:user:proton MIME-Version: 1.0 Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg=pgp-sha512; boundary="------acdca20396c4994f075755b27d682831e86f120ac9376f06857358ee602b354d"; charset=utf-8 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------acdca20396c4994f075755b27d682831e86f120ac9376f06857358ee602b354d Content-Type: multipart/mixed;boundary=---------------------d57287471de8986d1f71cb417256de19 -----------------------d57287471de8986d1f71cb417256de19 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain;charset=utf-8 On Monday, April 15th, 2024 at 14:36, Ron Johnson wrote: > On Mon, Apr 15, 2024 at 2:53=E2=80=AFAM Nicolas Seinlet wrote: > = > > Hello everyone, > > = > > Since I moved some clusters from PostgreSQL 12 to 14, I noticed random= failures in streaming replication. I say "random" mostly because I haven'= t got the source of the issue. > > = > > I'm using the Ubuntu/cyphered ZFS/PostgreSQL combination. I'm using Ub= untu LTS (20.04 22.04) and provided ZFS/PostgreSQL with LTS (PostgreSQL 12= on Ubuntu 20.04 and 14 on 22.04). > > = > > The streaming replication of PostgreSQL is configured with `primary_co= nninfo 'host=3Dmain_server port=3D5432 user=3Dreplicant password=3Da_very_= secure_password sslmode=3Drequire application_name=3Dreplication_postgresq= l_app' ` , no replication slot nor restore command, and the wal is configu= red with `full_page_writes =3D off wal_init_zero =3D off wal_recycle =3D o= ff` > > = > > If this works like a charm on PostgreSQL 12, it's sometimes failing wi= th PostgreSQL 14. As we also changed the OS, maybe the issue relies somewh= ere else. > > = > > When the issue is detected, the WAL on the primary is correct. A piece= of the WAL is wrong on the secondary. Only some bytes. Some bytes later, = the wal is again correct. Stopping PostgreSQL on the secondary, removing t= he wrong WAL file, and restarting PostgreSQL solves the issue. > > = > > We've added another secondary and noticed the issue can appear on one = of the secondaries, not both at the same time. > > = > > What can I do to detect the origin of this issue? > = > = > 1. Minor version number? > 2. Using replication_slots? > 3. Error message(s)? > = > = Hi, 1. PostgreSQL 14.11 2. No. no replication slot nor restore command. As we've understood the r= eplication slot, it's a mechanism to keep on the primary side everything n= eeded for the secondary to recover. Will this make the primary acknowledge= that the secondary received the good wal file? 3. incorrect resource manager data checksum Looking at the WAL files with xxd gives the following diff: The bad one: 006c9160: 0a6e 7514 5030 2e31 0e35 016c 0f07 0009 2f62 6568 6100 7669 6f72= 3a6e 6f72 be6d .nu.P0.1.5.l..../beha.vior:nor.m 006c9180: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000= 0000 0000 0000 ................................ 006c91a0: 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000= 0000 0000 0000 ................................ 006c91c0: 437a 4263 7500 7273 6f72 3a70 6f69 0302 4503 9023 3237 3665 3720= 323b 223e 5527 CzBcu.rsor:poi..E..#276e7 2;">U' The good one contains the same 1st and 4th lines, but the 2nd and 3rd line= s contain the correct values, as if a packet was missed. Thanks for helping, Nicolas. -----------------------d57287471de8986d1f71cb417256de19-- --------acdca20396c4994f075755b27d682831e86f120ac9376f06857358ee602b354d Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: ProtonMail wnUEARYKACcFgmYdON0JkJG8s0TkXG4EFiEEyut/r/ADqlpVUS/JkbyzRORc bgQAAPoIAP0QGd2bAAq93jAhxSHSRTa5PJPTbQuK/jxa/qfVimS1JAEAyQzf 65+RiIYtfQz3p5mFmKH/TSBvLdwGQ+OlUInl3Ag= =lAHw -----END PGP SIGNATURE----- --------acdca20396c4994f075755b27d682831e86f120ac9376f06857358ee602b354d--