Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1um4LW-009Scr-FA for pgsql-general@arkaria.postgresql.org; Wed, 13 Aug 2025 05:48:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1um4LU-00CM1n-Tw for pgsql-general@arkaria.postgresql.org; Wed, 13 Aug 2025 05:48:21 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1um4LU-00CM1e-Hw for pgsql-general@lists.postgresql.org; Wed, 13 Aug 2025 05:48:20 +0000 Received: from mail-vk1-xa2e.google.com ([2607:f8b0:4864:20::a2e]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1um4LS-000TuW-1N for pgsql-general@lists.postgresql.org; Wed, 13 Aug 2025 05:48:20 +0000 Received: by mail-vk1-xa2e.google.com with SMTP id 71dfb90a1353d-5392abfdaa0so1716631e0c.1 for ; Tue, 12 Aug 2025 22:48:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1755064097; x=1755668897; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=iIQcXIwuxvbNL75if96ki20/qce/1jTYPodO4efb2iw=; b=QtuNXHlZTBY1OwLTP0Fv7dyyuNbsuPzH2lR0apUt0YMbFmXNI4hv3tkRj3JaapAzIv npEJmofx3eArDoHkvkqcZR1YiCctzZyeBwEihqcDjSRF90pj8A7k82smu5+rJ34+CsfF 9W09v/+v0+uOao+Cq+8Z/2zujTRdfBVuCSXVR81bGVD4dTExg6PhCn3yy0jwCSycXu68 38OpF855LSrh7j6HySys9jBrCmt/ZvgGfN91RuqZm0Y8PCWY5vZ3dZxCId2v070occ/Y qAgJ1Ynu1NbuX79CRcJpOMn7RfH47TAj1WlfrH3E0+iABOGvlQ41YTKbmH6YWh3hyDfY tUqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755064097; x=1755668897; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=iIQcXIwuxvbNL75if96ki20/qce/1jTYPodO4efb2iw=; b=JD0Fa+DeT+dSykKOI4cGsIIjh5/5AkOOan0Wmv+7PvNOtGJSpyjzsQc566CWP0qRBd o3pZee4LgStiChEz/1TbUkSFZYiaxxzvaGnMWeZnkgYdQkyzIyhH/OH+mbFF/FdUv5sz zuh/x7t/CcZ4hpr7UV4763U0xzMiMjbLLx4xxnxfyHFv1x7CXo+JpdokBqecdFJu+kh6 ndHX6uVsHBCcR0/wG/0GICDrQ9BLmdTElYpXeEoHsF6PW4p0/doo6tzvmXuF1u+AG9n3 cS342c8B9ZleXFtioPx+8Kl4gb9RnmGQZHutMOMiHbZ4aubQbsPnZtxr/myg/+DeUNCy yUTQ== X-Forwarded-Encrypted: i=1; AJvYcCWy7KiAEEM9u+afumsGBf4hqlXNZZRGn09T7x8iDmwTopdLzSTxIxf2HDxjZc6jvBXTZCoT5rfU5J+T5u0I@lists.postgresql.org X-Gm-Message-State: AOJu0YxKrsdh/qQSsqOS1rq+b47j43UTADQT9rL5fQOnodDqZWBzNnbC iaPnejx/kmpEfx7bmabcpH5V900spza4Yogaf3tiJ5JR6kND1pprwtI+zceBoCO2lVBZ/ps0vjA 13WEFMgsSKn2FcgYpfZk38IwQJY8paGE= X-Gm-Gg: ASbGncs+lbt10j6ot+h2T6NwzDdDI1dQhBkMZXevgpJxNqj2cascwZXjdeG1OjTxqhY W20nTmcDheJ3En2dRhiumEJH8X3g2EBDsCFd9MtLHyxdw/hqx1cD0oscIbo46FAcuphOdUo7Wwh 3IwwsvqzF8KWjvemaEAhH2Bty0cFp1Ko0M/y+eIkqtiROnXDqjiQdwUCswx/tbiAYVgGq34Yleb H/KjnDyx7pk2AMK86oP/mbZHZJSfaoMh2jBEg== X-Google-Smtp-Source: AGHT+IEdWbkJS0stG0o3q7/XYsCY7u6YuzyMdupsJVvt8fAc89JRxrWMxiepQp4J2x2Z+ish8Cn/ymy76OYpTvR3vsw= X-Received: by 2002:a05:6122:6581:b0:531:394a:23f3 with SMTP id 71dfb90a1353d-53b0b5f3f08mr444398e0c.10.1755064096913; Tue, 12 Aug 2025 22:48:16 -0700 (PDT) MIME-Version: 1.0 References: <73f3723b-f279-43c6-884d-d12b3151ec9e@aklaver.com> In-Reply-To: From: px shi Date: Wed, 13 Aug 2025 13:48:05 +0800 X-Gm-Features: Ac12FXw0Jnz0r31xH1R5RG3_VG7_bfZHDFq4ASevGR6zwTCPDpapl7TkXp92IYw Message-ID: Subject: Re: Questions about the continuity of WAL archiving To: Justin Cc: Ron Johnson , "pgsql-generallists.postgresql.org" Content-Type: multipart/alternative; boundary="0000000000000dff30063c38b470" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000000dff30063c38b470 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Here=E2=80=99s a scenario: The latest WAL file on the primary node is 0000000100000000000000AF, and the standby node has also received up to 0000000100000000000000AF. However, the latest WAL file that has been successfully archived from the primary is only 0000000100000000000000A1 (WAL files from A2 to AE have not yet been archived). If the primary crashes at this point, triggering a failover, the new primary will start generating and archiving WAL on a new timeline (2), beginning with 0000000200000000000000AF. It will not backfill the missing WAL files from timeline 1 (0000000100000000000000A2 to 0000000100000000000000AE). As a result, while the new primary does not have any local WAL gaps, the archive directory will contain a gap in that WAL range. I=E2=80=99m not sure if I explained it clearly. Justin =E4=BA=8E2025=E5=B9=B48=E6=9C=8813=E6=97=A5= =E5=91=A8=E4=B8=89 10:51=E5=86=99=E9=81=93=EF=BC=9A > > > On Tue, Aug 12, 2025 at 10:24=E2=80=AFPM px shi wro= te: > >> How often does your primary node crash, and then not recover due to WALs >>> corruption or WALs not existing? >>> >>> If it's _ever_ happened, you should _fix that_ instead of rolling your >>> own WAL archival.process. >>> >> >> I once encountered a case where the recovery process failed to restore >> to the latest LSN due to missing WAL files in the archive. The root caus= e >> was multiple failovers between primary and standby. During one of the >> switchovers, the primary crashed before completing the archiving of all = WAL >> files. When the standby was promoted to primary, it began archiving WAL >> files for the new timeline, resulting in a gap between the WAL files of = the >> two timelines. Moreover, no base backup was taken during this period. >> >> > > I am not sure what the problem is here either, other than something > seriously wrong with configuration with PostgreSQL and PgBackrest. > > The replica should be receiving the WAL via a replication slot using > Streaming. Meaning the primary will keep the WAL until the replica is > caught up, if the replica becomes disconnected due to > max_slot_wal_keep_size aka wal_keep_segments is exceeded the replicas > recovery_command can take offer and fetch from the WAL Archive to catch t= he > replica up. This assumes hot_feedback is on so the WAL replay won't beco= me > delayed due to snapshot locks on the replica. > > If all the above is true the replica should never lag behind unless the > disk IO layer is way undersized compared to the Primary. S3 is being > talked about so it makes me wonder about DISK IO configuration on the > primary vs the replica. I see this causing lag under high load where the > replica IO layer is the bottleneck. > > If PgBackrest can't keep up with WAL archiving, as others have stated yo= u > need to configure Asynchronous Archiving. The number of workers depends o= n > the load. I have a server running 8 parallel workers to archive 1TB of WA= L > daily.... And another server during maintenance tasks generates around > 10,000 WAL files in about 2 hours using 6 PgBAckrest workers All to S3 > buckets. > > The above statement makes me wonder if there is some kind of High > Availability monitor running like pg_autofailover, that is promoting a > replica then converting the former primary to a replica of the recently > "promoted replica" > > If the above matches to what is happening, it is very easy to mess up the > configuration for WAL archiving and backups. Part of the process of > promoting a replica is to make sure WAL archiving is working. The replic= a > after being promoted immediately kicks of autovacuum to rebuild things li= ke > FSM which generates a lot of WAL files. > > If you are losing WAL files the configuration is wrong somewhere.. > > Just not enough information on the series of events and the configuration > to tell what the root cause is other than miss-configuration. > > > Thanks > Justin > > > > --0000000000000dff30063c38b470 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Here=E2=80=99s a scenario: The latest WAL file on the= primary node is 0000000100000000000000AF, and the standby node has also re= ceived up to 0000000100000000000000AF. However, the latest WAL file that ha= s been successfully archived from the primary is only 000000010000000000000= 0A1 (WAL files from A2 to AE have not yet been archived). If the primary cr= ashes at this point, triggering a failover, the new primary will start gene= rating and archiving WAL on a new timeline (2), beginning with 000000020000= 0000000000AF. It will not backfill the missing WAL files from timeline 1 (0= 000000100000000000000A2 to 0000000100000000000000AE). As a result, while th= e new primary does not have any local WAL gaps, the archive directory will = contain a gap in that WAL range.
I=E2=80=99m not sure if I ex= plained it clearly.


Justin &l= t;zzzzz.graf@gmail.com> =E4= =BA=8E2025=E5=B9=B48=E6=9C=8813=E6=97=A5=E5=91=A8=E4=B8=89 10:51=E5=86=99= =E9=81=93=EF=BC=9A


On Tue, Aug 12, 2025 at 10:24=E2=80=AFPM px shi <spxlyy123@gmail.com> wrote= :
How often doe= s your primary node crash, and then not recover due to WALs corruption or W= ALs not existing?

If it's _ever_ happened, you should _fix that_= instead of rolling your own WAL archival.process.
=C2=A0I once encountered a case where the recovery process fail= ed to restore to the latest LSN due to missing WAL files in the archive. Th= e root cause was multiple failovers between primary and standby. During one= of the switchovers, the primary crashed before completing the archiving of= all WAL files. When the standby was promoted to primary, it began archivin= g WAL files for the new timeline, resulting in a gap between the WAL files = of the two timelines. Moreover, no base backup was taken during this period= .



I am not sure w= hat the problem is=C2=A0 here either, other than something seriously wrong = with configuration with=C2=A0PostgreSQL and PgBackrest.

The replica = should be receiving the WAL=C2=A0via a replication=C2=A0slot using Streamin= g.=C2=A0 Meaning the primary will keep the WAL until the replica is caught = up,=C2=A0 if the replica becomes disconnected due to=C2=A0 max_slot_wal_kee= p_size aka=C2=A0wal_keep_segments=C2=A0 is exceeded the replicas recovery_c= ommand can take offer and fetch from the WAL Archive to catch the replica u= p.=C2=A0 This assumes hot_feedback is on so the WAL replay won't become= delayed due to snapshot locks on the replica.=C2=A0

If=C2=A0 a= ll the above is true the replica should never lag behind unless the disk IO= layer is way undersized compared to the Primary.=C2=A0 S3 is being talked = about=C2=A0 so it makes me wonder about DISK IO configuration on the primar= y vs the replica.=C2=A0 I see this causing lag under high load where the re= plica IO layer is the bottleneck.

If PgBackrest can't keep up wi= th WAL archiving, as others have stated=C2=A0 you need to configure Asynchr= onous Archiving. The number of workers depends on the load. I have a server= running 8 parallel=C2=A0workers to archive 1TB of WAL daily....=C2=A0 And = another server=C2=A0 during maintenance tasks generates=C2=A0around 10,000 = WAL files in about 2 hours using 6 PgBAckrest workers=C2=A0 All=C2=A0to S3 = buckets.

The above statement makes me wonder if there is some kind o= f High Availability monitor running like pg_autofailover, that is promoting= a replica then=C2=A0 converting the former primary to a replica of the rec= ently "promoted replica"=C2=A0=C2=A0

If the above matches = to what=C2=A0is happening, it is very easy to mess up the configuration for= WAL archiving and backups. Part of the process of promoting a replica is t= o make sure WAL archiving is working.=C2=A0 The replica after being promote= d immediately kicks of autovacuum to rebuild things like FSM which generate= s a lot of WAL files.

If you are=C2=A0losing=C2=A0 WAL files the con= figuration is wrong somewhere..

Just not enough information on the s= eries of events and the=C2=A0configuration to tell what the root cause is o= ther than miss-configuration.


Thanks
Justin


=C2=A0=
--0000000000000dff30063c38b470--