From: Kasper Føns
Date: Mon, 24 Nov 2025 14:46:26 +0100
Subject: restore_command on high-throughput cluster never switches to streaming replication
To: pgsql-admin@lists.postgresql.org

Hi PostgreSQL community.

I debugged an instance where a PostgreSQL standby would not switch to streaming replication when the `restore_command` fails.

*Expectation*
I expect PostgreSQL to try switching to streaming replication if the `restore_command` fails.

*What happens*
PostgreSQL attempts to restore the previously restored WAL segment and then retries the failed segment. However, because the primary produces WAL at a high rate, the WAL file now exists in the archive, the retry succeeds, and PostgreSQL does not try to switch to streaming replication.

*Context*
Running PostgreSQL 15.7 in Kubernetes using CloudNative PostgreSQL Operator.

*Logs*
I configured PostgreSQL to emit DEBUG3 level logs. Newest logs first, oldest last.

got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000058 pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000057 pg_wal/RECOVERYXLOG"
could not open file "pg_wal/000000410000A7BA00000058": No such file or directory
could not restore file "000000410000A7BA00000058" from archive: child process exited with exit code 1
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000058 pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000057 pg_wal/RECOVERYXLOG"

Notice that when *000000410000A7BA00000058* failed, PostgreSQL asked for *000000410000A7BA00000057*, which it had already restored. Afterwards, it asks for *000000410000A7BA00000058* once again.

*Problem*
This is problematic because the standby will never switch to streaming replication.

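To make the sequence easier to follow, here is a toy model of it in Python. It only mimics the order of events seen in the logs and is not PostgreSQL's recovery code: the function `replay`, its parameters, and the idea that every `restore_command` call costs one "tick" are all assumptions made up for this sketch.

```python
def replay(archived_at, start, ticks, switch_on_failure):
    """Toy model of the retry loop seen in the logs (hypothetical, simplified).

    archived_at maps a segment number to the tick at which the primary has
    archived it. Every restore_command call is assumed to cost one tick.
    Returns the list of events, oldest first.
    """
    events = []
    seg, tick = start, 0
    while tick < ticks:
        if archived_at.get(seg, float("inf")) <= tick:
            # The segment is in the archive: restore it and move on.
            events.append(("restored", seg))
            seg += 1
            tick += 1
        elif switch_on_failure:
            # Expected behaviour: one failed restore triggers streaming.
            events.append(("failed", seg))
            events.append(("streaming", seg))
            break
        else:
            # Observed behaviour: the failed restore is followed by a new
            # request for the previous segment (assumed to still be in the
            # archive), and only then is the failed segment retried.
            events.append(("failed", seg))
            tick += 1
            events.append(("restored", seg - 1))
            tick += 1
    return events


# Segment ...57 is archived from the start; ...58 shows up two ticks later.
observed = replay({0x57: 0, 0x58: 2}, start=0x57, ticks=4, switch_on_failure=False)
expected = replay({0x57: 0, 0x58: 2}, start=0x57, ticks=4, switch_on_failure=True)
```

With these made-up timings, `observed` contains the same four steps as the logs (57 restored, 58 failed, 57 restored again, 58 restored) and never reaches a streaming step, while `expected` ends with a streaming attempt right after the first failure.
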
*Workaround*
We can get the PostgreSQL replica to become in-sync if we change the `restore_command` to `/bin/false` once we are within `wal_keep_size`.

*Question*
Is this the expected behaviour?

I expect the function `WaitForWALToBecomeAvailable` to switch to streaming replication once a single `restore_command` fails. This also happens when `/bin/false` is used instead.

Any help would be greatly appreciated.

/Kasper Føns