From: Kasper Føns
Date: Mon, 1 Dec 2025 10:49:42 +0100
Subject: Fwd: restore_command on high-throughput cluster never switches to streaming replication
To: pgsql-general@lists.postgresql.org

Hi PostgreSQL community.

I debugged an instance where a PostgreSQL standby would not switch to streaming replication when the `restore_command` fails.
I first posted this to the pgsql-admin mailing list, but I am now trying here as I got no response.

Expectation
I expect PostgreSQL to try switching to streaming replication if the `restore_command` fails.

What happens
PostgreSQL attempts to restore the previously restored WAL segment and then retries the failed segment. However, because the primary produces WAL at a high rate, the WAL file now exists in the archive and PostgreSQL does not try to switch to streaming replication.
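
The loop described above can be sketched as a toy model. This is not PostgreSQL's recovery code: `simulate`, its parameters, and the rule that streaming is only attempted when the retry also fails are assumptions made to mirror what the logs show.

```python
def simulate(start_segment, archived_upto, growth_per_fetch, max_fetches=12):
    """Toy model of the observed restore loop (hypothetical, not PostgreSQL code).

    archived_upto: highest segment number present in the archive at the start.
    growth_per_fetch: segments the primary archives during one restore attempt.
    """
    events = []
    state = {"upto": archived_upto, "fetches": 0}

    def fetch(segment):
        # A fetch succeeds only if the segment is already archived.
        ok = segment <= int(state["upto"])
        # The primary keeps archiving while the standby is busy fetching.
        state["upto"] += growth_per_fetch
        state["fetches"] += 1
        return ok

    want = start_segment
    while state["fetches"] < max_fetches:
        if fetch(want):
            events.append(("restored", want))
            want += 1
            continue
        events.append(("failed", want))
        # As in the logs: the previous segment is requested again first.
        fetch(want - 1)
        events.append(("restored", want - 1))
        # Assumption: streaming is only tried if the retry fails as well.
        if not fetch(want):
            events.append(("streaming", want))
            break
        events.append(("restored", want))
        want += 1
    return events


busy = simulate(0x57, 0x57, growth_per_fetch=0.5)  # primary keeps producing WAL
idle = simulate(0x57, 0x57, growth_per_fetch=0)    # archive stops growing
```

In this model the busy run keeps alternating between a failure and a successful retry without ever reaching the streaming branch, while the idle run reaches it after the first failed retry.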

Context
Running PostgreSQL 15.7 in Kubernetes using the CloudNative PostgreSQL Operator.

Logs
I configured PostgreSQL to emit DEBUG3 level logs. Newest logs first, oldest last.

got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000058 pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000057 pg_wal/RECOVERYXLOG"
could not open file "pg_wal/000000410000A7BA00000058": No such file or directory
could not restore file "000000410000A7BA00000058" from archive: child process exited with exit code 1
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000058 pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000057 pg_wal/RECOVERYXLOG"

Notice that when 000000410000A7BA00000058 failed, PostgreSQL asked for 000000410000A7BA00000057, which it had already restored. Afterwards, it asks for 000000410000A7BA00000058 once again.
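
To make the relationship between the two file names explicit, here is a small sketch that decodes them. `parse_wal_segment` is a hypothetical helper, and the default 16 MB `wal_segment_size` is an assumption, since the segment size of this cluster is not stated.

```python
def parse_wal_segment(name, wal_segment_size=16 * 1024 * 1024):
    """Split a 24-character WAL file name into (timeline, segment number).

    The name is three 8-digit hex fields: timeline, log id, segment within log.
    wal_segment_size defaults to 16 MB, which is an assumption for this cluster.
    """
    if len(name) != 24:
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_in_log = int(name[16:24], 16)
    # Each log id covers 4 GiB of WAL, so it holds 4 GiB / segment size segments.
    segments_per_log = 0x100000000 // wal_segment_size
    return timeline, log_id * segments_per_log + seg_in_log


previous = parse_wal_segment("000000410000A7BA00000057")
failed = parse_wal_segment("000000410000A7BA00000058")
```

Both names are on the same timeline and their segment numbers differ by exactly one, so the segment requested after the failure is the immediate predecessor of the failed one.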

Problem
This is problematic because the standby will never switch to streaming replication.

Workaround
We can get the PostgreSQL replica to become in-sync if we change the `restore_command` to `/bin/false` when we are within `wal_keep_size`.
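
One way the "within `wal_keep_size`" check could be expressed is sketched below. The helper names are hypothetical, and comparing the primary's current LSN with the standby's replay LSN is an assumption about how the lag would be measured; it is not something the operator is known to do.

```python
def lsn_to_int(lsn):
    """Convert a pg_lsn string such as 'A7BA/58000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def within_wal_keep_size(primary_lsn, standby_lsn, wal_keep_size_bytes):
    """Return True if the standby's lag is smaller than wal_keep_size.

    Assumption: when this holds, the primary still has the needed WAL in
    pg_wal, so making restore_command fail lets the standby stream instead.
    """
    lag_bytes = lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)
    return lag_bytes < wal_keep_size_bytes


# Example values only: the standby is one 16 MB segment behind the primary.
close_enough = within_wal_keep_size("A7BA/59000000", "A7BA/58000000", 1024 ** 3)
too_far = within_wal_keep_size("A7BA/59000000", "A7BA/58000000", 8 * 1024 ** 2)
```

Note that `wal_keep_size` is configured in megabytes by default, so the value has to be converted to bytes before it is passed in.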

Question
Is this the expected behaviour?

I expect the function `WaitForWALToBecomeAvailable` to switch to streaming replication once a single `restore_command` fails. The switch does happen when `/bin/false` is used as the command instead.

Any help would be greatly appreciated.
/Kasper Føns