MIME-Version: 1.0
From: Koen De Groote <kdg.dev@gmail.com>
Date: Fri, 31 Jan 2025 10:47:17 +0100
Message-ID: <CAGbX52HkW6926c4tY781+iH01x_0qw6Sfo=kT+LHjo_mENqOfQ@mail.gmail.com>
Subject: Postgres restore sometimes restores to a point 2 days in the past
To: PostgreSQL General <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="0000000000003e4577062cfd6e80"
Archived-At: <https://www.postgresql.org/message-id/CAGbX52HkW6926c4tY781%2BiH01x_0qw6Sfo%3DkT%2BLHjo_mENqOfQ%40mail.gmail.com>
Precedence: bulk

--0000000000003e4577062cfd6e80
Content-Type: text/plain; charset="UTF-8"

I'm running PostgreSQL 16.6.

My backup strategy is: basebackup and WAL archive. These get uploaded to
the cloud.

The restore runs daily on an isolated machine. It downloads the
basebackup, unpacks it, creates a recovery.signal file, and sets
restore_command to a script that downloads the WAL archive %f and unpacks
it into %p.

In the script, the final unpacking is simply "gzip -dc %f > %p". The gz
files are first checked with "gzip -t".

If a WAL archive is requested that doesn't exist yet, the script cannot
find it and exits with status code 1. This ends the recovery.
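
Roughly, the logic described above can be sketched as follows. This is a
Python stand-in, not the actual script: the function name restore_wal, the
".gz" naming, and the local directory that takes the place of the cloud
download are all assumptions made for illustration.

```python
import gzip
import shutil
import zlib
from pathlib import Path


def restore_wal(archive_dir: Path, wal_name: str, dest_path: Path) -> int:
    """Return the exit status the restore script would report."""
    # Assumption: a local directory stands in for the cloud download step,
    # and archives are stored as "<WAL file name>.gz".
    archive = archive_dir / f"{wal_name}.gz"
    if not archive.is_file():
        # Segment not archived yet: exit status 1, as described above.
        return 1
    try:
        # Equivalent of "gzip -t": read the whole stream to verify it.
        with gzip.open(archive, "rb") as src:
            while src.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated archive: report failure.
        return 1
    # Equivalent of "gzip -dc %f > %p".
    with gzip.open(archive, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return 0
```

Here wal_name corresponds to %f and dest_path to %p; a wrapper would pass
the return value on as the process exit status.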

There are a few tables that are known to receive new entries multiple times
per day. However, the restored database showed the latest entry to be 2
days old. On the live DB, the expected number of entries exists after that
ID.

I checked the logs: the last WAL archive that was downloaded is indeed the
last one that was available. The one that failed to download on the restore
machine was uploaded to the cloud 8 minutes later, according to the upload
logs on the live DB.

The postgres logs themselves seem perfectly normal. They show all these WAL
files being restored, then a timeline switch, and then the server becoming
available.

What could be going wrong? My main issue is that I don't know where to
start looking, since nothing in the logs seems abnormal.

Regards,
Koen De Groote

--0000000000003e4577062cfd6e80--