MIME-Version: 1.0
References: <CAGbX52HkW6926c4tY781+iH01x_0qw6Sfo=kT+LHjo_mENqOfQ@mail.gmail.com>
 <70c20a65-4624-4509-ac6a-ef7f0119ea28@aklaver.com>
In-Reply-To: <70c20a65-4624-4509-ac6a-ef7f0119ea28@aklaver.com>
From: Koen De Groote <kdg.dev@gmail.com>
Date: Fri, 31 Jan 2025 21:10:38 +0100
Message-ID: <CAGbX52E5_7fPZVW0YSjoTmj5qN-aORXpPqFF+bKCiicM+h3EZQ@mail.gmail.com>
Subject: Re: Postgres restore sometimes restores to a point 2 days in the past
To: Adrian Klaver <adrian.klaver@aklaver.com>
Cc: PostgreSQL General <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="00000000000091b167062d062344"
Archived-At: <https://www.postgresql.org/message-id/CAGbX52E5_7fPZVW0YSjoTmj5qN-aORXpPqFF%2BbKCiicM%2Bh3EZQ%40mail.gmail.com>
Precedence: bulk

--00000000000091b167062d062344
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

> What is the complete pg_basebackup command?

The command: pg_basebackup -h <IP> -p <PORT> -U <USERNAME> -D
<ABSOLUTE_PATH> -Ft -z -P -v --wal-method=none

So basically the same as the 2nd example here:
https://www.postgresql.org/docs/16/app-pgbasebackup.html except for the
verbose flag and the wal-method flag.

The wal-method is none for 2 reasons:
1/ Experience teaches that, when storage is on a network, timeouts while
writing WAL archives to the network location can cause WAL creation during
a basebackup to be considered failed, and that causes the entire basebackup
to be considered failed, even if a retry occurs. Any failure during a
basebackup will cause postgres to auto-delete it at the very end of
pg_basebackup, declaring it "unusable". This is extremely bad for backups
that take very long. Better to not include WAL files in the basebackup, and
just get them after the fact.
2/ All my WAL files are archived and uploaded to the cloud. So, I can just
have them downloaded.

This has worked for months on end, and so has restoring.

> I don't understand the above.

> What is determining that a particular WAL file should be asked for?

The postgres server itself does this. Here's the documentation:
https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND

And here: https://www.postgresql.org/docs/current/warm-standby.html

In practice, Postgres will see the "recovery.signal" file and start asking
for WAL files. It will read the database it has and determine what the next
WAL filename should be. And then it asks for it. And it will keep asking
for these hexadecimal filenames, one at a time, for as long as the command
or set of commands provided to "restore_command" returns exit code 0. If
the process receives any other exit code, it stops recovery, switches
timeline, and considers the database to be up and running at the state it's
in.
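
To make the "next WAL filename" step concrete, here is a minimal Python
sketch of how a segment name can be advanced. This is not Postgres' own
code: the function name next_wal_segment is made up for illustration, and
it assumes the default 16 MB wal_segment_size. It also ignores timeline
history files (names ending in .history), which recovery requests through
restore_command as well.

```python
def next_wal_segment(name: str,
                     wal_segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the WAL segment name that follows `name` on the same timeline.

    A segment name is 24 hex digits: 8 for the timeline ID, 8 for the
    high part of the WAL position ("log"), 8 for the segment number.
    """
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)

    # With 16 MB segments there are 256 segments per "log" value.
    segments_per_log = 0x100000000 // wal_segment_size

    seg += 1
    if seg >= segments_per_log:
        # Segment counter wraps and the "log" part is incremented.
        seg = 0
        log += 1
    return f"{timeline:08X}{log:08X}{seg:08X}"
```

For example, next_wal_segment("0000000100000000000000FF") returns
"000000010000000100000000": the last 8 digits wrap after FF and the middle
8 digits go up by one.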

It's constantly asking "I want this file now" and the script I have as the
restore command will attempt to download it from the cloud. Then it will
attempt to unzip it and move it into place. If any of these steps fails, I
return exit code 1.
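
As an illustration of those steps, here is a rough Python sketch of such a
restore script. It is not my actual script. The fetch callable is a
hypothetical stand-in for whatever downloads "<name>.gz" from the cloud,
and the ".gz.download" and ".partial" suffixes are assumptions. The
integrity check plays the role of "gzip -t", and the unpack step the role
of "gzip -dc %f > %p".

```python
import gzip
import os
import shutil
import sys
from typing import Callable


def restore_wal(fetch: Callable[[str, str], None],
                wal_name: str, dest_path: str) -> int:
    """Download, verify and unpack one WAL archive; return an exit code."""
    gz_path = dest_path + ".gz.download"   # assumed scratch location
    tmp_path = dest_path + ".partial"      # assumed scratch location
    try:
        # Hypothetical download step; expected to raise on any failure.
        fetch(wal_name + ".gz", gz_path)

        # Like "gzip -t": read the whole stream so that a truncated or
        # corrupt archive raises before anything is written to dest_path.
        with gzip.open(gz_path, "rb") as src:
            while src.read(1024 * 1024):
                pass

        # Like "gzip -dc %f > %p", but through a temporary file and a
        # rename, so a half-written file never sits at dest_path.
        with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.replace(tmp_path, dest_path)
        return 0
    except Exception as exc:
        print(f"restore of {wal_name} failed: {exc}", file=sys.stderr)
        return 1
    finally:
        # Remove scratch files regardless of the outcome.
        for leftover in (gz_path, tmp_path):
            if os.path.exists(leftover):
                os.remove(leftover)
```

A wrapper would call this with the %f and %p values that restore_command
passes in and hand the result to sys.exit(). Note that in this sketch a
temporary download error and a file that does not exist yet both end up as
exit code 1, so Postgres cannot tell them apart.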

> How active is the primary database you are pulling from?

Very active. On top of that, automated testing runs to ensure everything is
still working, and that alone generates multiple items per day.

> Available where?

The cloud, as I stated: WAL files get archived and these archived files are
then uploaded to the cloud.

See documentation:
https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-ARCHIVE-COMMAND

Regards,
Koen De Groote


On Fri, Jan 31, 2025 at 5:50 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

> On 1/31/25 01:47, Koen De Groote wrote:
>
> Comments in line.
>
> > I'm running postgres 16.6
> >
> > My backup strategy is: basebackup and WAL archive. These get uploaded to
> > the cloud.
> >
> > The restore is on an isolated machine and is performed daily. It
> > downloads the basebackup, unpacks it, sets a recovery.signal, and a
> > script is provided as restore_command, to download the WAL archives %f
> > and unpack them into %p
> >
>
> What is the complete pg_basebackup command?
>
> > In the script, the final unpacking is simply "gzip -dc %f > %p". The gz
> > files are first checked with "gzip -t".
> >
> > If a WAL archive is asked that doesn't exist yet, the script naturally
> > cannot find it, and exits with status code 1. This is the end of the
> > recovery.
>
> I don't understand the above.
>
> What is determining that a particular WAL file should be asked for?
>
> >
> > There are a few tables that are known to receive new entries multiple
> > times per day. However, the state of the recovery showed the latest item
> > to be 2 days in the past. Checking the live DB, there are an expected
> > amount of items since that ID.
>
> How active is the primary database you are pulling from?
>
> >
> > I checked the logs, the last WAL archive that got downloaded is indeed
> > the last one that was available. The one that failed to download on the
> > restore machine, was uploaded to the cloud 8 minutes later, according to
> > the upload logs on the live DB.
>
> Available where?
>
> If that was the last one available how could the subsequent one be a
> failure to download?
>
> >
> > The postgres logs themselves seem perfectly normal. It logs all these
> > WAL recoveries, switches the timeline, and becomes available.
> >
> > What could be going wrong? My main issue is that I don't know where to
> > start looking, since nothing in the logs seems abnormal.
> >
> > Regards,
> > Koen De Groote
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>
>

--00000000000091b167062d062344--