Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tdxLr-00Azph-MK for pgsql-general@arkaria.postgresql.org; Fri, 31 Jan 2025 20:10:56 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tdxLq-002O4Y-Pu for pgsql-general@arkaria.postgresql.org; Fri, 31 Jan 2025 20:10:54 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tdxLq-002O4Q-Bh for pgsql-general@lists.postgresql.org; Fri, 31 Jan 2025 20:10:54 +0000 Received: from mail-ej1-x62e.google.com ([2a00:1450:4864:20::62e]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tdxLn-002adD-1D for pgsql-general@lists.postgresql.org; Fri, 31 Jan 2025 20:10:54 +0000 Received: by mail-ej1-x62e.google.com with SMTP id a640c23a62f3a-ab6ed8a3f6aso342505466b.2 for ; Fri, 31 Jan 2025 12:10:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1738354250; x=1738959050; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=AxGCLOoYPfgcuKob1Ue4maZ2LM3nCGqocc2Y/0Lwi5w=; b=fnJw9Em32jyg/2KHqtmDep83ixWBxcT72cT+dOErTT9eSrYfxG7Q8NEtJxtGJlniKl 1pv9dEREnEnXVkBx7pZnQS9XVJIO56NyPiiJf3uz7exSzssjqoCCbqRHp58WQQNYc2La 2Ct0nbQsEIo329tUXnzkJlwQuRldchT4ZKJkZBz7vJiqYljU/GmmrHPpKmLXe41GdEg2 i+Agrt8ePIxHNf081Ct21CJ30E+dMbMzJwgFcyCRL7SR7d/thGrtjo9B5ikdiLo2x1eR Q+3vM8xT1jeN0g4vkqc5qD1D7dvmW2mmRFHkz9eTmip50rzdZYZmqziLWm2V4tfKlHj5 Ftuw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738354250; x=1738959050; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=AxGCLOoYPfgcuKob1Ue4maZ2LM3nCGqocc2Y/0Lwi5w=; b=Y+ddny+UO8fTqSkSr9DqiVBtejKxtFouCah4OuhwlcKjbiGpOSUXL/4YkG+qYEfiPG V08IX+waE8xAn3sdqTwktfjTPZ01kE9PDe4t/j9OhNOZunP7e3PvgVbbBySoYRQf7uZP S9U2VfzUg91YUsUdGLik6i4Bp/wLpKti6MZT3bTc9ThWlO7o3UTZI2nojaQ72/xcXpmU quXjowRSDjqNm3NNDGdkWj8lnU9tx6RXJTxvJ5WG4PEf8/GSqcxZ8XiLYrzDHgOg/y04 xdUlhrvMf5BcEsy0jkGkI4qgNL9PUfGwzSG9+1HTitjKPExN+XTT3Jt+MzZH1IyCX5Oc xcFQ== X-Gm-Message-State: AOJu0YxQLSEKHvdez71+NCfCjG/jH2JSMOK6l+1JZB3qhChfaGvybYNI aMGlHeXCV7DKl4wDETAxI349ph1BR7S712q71vdH1eB79OeTqm9P/lMAKR3DZIUHpL4AdRFlNt4 1urYAo4wb263VehO/Slyj11N5Ih8= X-Gm-Gg: ASbGncvXiVV6l04YNUHEmBI28olE9TYg4yOEQ2OXsJIwCteT/6WHGfF1qWC+Xw7AHn2 7LeF/HAqPfCYdpPkqVyWGWUeAASs8a+Stw7N+pXcrDfCBvQG3EKP3IdChvPfqEM02p207LK+VpM dX6lJw5XTrNAtvBv3WH/+dz3tM+0VcXZk= X-Google-Smtp-Source: AGHT+IGaaQudWhdHbK+D0gCma0fc5kcQ+eiMntTwIXf/1eM99CAqbqwowk/NJ9W0tEVYnHgGnO3w7DUVU6mqD1gdNDY= X-Received: by 2002:a17:907:2d8f:b0:aaf:117f:1918 with SMTP id a640c23a62f3a-ab6cfcb35a6mr1482191866b.5.1738354250065; Fri, 31 Jan 2025 12:10:50 -0800 (PST) MIME-Version: 1.0 References: <70c20a65-4624-4509-ac6a-ef7f0119ea28@aklaver.com> In-Reply-To: <70c20a65-4624-4509-ac6a-ef7f0119ea28@aklaver.com> From: Koen De Groote Date: Fri, 31 Jan 2025 21:10:38 +0100 X-Gm-Features: AWEUYZnuTaAelpVUToMvUTSvZllqCfOpS2gMRY4Av6-nn9QsMv4JavIdZbkU9vs Message-ID: Subject: Re: Postgres restore sometimes restores to a point 2 days in the past To: Adrian Klaver Cc: PostgreSQL General Content-Type: multipart/alternative; boundary="00000000000091b167062d062344" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000091b167062d062344 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable > What is the complete pg_basebackup command? The command: pg_basebackup -h -p -U -D -Ft -z -P -v --wal-method=3Dnone So basically the same as the 2nd example here: https://www.postgresql.org/docs/16/app-pgbasebackup.html except for the verbose flag and the wal-method flag. The wal-method is none for 2 reasons: 1/ Experience teaches that, in the event of storage being on a network, timeouts to write WAL archives tot he network location can cause WAL creation during a basebackup to be considered failed, and that causes the entire basebackup to be considered failed, even if a retry occurs. Any failure during a basebackup will cause postgres to auto-delete it at the very end of pg_basebackup, declaring it "unusable". This is extremely bad in backups that take very long. Better to not include WAL files in the basebackup, and just get them after the fact. 2/ All my WAL files are archived and uploaded to the cloud. So, I can just have them downloaded. This has worked for months on end, and so has restoring. > I don't understand the above. > What is determining that a particular WAL file should be asked for? The postgres server itself does this. Here's the documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMM= AND And here: https://www.postgresql.org/docs/current/warm-standby.html In practice, Postgres will see the "standby.signal" file and start asking for WAL files. It will read the database it has and determine what the next WAL filename should be. And then it asks for it. And it will keep asking for these hexadecimal filenames, 1 at a time, for as long as the command or set of commands provided to "restore_command" returns exit code 0. If the process receives any other exit code, it stops recovery, switches timeline, and considers the database to be up and running at the state its in. It's constantly asking "I want this file now" and the script I have as the restore command will attempt to download it from the cloud. Then it will attempt to unzip it and move it into place. If any of these steps fails, I return exit code 1. > How active is the primary database you are pulling from? Very active, plus automated testing to ensure everything is still running, this will generate multiple items per day on its own. > Available where? The cloud, as I stated: WAL files get archived and these archived files are then uploaded to the cloud. See documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-ARCHIVE-COMM= AND Regards, Koen De Groote On Fri, Jan 31, 2025 at 5:50=E2=80=AFPM Adrian Klaver wrote: > On 1/31/25 01:47, Koen De Groote wrote: > > Comments in line. > > > I'm running postgres 16.6 > > > > My backup strategy is: basebackup and WAL archive. These get uploaded t= o > > the cloud. > > > > The restore is on an isolated machine and is performed daily. It > > downloads the basebackup, unpacks it, sets a recovery.signal, and a > > script is provided as restore_command, to download the WAL archives %f > > and unpack them into %p > > > > What is the complete pg_basebackup command? > > > In the script, the final unpacking is simply "gzip -dc %f > %p". The gz > > files are first checked with "gzip -t". > > > > If a WAL archive is asked that doesn't exist yet, the script naturally > > cannot find it, and exits with status code 1. This is the end of the > > recovery. > > I don't understand the above. > > What is determining that a particular WAL file should be asked for? > > > > > There are a few tables that are known to receive new entries multiple > > times per day. However, the state of the recovery showed the latest ite= m > > to be 2 days in the past. Checking the live DB, there are an expected > > amount of items since that ID. > > How active is the primary database you are pulling from? > > > > > I checked the logs, the last WAL archive that got downloaded is indeed > > the last one that was available. The one that failed to download on the > > restore machine, was uploaded to the cloud 8 minutes later, according t= o > > the upload logs on the live DB. > > Available where? > > If that was the last one available how could the subsequent one be a > failure to download? > > > > > The postgres logs themselves seem perfectly normal. It logs all these > > WAL recoveries, switches the timeline, and becomes available. > > > > What could be going wrong? My main issue is that I don't know where to > > start looking, since nothing in the logs seems abnormal. > > > > Regards, > > Koen De Groote > > -- > Adrian Klaver > adrian.klaver@aklaver.com > > --00000000000091b167062d062344 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
> What is the complete pg_basebackup command?

<= /div>
The command: pg_basebackup -h <IP> -p <PORT> -U <U= SERNAME> -D <ABSOLUTE_PATH> -Ft -z -P -v --wal-method=3Dnone
=

So basically the same as the 2nd example here:=C2=A0https://w= ww.postgresql.org/docs/16/app-pgbasebackup.html except for the verbose = flag and the wal-method flag.

The wal-method is none for 2 reasons:<= br>1/ Experience teaches that, in the event of storage being on a network, = timeouts to write WAL archives tot he network location can cause WAL creati= on during a basebackup to be considered failed, and that causes the entire = basebackup to be considered failed, even if a retry occurs. Any failure dur= ing a basebackup will cause postgres to auto-delete it at the very end of p= g_basebackup, declaring it "unusable". This is extremely bad in b= ackups that take very long. Better to not include WAL files in the baseback= up, and just get them after the fact.
2/ All my WAL files are archived a= nd uploaded to the cloud. So, I can just have them downloaded.

This = has worked for months on end, and so has restoring.

> I don't understand the above.

> What is determining = that a particular WAL file should be asked for?

The post= gres server itself does this. Here's the documentation:=C2=A0https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTOR= E-COMMAND


In practice, P= ostgres will see the "standby.signal" file and start asking for W= AL files. It will read the database it has and determine what the next WAL = filename should be. And then it asks for it. And it will keep asking for th= ese hexadecimal filenames, 1 at a time, for as long as the command or set o= f commands provided to "restore_command" returns exit code 0. If = the process receives any other exit code, it stops recovery, switches timel= ine, and considers the database to be up and running at the state its in.

It's constantly asking "I want this file n= ow" and the script I have as the restore command will attempt to downl= oad it from the cloud. Then it will attempt to unzip it and move it into pl= ace. If any of these steps fails, I return exit code 1.

> How active is the primary database you are pulling from?

Very active, plus automated testing to ensure everything is still running= , this will generate multiple items per day on its own.

> Available where?

The cloud, as I stated= : WAL files get archived and these archived files are then uploaded to the = cloud.


Regards,
Koen De Groote
=



On Fri, Jan 31, 2025 at= 5:50=E2=80=AFPM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 1/31/25 01:47, Koen De Groote wrote:

Comments in line.

> I'm running postgres 16.6
>
> My backup strategy is: basebackup and WAL archive. These get uploaded = to
> the cloud.
>
> The restore is on an isolated machine and is performed daily. It
> downloads the basebackup, unpacks it, sets a recovery.signal, and a > script is provided as restore_command, to download the WAL archives=C2= =A0%f
> and unpack them into %p
>

What is the complete pg_basebackup command?

> In the script, the final unpacking is simply "gzip -dc %f > %p= ". The gz
> files are first checked with "gzip -t".
>
> If a WAL archive is asked that doesn't exist yet, the script natur= ally
> cannot find it, and exits with status code 1. This is the end of the <= br> > recovery.

I don't understand the above.

What is determining that a particular WAL file should be asked for?

>
> There are a few tables that are known to receive new entries multiple =
> times per day. However, the state of the recovery showed the latest it= em
> to be 2 days in the past. Checking the live DB, there are an expected =
> amount of items since that ID.

How active is the primary database you are pulling from?

>
> I checked the logs, the last WAL archive that got downloaded is indeed=
> the last one that was available. The one that failed to download on th= e
> restore machine, was uploaded to the cloud 8 minutes later, according = to
> the upload logs on the live DB.

Available where?

If that was the last one available how could the subsequent one be a
failure to download?

>
> The postgres logs themselves seem perfectly normal. It logs all these =
> WAL recoveries, switches the timeline, and becomes available.
>
> What could be going wrong? My main issue is that I don't know wher= e to
> start looking, since nothing in the logs seems abnormal.
>
> Regards,
> Koen De Groote

--
Adrian Klaver
adrian.klave= r@aklaver.com

--00000000000091b167062d062344--