public inbox for [email protected]
help / color / mirror / Atom feedFrom: Torsten Förtsch <[email protected]>
To: Craig McIlwee <[email protected]>
Cc: [email protected]
Subject: Re: Trouble using pg_rewind to undo standby promotion
Date: Thu, 7 Nov 2024 10:46:49 +0100
Message-ID: <CAKkG4_kt9zZNPwB2Qd0-jAtRn3m=JPWW_-hc8VSyhfZ+VjfKQw@mail.gmail.com> (raw)
In-Reply-To: <CAGqBcTbWpCuCRuWHd9209k2Q0CHA4j+2MnRR2SwszCXnErG9hw@mail.gmail.com>
References: <CAGqBcTbWpCuCRuWHd9209k2Q0CHA4j+2MnRR2SwszCXnErG9hw@mail.gmail.com>
Your point of divergence is in the middle of the 7718/000000BF file. So,
you should have 2 such files eventually, one on timeline 1 and the other on
timeline 2.
Are you archiving WAL on the promoted machine in a way that your
restore_command can find it? Check archive_command and archive_mode on the
promoted machine.
Also, do your archive/restore scripts work properly for history files?
On Wed, Nov 6, 2024 at 7:48 PM Craig McIlwee <[email protected]> wrote:
> I have a primary -> standby 1 -> standby 2 setup with all servers running
> PG 13.8 (this effort is part of getting on to a newer version, but I think
> those details aren't relevant to this problem). The first standby uses
> streaming replication from the primary and the second standby is using a
> WAL archive with a restore_command. To make this standby chain work,
> standby 1 is set to archive_mode = always with a command that populates the
> WAL archive.
>
> I would like to be able to promote standby 2 (hereon referred to just as
> 'standby'), perform some writes, then rewind it back to the point before
> promotion so it can become a standby again. The documentation for
> pg_rewind says that this is supported and it seems like it should be
> straightforward, but I'm not having any luck getting this to work so I'm
> hoping someone can point out what I'm doing wrong. Here's what I did:
>
> First, observe that WAL is properly being applied from the archive. Note
> that we are currently on timeline 1.
>
> 2024-11-06 09:51:23.286 EST [5438] LOG: restored log file
> "0000000100007711000000F9" from archive
> 2024-11-06 09:51:23.434 EST [5438] LOG: restored log file
> "0000000100007711000000FA" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007711000000FB.gz: No such file or
> directory
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
>
> Next, stop postgres, set wal_log_hints = on as required by pg_rewind, and
> restart postgres. I also make a copy of the data directory while the
> postgres is not running so I can repeat my test, which works fine on a
> small test database but won't be possible for the multi TB database that I
> will eventually be doing this on.
>
> Now promote the standby using "select pg_promote()" and see that it
> switches to a new timeline. You can also see that the last WAL applied
> from the archive is 7718/BF.
>
> 2024-11-06 12:10:10.831 EST [4336] LOG: restored log file
> "0000000100007718000000BD" from archive
> 2024-11-06 12:10:10.996 EST [4336] LOG: restored log file
> "0000000100007718000000BE" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007718000000BF.gz: No such file or
> directory
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.384 EST [4336] LOG: restored log file
> "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007718000000C0.gz: No such file or
> directory
> 2024-11-06 12:10:15.457 EST [4336] LOG: received promote request
> 2024-11-06 12:10:15.457 EST [4336] LOG: redo done at 7718/BFFFFF30
> 2024-11-06 12:10:15.457 EST [4336] LOG: last completed transaction was at
> log time 2024-11-06 12:10:22.627074-05
> 2024-11-06 12:10:15.593 EST [4336] LOG: restored log file
> "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.611 EST [4336] LOG: selected new timeline ID: 2
> 2024-11-06 12:10:15.640 EST [4336] LOG: archive recovery complete
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000001.history.gz: No such file or directory
> 2024-11-06 12:10:17.028 EST [4329] LOG: database system is ready to
> accept connections
>
> Next, insert a record into just to make some changes that I want to
> rollback later. (What I will eventually be doing is creating a publication
> to ship data to a newer version, but again, that's not what's important
> here.)
>
> Finally, shutdown postgres and attempt a rewind. The address used in the
> --source-server connection string is the address of the primary.
>
> 2024-11-06 12:11:11.139 EST [4329] LOG: received fast shutdown request
> 2024-11-06 12:11:11.143 EST [4329] LOG: aborting any active transactions
> 2024-11-06 12:11:11.144 EST [4329] LOG: background worker "logical
> replication launcher" (PID 5923) exited with exit code 1
> 2024-11-06 12:11:40.933 EST [4342] LOG: shutting down
> 2024-11-06 12:11:41.753 EST [4329] LOG: database system is shut down
>
> /usr/pgsql-13/bin/pg_rewind --target-pgdata=/data/pgsql/operational
> --source-server="host=x.x.x.x dbname=postgres user=xxx password=xxx"
> --dry-run --progress --restore-target-wal
>
> pg_rewind: connected to server
> pg_rewind: servers diverged at WAL location 7718/BFFFFFE8 on timeline 1
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000200007718000000BF.gz: No such file or
> directory
> pg_rewind: error: could not restore file "0000000200007718000000BF" from
> archive
> pg_rewind: fatal: could not find previous WAL record at 7718/BFFFFFE8
>
> pg_rewind shows the point of divergence as 7718/BF on timeline 1, but when
> it tries to replay WAL using the restore command it is trying to find WAL
> from timeline 2 rather than picking back up on timeline 1. I tried
> setting recovery_target_timeline on the target database to 'current' and
> '1' but that gave the same result. Searching the archives, [1] mentions the
> need to force a checkpoint after promotion which I tried even though the
> problem description isn't the same. [2] mentions a problem that looks more
> like the one I am facing but has no responses. At this point I don't know
> what to do next and hope someone can point me in the right direction.
>
> [1]
> https://www.postgresql.org/message-id/e7b16ddea93a92575cb6d143b6ef602cab22432e.camel%40cybertec.at
> [2]
> https://www.postgresql.org/message-id/[email protected]...
>
> Craig
>
view thread (3+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: Trouble using pg_rewind to undo standby promotion
In-Reply-To: <CAKkG4_kt9zZNPwB2Qd0-jAtRn3m=JPWW_-hc8VSyhfZ+VjfKQw@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox