From: Jerry Sievers <gsievers19@comcast.net>
To: richard@kojedz.in
Cc:  pgsql-admin@lists.postgresql.org
Subject: Re: In-place upgrade with streaming replicas
In-Reply-To: <71ed34f8c1ed858262a7cf50b633682f@kojedz.in> (richard@kojedz.in's
	message of "Thu, 20 Feb 2025 08:55:31 +0100")
References: <202502191554.6asefyczl7jn@alvherre.pgsql>
	<d438397ad2c67e0ce683bc3158746691@kojedz.in>
	<m2y0y1kqds.fsf@comcast.net>
	<71ed34f8c1ed858262a7cf50b633682f@kojedz.in>
Date: Thu, 20 Feb 2025 21:57:43 -0600
Message-ID: <m2tt8okkzs.fsf@comcast.net>
Archived-At: <https://www.postgresql.org/message-id/m2tt8okkzs.fsf%40comcast.net>

richard@kojedz.in writes:

> Dear Jerry,
>
> So, yes it turns out that some kind of loop must be involved here, as
> you described:
>
> 1. ensure cluster is running
> 2. stop primary
> 3. wait some time
> 4. stop replicas
> 5. check if checkpoint locations match. repeat from step 1 if
> out-of-sync.
>
> The unreliable step here is the third one. Can we query a running
> replica to find out whether it has caught up? I mean, after stopping
> the primary we can obtain the checkpoint location from
> pg_controldata; can we then somehow query the running replica about
> that?
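
One possible way to ask a running replica, sketched under assumptions
rather than taken from our procedure: read "Latest checkpoint location"
from pg_controldata on the stopped primary, run
SELECT pg_last_wal_replay_lsn(); on the replica, and compare the two
LSNs numerically. The helper names below are hypothetical, and the
sketch assumes the replay position is the end of the last replayed
record, so a value strictly past the checkpoint location means the
shutdown checkpoint record was replayed. Please verify that on your
version.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '16/B374D848' into a 64-bit integer.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits, then the low 32 bits.
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replica_has_replayed(checkpoint_lsn: str, replay_lsn: str) -> bool:
    """Return True once the replica's replay position is past the start
    of the primary's shutdown checkpoint record.

    checkpoint_lsn: 'Latest checkpoint location' from pg_controldata on
                    the stopped primary.
    replay_lsn:     result of pg_last_wal_replay_lsn() on the replica.
    Assumption: a replay position strictly greater than the checkpoint
    location means that record has been replayed.
    """
    return parse_lsn(replay_lsn) > parse_lsn(checkpoint_lsn)
```

Comparing the values as integers matters: as plain strings, "F/FFFFFFFF"
would sort after "10/0" even though it is the smaller LSN.
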
Assuming your client traffic has been stopped ahead of time, perhaps
with a lockout via HBA or other means, including forcible termination
of persistent clients (we usually restart the primary to ensure
this)...

We don't wait more than a few seconds before also stopping the replicas,
and the vast majority of the time all nodes are at the same checkpoint.
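
The match check itself (pg_controldata on every node, keep the latest
checkpoint line, expect a single unique value) could be sketched roughly
like this. It is only an illustration: the function names are made up,
it assumes you have already collected each node's pg_controldata output
as text, and it assumes the English label "Latest checkpoint location".

```python
import re

# Matches only the "Latest checkpoint location" line, not the
# "Latest checkpoint's REDO location" line that follows it.
CHECKPOINT_RE = re.compile(
    r"^Latest checkpoint location:\s*(\S+)\s*$", re.MULTILINE
)


def checkpoint_location(controldata_output: str) -> str:
    """Extract the latest checkpoint location from pg_controldata text."""
    match = CHECKPOINT_RE.search(controldata_output)
    if match is None:
        # A missing line means we did not get usable output for this
        # node, which must not be mistaken for a match.
        raise ValueError("no 'Latest checkpoint location' line found")
    return match.group(1)


def all_nodes_match(outputs: dict[str, str]) -> bool:
    """outputs maps a node name to that node's pg_controldata output.

    Returns True only if every node reports the same location.
    """
    locations = {node: checkpoint_location(text)
                 for node, text in outputs.items()}
    return len(set(locations.values())) == 1
```

Raising on a missing line plays the role of checking the line count
against the node count before the sort -u step: a node with no output
fails loudly instead of silently dropping out of the comparison.
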

Cheers!

>
> Thanks in advance,
> Richard
>
> On 2025-02-20 08:49, Jerry Sievers wrote:
>> richard@kojedz.in writes:
>>
>>> Dear Alvaro,
>>> Thanks for your answers. Unfortunately, I was unaware of the shutdown
>>> record; that makes a difference. So I definitely must stop the
>>> primary first, then use pg_controldata to obtain the checkpoint
>>> info. Can I then query the replicas while they are up and running to
>>> see whether they've received the shutdown record? After shutting down
>>> the primary, how will I know that a replica has received the
>>> record and is safe to shut down?
>> Hmmm, not sure about that, but what we do is stop the primary, wait a
>> $short time, then stop the replicas...
>>
>> Then run pg_controldata on all nodes | filter out only the line
>> indicating the latest checkpoint and sort -u the output.  Expect only a
>> single line if all are matched.
>>
>> You may also wish to first ensure that you got the same number of
>> lines as the total node count before doing the sorting and uniquing.
>>
>> Very rarely on our huge systems we'd have a mismatch after the
>> verification; in those cases, our automated upgrade procedure
>> restarts all nodes and then does the shutdown and verify check again.
>>
>> HTH
>>
>>> Thanks for the clarifications.
>>> Best regards,
>>> Richard
>>> On 2025-02-19 16:54, Álvaro Herrera wrote:
>>>> On 2025-Feb-19, richard@kojedz.in wrote:
>>>>
>>>>> With this, I have a question: after the shutdown of the primary,
>>>>> what is the guarantee that the replicas have the same checkpoint
>>>>> location? Why does the order of shutting down the servers matter?
>>>>> What would be the exact and reliable way to ensure that the replicas
>>>>> will have the same checkpoint location as the primary?
>>>> The replicas can't write WAL by themselves, but they will replay
>>>> whatever the primary has sent; by shutting down the primary first and
>>>> letting the replicas catch up, you ensure that the replicas will
>>>> actually receive the shutdown record and replay it.  If you shut down
>>>> the replicas first, they can obviously never catch up with the
>>>> shutdown checkpoint of the primary.
>>>>
>>>> As I recall, if you do shut down the primary first, one potential
>>>> danger is that the primary fails to send the checkpoint record before
>>>> shutting down, so the replicas won't receive it and obviously will not
>>>> replay it; or simply that they are behind enough that they receive it
>>>> but don't replay it.
>>>>
>>>> You could use pg_controldata to read the last checkpoint info from
>>>> all nodes.  You can run it on the primary after shutting it down, and
>>>> then on each replica while it's still running to ensure that the
>>>> correct restartpoint has been created.