MIME-Version: 1.0
Date: Mon, 24 Feb 2025 10:16:20 +0100
From: richard@kojedz.in
To: Jerry Sievers <gsievers19@comcast.net>
Cc: pgsql-admin@lists.postgresql.org
Subject: Re: In-place upgrade with streaming replicas
In-Reply-To: <m2tt8okkzs.fsf@comcast.net>
References: <202502191554.6asefyczl7jn@alvherre.pgsql>
 <d438397ad2c67e0ce683bc3158746691@kojedz.in> <m2y0y1kqds.fsf@comcast.net>
 <71ed34f8c1ed858262a7cf50b633682f@kojedz.in> <m2tt8okkzs.fsf@comcast.net>
Message-ID: <d8259a07426c0181eb3dc63f36373177@kojedz.in>
Content-Type: text/plain; charset=UTF-8;
 format=flowed
Content-Transfer-Encoding: 8bit
Archived-At: 
 <https://www.postgresql.org/message-id/d8259a07426c0181eb3dc63f36373177%40kojedz.in>
Precedence: bulk

Dear Jerry,

Thanks for sharing your experiments, I will implement our upgrades in a 
similar way. Terminate/restart on different port, wait for catchup, stop 
primary, check replicas somehow (using pg_wal_lsn_diff()), then stop 
replicas too, check for pg_controldata match, and repeat if not.

Regards,
Richard

2025-02-21 04:57 időpontban Jerry Sievers ezt írta:
> richard@kojedz.in writes:
> 
>> Dear Jerry,
>> 
>> So, yes it turns out that some kind of loop must be involved here, as
>> you described:
>> 
>> 1. ensure cluster is running
>> 2. stop primary
>> 3. wait some time
>> 4. stop replicas
>> 5. check if checkpoint locations match. repeat from step 1 if
>> out-of-sync.
>> 
>> My question here is, the unreliable step here is 3rd one. Can we query
>> the replica runtime if he did catch up? I mean, that after stopping
>> the primary, we can obtain the checkpoint location from
>> pg_controldata, then, can we somehow query the running replica about
>> that?
> Assuming your client traffic has been stopped ahead of time and perhaps
> you did a lockout via HBA or other means, including forcible 
> termination
> of persistent clients (we usually do a restart of the primary to insure
> this)...
> 
> We don't wait more than a few seconds before also stopping the replicas
> and the vast majority of times all nodes are at the same checkpoint.
> 
> Cheers!
> 
>> 
>> Thanks in advance,
>> Richard
>> 
>> 2025-02-20 08:49 időpontban Jerry Sievers ezt írta:
>>> richard@kojedz.in writes:
>>> 
>>>> Dear Alvaro,
>>>> Thanks for your answers. Unfortunately, I was unaware of a shutdown
>>>> record, that makes a difference then. So, I definitely must stop the
>>>> primary first, then use pg_controldata to obtain checkpoint
>>>> info. Then, can I query the replicas while they are up and running 
>>>> if
>>>> they've received the shutdown record or not? So, after shutting down
>>>> the primary, how will I know if a replica has received the mentioned
>>>> record, and is safe to shutdown?
>>> Hmmm, not sure about that but what we do, is stop primary, wait a
>>> $short time, then stop replicas...
>>> Then run pg_controldata on all nodes | filter out only the line
>>> indicating latest checkpoint and sort -u the output.  Expect only a
>>> single line if all are matched.
>>> You may also wish to first insure that you got the same number of
>>> lines as total node count before doing the sorting and uniqueing.
>>> Very rarely on our huge systems, we'd have a mismatch after the
>>> verification in in those cases, our automated upgrade procedure
>>> restarts all nodes and then does the shutdown and verify check again.
>>> HTH
>>> 
>>>> Thanks for the clarifications.
>>>> Best regards,
>>>> Richard
>>>> 2025-02-19 16:54 időpontban Álvaro Herrera ezt írta:
>>>>> On 2025-Feb-19, richard@kojedz.in wrote:
>>>>> 
>>>>>> With this, I have the question, that after the shutdown of 
>>>>>> primary,
>>>>>> what is
>>>>>> the guarantee for replicas having the same checkpoint location? 
>>>>>> Why
>>>>>> does the
>>>>>> order of shutting down the servers matter? What would be the 
>>>>>> really
>>>>>> exact
>>>>>> and reliable way to ensure that replicas will have the same
>>>>>> checkpoint
>>>>>> location as the primary?
>>>>> The replicas can't write WAL by themselves, but they will replay
>>>>> whatever the primary has sent; by shutting down the primary first 
>>>>> and
>>>>> letting the replicas catch up, you ensure that the replicas will
>>>>> actually receive the shutdown record and replay it.  If you shut 
>>>>> down
>>>>> the replicas first, they can obviously never catch up with the
>>>>> shutdown
>>>>> checkpoint of the primary.
>>>>> As I recall, if you do shut down the primary first, one potential
>>>>> danger
>>>>> is that the primary fails to send the checkpoint record before
>>>>> shutting
>>>>> down, so the replicas won't receive it and obviously will not 
>>>>> replay
>>>>> it;
>>>>> or simply that they are behind enough that they receive it but 
>>>>> don't
>>>>> replay it.
>>>>> You could use pg_controldata to read the last checkpoint info from
>>>>> all
>>>>> nodes.  You can run it on the primary after shutting it down, and
>>>>> then
>>>>> on each replica while it's still running to ensure that the correct
>>>>> restartpoint has been created.