public inbox for [email protected]  
help / color / mirror / Atom feed
From: Japin Li <[email protected]>
To: Mats Kindahl <[email protected]>
Cc: surya poondla <[email protected]>
Cc: [email protected]
Subject: Re: pg_rewind does not rewind diverging timelines
Date: Mon, 25 May 2026 13:20:57 +0800
Message-ID: <SY7PR01MB1092190B1E748F1438CAB5D7AB60A2@SY7PR01MB10921.ausprd01.prod.outlook.com> (raw)
In-Reply-To: <CAN305gCcj4Mhr3uBQAnQCYsx6F-syp1rGtazoy=h+_EHO0xOXA@mail.gmail.com>
References: <CAN305gBeJr8m7ZRW9mH0zakEFR4hDUPDo8fJRKJOHWMORG5_Bg@mail.gmail.com>
	<CAOVWO5oZZtniLR4Pyd=e_cS-FNxh837Gbz9TDUnSwWqmbap=bw@mail.gmail.com>
	<CAN305gCcj4Mhr3uBQAnQCYsx6F-syp1rGtazoy=h+_EHO0xOXA@mail.gmail.com>


Hi, Mats

On Sun, 24 May 2026 at 20:30, Mats Kindahl <[email protected]> wrote:
> On Fri, May 22, 2026 at 12:09 AM surya poondla <[email protected]> wrote:
>
>  Hi Mats,
>
>  Thanks for picking this up -- the scenario is a real one and I think the UUID-tagging approach is a clean way to
>  solve it. v2 applies and builds without trouble, and the core algorithm reads well to me. 
>  I have a handful of observations that I'd love your thoughts.
>
> Hi Surya,
>
> Thank you for the review. It is a quite rare scenario, but it is real and the fix is simple.
>  
>  Regarding Correctness I have the below thoughts
>
>  1. UUIDv7 timestamp epoch.
>       In StartupXLOG():
>           TimestampTz now = GetCurrentTimestamp();
>           generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
>                                        (uint32)(now % 1000) * 1000);
>
>  I think there might be a small mismatch here: GetCurrentTimestamp() returns microseconds since the Postgres epoch
>  (2000-01-01), 
>  whereas generate_uuidv7_r describes its first argument as milliseconds since the Unix epoch. 
>  As written that 30-year offset would land in the UUID's timestamp field, so the resulting UUID wouldn't be a
>  conformant UUIDv7 and wouldn't
>  time-order against UUIDv7s generated through the SQL functions.
>
>  
>  
>  Uniqueness is preserved either way, so the rewind logic still works as intended but it seemed worth flagging.
>
>  I see conversion that's used elsewhere as:
>  us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
>                     * SECS_PER_DAY * USECS_PER_SEC;
>
>  Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) directly would also be fine.
>
> Yes, the intention was to use a proper timestamp to allow debugging servers if necessary. Switched to gettimeofday() and
> used 0 for sub-ms since this is not going to be critical. (We could use ns here as well, but that would only solve a race
> if you have two servers being promoted in the same ms, which I find unlikely, and there is a random number added for that
> situation.)
>  
>  2. EOR-record path, the intent is unclear.
>
>  The comment above generate_uuidv7_r() at says:
>
>  "The same UUID is written into the history file and later into the XLOG_END_OF_RECOVERY record so that pg_rewind can
>  distinguish two servers..."
>
>  But from what I can see only the history-file part actually lands.
>  xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add the UUID, and XLogCtl->ThisTimeLineUUID is
>  written under info_lck without a
>  reader (I couldn't grep it). 
>
>  The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like preparation for an EOR-struct extension that
>  ended up not being part of the patch.
>
>  Was the EOR-record piece something you intended to keep for a follow-up, or has it been superseded by the
>  history-file approach?
>
> No, the EOR changes are not needed for the promotion, contrary to what I originally thought. Cleaned up the comment and
> the code and removed all traces of changes to the EOR (I hope).
>  
>       
>
>  3. Malformed UUID handling in readTimeLineHistory().
>
>       The optional field-4 path is:
>
>           if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
>           {
>               Datum datum = DirectFunctionCall1(uuid_in,
>                                                 CStringGetDatum(uuid_str));
>               ...
>           }
>
>  uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding syntax-error paths in readTimeLineHistory
>  () use FATAL deliberately. 
>  In practice an ERROR during startup ends up being fatal too, so this isn't strictly a bug but it would be nicer to
>  stay consistent.
>
> Agree. I added code to capture the error and raise a FATAL instead (with the error message from the uuid_in, in case it
> is modified it makes sense to show this).
>  
>  Regarding the Tests I have the following thoughts
>
>  The two new cases are nice, a few extensions that I think would strengthen them:
>  1. A mixed-version case where one side has a zero UUID. That's the path we're claiming is graceful, but nothing
>  currently exercises it
>
> Yes, that should work regardless of whether the source or the target has the zero UUID.
>
> I realized one thing: if two timelines have identical TLI but one has zero UUID and one has not, it seems they could not
> come from the same promotion (one promotion happened on an old server and the other one on a new server), that is, they
> should be treated as different. Does that make sense? I made the necessary changes in the attached patches for testing.
> Please have a look.
>  
>  2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that findCommonAncestorTimeline's loop walks past
>  matching entries
>       before hitting the mismatch. The 0002 test puts the divergence at depth 1.
>
> I was unsure if this test was necessary or interesting, hence a separate commit. Since you thought it was useful, it's
> now rolled into the patch and I extended the tests with the scenarios you suggested.
>
> I also did some refactorings of the tests to avoid duplication. More below.
>  
>  3. A small assertion against the on-disk 00000002.history contents, to pin down the file format.
>  4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is the kind of thing that tends to break
>  under
>       environment changes. A CHECKPOINT on node_x before the backup, or wal_keep_size as in 0001, would let the test
>  stand on its own.
>
> Good point.
>
> I refactored the code to avoid some duplication and make the test flow self-explanatory and as part of that I set the
> wal_keep_size for all nodes.
>
> In the process I noticed that many of the functions in RewindTest.pm do the same job as the primitives I wrote, but have
> hard-coded variable names. I could rewrite them to take parameters, but that would be quite a big patch to add additional
> changes to each call site, so I did not do that and rather added small wrappers specific for the tests in
> 005_same_timeline.pl⚠️.
>  
> Attached a new version of the now single patch.
>
>  I'm happy to keep reviewing/contributing, thanks again for working on it.
>
> Thank you for reviewing it.

Thank you for your work.  I have one comment.

+	a = &tlh->source[tlh->sourceNentries - 2].tluuid;
+	b = &tlh->target[tlh->targetNentries - 2].tluuid;
+
+	if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN) == 0)
+		return true;
+
+	return memcmp(a, b, UUID_LEN) == 0;

Since we already have matchingTimelineUUID(), the above code can be simplified
using it.

-- 
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.






reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: pg_rewind does not rewind diverging timelines
  In-Reply-To: <SY7PR01MB1092190B1E748F1438CAB5D7AB60A2@SY7PR01MB10921.ausprd01.prod.outlook.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox