Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wQBbl-001LGu-37 for pgsql-hackers@arkaria.postgresql.org; Thu, 21 May 2026 22:11:14 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wQBbj-00BTA3-32 for pgsql-hackers@arkaria.postgresql.org; Thu, 21 May 2026 22:11:12 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wQBZu-00BP1X-2N for pgsql-hackers@lists.postgresql.org; Thu, 21 May 2026 22:09:19 +0000 Received: from mail-oa1-x30.google.com ([2001:4860:4864:20::30]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wQBZt-00000000A4V-1MAh for pgsql-hackers@lists.postgresql.org; Thu, 21 May 2026 22:09:18 +0000 Received: by mail-oa1-x30.google.com with SMTP id 586e51a60fabf-4042905015cso5768440fac.0 for ; Thu, 21 May 2026 15:09:17 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1779401356; cv=none; d=google.com; s=arc-20240605; b=LLpjkdMEFmt/lXtIam5qsmTHbmSpiMWqTycqkuti46ppl4LKrlZkjP/1TeQT2y2c/0 DUpAccduovRin90EWFWfeFKLwPZ1FWv7oAOIgCIeZ5DPHSIxlR/khPuFGyqNzFiHVw/M D7Hy+au0oGU5wixW0CDz4cDj9bdQN0t3EcUu26vn2nBUjDurx6XmIxLshn0XVTfbUi2o q9weaFFQoIskXQ5P4+NaWKmga8Df2d1UeR5vazVTE5hlb6QXkjE3VHZVSsRcG3idbmKO gVzAev8eaxVhzTRdrgRM8rrrGTbZg8oX55Pp0W7847qFeVkgDechlCxDV3vUEvvPv2I0 2jYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=XQjEcp7d+oo4+GyMFchUNLYSlqwMXuNOqd7w4jEpMT0=; fh=vq2sz54HwZG6YEBYE25l2N8Jty7xXA+xlgPMcJUgPDw=; b=GUhyGD7rpWMVvF1AaWaVaIxgUsC6fk9iWJ/6oJ5qjb8h0yJhYZXMSgVsq0iKEF0WnE yBkL8QolRi/BRvlvaRSq+IIs5ayN8cCmmIqN2+sFQmLoOMdzpa81VZZl7DpJcYKGB9ty oEEuOgmdKNSifs3TG4gKVkxT3PEazq86EfGhadwOk8wBjkvnPghP/tq4kBV0+BEX4BAZ 9mnRhsVWvhcaXP7PhQqZaWT56kt4KEPmdDG2YFPSBmlDtxSXAvjZ21g8tuaV2uBMugR5 WiCmgsVy6mfpm5kAYQBqC0wxIQrQpqOWUXNQUAgwNliJRydyczd//v5UEbtVIdk5V9xm oKnw==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779401356; x=1780006156; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=XQjEcp7d+oo4+GyMFchUNLYSlqwMXuNOqd7w4jEpMT0=; b=HKFyoJV0U7o3T/jTGaaf3cfJf4mSw1feA8O/I5jKBhhkvZITbRya5L6YK/6RrtHblw StHRvS4D3VjN20UakbFC63Hq4g8+bSFv2zGdMla7LE8YFhsnXRJKUZXEA/nV5TNavX6n 8sWhZCssBulRv22sV/ZDNnM01jnjnDssuH8+0Y3s9nYe8kFOJQi4cKWIkv8YkTU348yU sAgPZTY3qmzzA34R3bSOHRSIfyeHbH/RJsUcBRPG56DhSUSy1eytKRC50EQ8wbN4YbLW /PeBooAJj7T1WfC4JPy7qFIyDh5nfzXFPOdB4TKD3LmhaqkyphyS1OA9PJsxduHvXURH 1VJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779401356; x=1780006156; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=XQjEcp7d+oo4+GyMFchUNLYSlqwMXuNOqd7w4jEpMT0=; b=OKeLDAPmb1Dnxjcqz4zuTVoLJCcnqsLkYKS0s1h02Z96esqMzEb5/wZibtera2ycEE Nlw4KlH10KOTjLK7FEpSXSEPIvQJ0BfZqykszhU24Xl7zc4rxqPeTSw/BEa2DDOeUjpN ns/1SZ4ikyJc/dzMI8RWqRptNKsKT9x3O7kdJbyv6PdWnVAz+QH0gRWvhBpOOIquYXGb jgOuzAfj7yJarDM3qy7AXQMN1ha07WV93QAOqmO+XrbWtftUGF9TU/JlzQLeouIG6mx9 gV/bYPeGNJbyQKI36eQ6S8oMa1hrhSOeKdYZHTe6habmlhvd/WIRXeR4Us2kcUGu1w+u bEQQ== X-Gm-Message-State: AOJu0YxwNmaa58Se5AWmjAaZhFBeOi3b30ySpDVzpY21K69uR0FlWISd zYY6mEZotaz6HZPxqELxM9siBYfq6XI7CXk/C3zB34j8ZUgeGplV3R5p7XfBKlNI6pIt6D6tjtk o47sPtcyP4kXiJRJ8vCFRet1RZYz2d/M= X-Gm-Gg: Acq92OG9IpDdOtW22XbeCqctD/jL3IAwfUQe+dl1b8cLSg3O1c/eev4pehoAd9bFSbE nCwdClfvMJY8XPLMf56kU34o+0/n+T7riQ1wOKsjJubZH4WokZLgSDCZDvGL4YqxagEjU/MwDUM RwoALZqva7jg6l5p9DNUfDIr6L5gekGz2/N5wOA7libhLyh/0yuag4uLVw5NGNqmmsInXote5JV c0qkq1XWa9Pihn3FFpvYE5JZtRKffoHVGl/Shrk7+dVq5pKl4Wg2lGUxYFCESD9eczWcvylZHVG kBFXhL/WzCO7Ug== X-Received: by 2002:a05:6870:a798:b0:42f:ee6c:35fb with SMTP id 586e51a60fabf-43b5ada2955mr518537fac.20.1779401356628; Thu, 21 May 2026 15:09:16 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: surya poondla Date: Thu, 21 May 2026 15:09:05 -0700 X-Gm-Features: AVHnY4INY-Hg3tmNVwmvj0k467p7zxX_Qd5nXA6dOGlL-JUGHDX1P7q7yNGdapQ Message-ID: Subject: Re: pg_rewind does not rewind diverging timelines To: Mats Kindahl Cc: pgsql-hackers@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000c6579406525b2942" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000c6579406525b2942 Content-Type: text/plain; charset="UTF-8" Hi Mats, Thanks for picking this up -- the scenario is a real one and I think the UUID-tagging approach is a clean way to solve it. v2 applies and builds without trouble, and the core algorithm reads well to me. I have a handful of observations that I'd love your thoughts. Regarding Correctness I have the below thoughts 1. UUIDv7 timestamp epoch. In StartupXLOG(): TimestampTz now = GetCurrentTimestamp(); generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000), (uint32)(now % 1000) * 1000); I think there might be a small mismatch here: GetCurrentTimestamp() returns microseconds since the Postgres epoch (2000-01-01), whereas generate_uuidv7_r describes its first argument as milliseconds since the Unix epoch. As written that 30-year offset would land in the UUID's timestamp field, so the resulting UUID wouldn't be a conformant UUIDv7 and wouldn't time-order against UUIDv7s generated through the SQL functions. Uniqueness is preserved either way, so the rewind logic still works as intended but it seemed worth flagging. I see conversion that's used elsewhere as: us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY * USECS_PER_SEC; Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) directly would also be fine. 2. EOR-record path, the intent is unclear. The comment above generate_uuidv7_r() at says: "The same UUID is written into the history file and later into the XLOG_END_OF_RECOVERY record so that pg_rewind can distinguish two servers..." But from what I can see only the history-file part actually lands. xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add the UUID, and XLogCtl->ThisTimeLineUUID is written under info_lck without a reader (I couldn't grep it). The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like preparation for an EOR-struct extension that ended up not being part of the patch. Was the EOR-record piece something you intended to keep for a follow-up, or has it been superseded by the history-file approach? 3. Malformed UUID handling in readTimeLineHistory(). The optional field-4 path is: if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN) { Datum datum = DirectFunctionCall1(uuid_in, CStringGetDatum(uuid_str)); ... } uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding syntax-error paths in readTimeLineHistory() use FATAL deliberately. In practice an ERROR during startup ends up being fatal too, so this isn't strictly a bug but it would be nicer to stay consistent. Regarding the Tests I have the following thoughts The two new cases are nice, a few extensions that I think would strengthen them: 1. A mixed-version case where one side has a zero UUID. That's the path we're claiming is graceful, but nothing currently exercises it 2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that findCommonAncestorTimeline's loop walks past matching entries before hitting the mismatch. The 0002 test puts the divergence at depth 1. 3. A small assertion against the on-disk 00000002.history contents, to pin down the file format. 4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is the kind of thing that tends to break under environment changes. A CHECKPOINT on node_x before the backup, or wal_keep_size as in 0001, would let the test stand on its own. I'm happy to keep reviewing/contributing, thanks again for working on it. Regards, Surya Poondla > --000000000000c6579406525b2942 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Mats,

Thanks for picking this u= p -- the scenario is a real one and I think=C2=A0the UUID-tagging approach = is a clean way to solve it. v2 applies and=C2=A0builds without trouble, and= the core algorithm reads well to me.=C2=A0
I=C2=A0ha= ve a handful of observations that I'd love your thoughts.


Re= garding Correctness I have the below thoughts

1. UUIDv7 timestamp ep= och.
=C2=A0 =C2=A0 =C2=A0In StartupXLOG():
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0TimestampTz now =3D GetCurrentTimestamp();
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (uint32)(now % = 1000) * 1000);

I think there might be a small mismatch here: GetCurr= entTimestamp() returns microseconds since the Postgres epoch (2000-01-01),= =C2=A0
whereas generate_uuidv7_r describes its first = argument as milliseconds since the Unix epoch.=C2=A0
= As written that 30-year offset would land in the UUID's timestamp field= , so the resulting UUID wouldn't be a conformant UUIDv7 and wouldn'= t
time-order against UUIDv7s generated through the SQL functions.
Uniqueness is preserved either way, so the rewind logic still works as int= ended but it seemed worth flagging.

I see conversion that's used= elsewhere as:
us =3D ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* SECS= _PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot pat= h, gettimeofday() / time(NULL) directly would also be fine.


2. E= OR-record path, the intent is unclear.

The comment above generate_uu= idv7_r() at says:

"The same UUID is written into the history fi= le and later into the XLOG_END_OF_RECOVERY record so that pg_rewind can dis= tinguish two servers..."

But from what I can see only the histo= ry-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndO= fRecoveryRecord() doesn't add the UUID, and XLogCtl->ThisTimeLineUUI= D is written under info_lck without a
reader (I could= n't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(..= .)) change reads like preparation for an EOR-struct extension that ended up= not being part of the patch.

Was the EOR-record piece something you= intended to keep for a follow-up, or has it been superseded by the history= -file approach?
=C2=A0 =C2=A0 =C2=A0

3. Malformed UUID handling i= n readTimeLineHistory().

=C2=A0 =C2=A0 =C2=A0The optional field-4 pa= th is:

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (nfields =3D=3D 4 &&= amp; strlen(uuid_str) =3D=3D UUID_STR_LEN)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0{
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Datum datum =3D = DirectFunctionCall1(uuid_in,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0CStringGetDatum(uuid_st= r));
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0...
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0}

uuid_in() raises ereport(ERROR) on a malfo= rmed input, while the surrounding syntax-error paths in readTimeLineHistory= () use FATAL deliberately.
In practice an ERROR during startup ends up = being fatal too, so this isn't strictly a bug but it would be nicer to = stay consistent.=C2=A0


Regarding the Tests I have the following = thoughts

The two new cases are nice, a few extensions that I think w= ould strengthen them:
1. A mixed-version case where one side has a zero = UUID. That's the path we're claiming is graceful, but nothing curre= ntly exercises it
2. A deeper-divergence case (e.g. TLI1->2->3 vs = TLI1->2->3') so that findCommonAncestorTimeline's loop walks = past matching entries
=C2=A0 =C2=A0 =C2=A0before hitting the mismatch. T= he 0002 test puts the divergence at depth 1.
3. A small assertion agains= t the on-disk 00000002.history contents, to pin down the file format.
4.= On 0002=C2=A0the dependency on restore_command pointing at node_x's pg= _wal is the kind of thing that tends to break under
=C2=A0 =C2=A0 =C2=A0= environment changes. A CHECKPOINT on node_x before the backup, or wal_keep_= size as in 0001, would let the test stand on its own.


I'm ha= ppy to keep reviewing/contributing, thanks again for=C2=A0working on it.
Regards,
Surya Poondla
--000000000000c6579406525b2942--