public inbox for [email protected]  
help / color / mirror / Atom feed
From: Masahiko Sawada <[email protected]>
To: Bharath Rupireddy <[email protected]>
Cc: Srinath Reddy Sadipiralla <[email protected]>
Cc: SATYANARAYANA NARLAPURAM <[email protected]>
Cc: Hayato Kuroda (Fujitsu) <[email protected]>
Cc: John H <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Introduce XID age based replication slot invalidation
Date: Mon, 30 Mar 2026 17:13:08 -0700
Message-ID: <CAD21AoB4MsTpG5JEkifght_tQ91VHJO_8kpsDCrG-z9qkkmN5g@mail.gmail.com> (raw)
In-Reply-To: <CALj2ACU=A31kqHELyaF-=2vnyed=cO2JNQt+vY92KtHLF-m8sg@mail.gmail.com>
References: <CA+-JvFsMHckBMzsu5Ov9HCG3AFbMh056hHy1FiXazBRtZ9pFBg@mail.gmail.com>
	<OSCPR01MB149667506BCFD684FEFDCB919F511A@OSCPR01MB14966.jpnprd01.prod.outlook.com>
	<CALj2ACUmPbkcj4y4oeXvzUkBejG68QDtrFF7QHDC_qz2vQcTCg@mail.gmail.com>
	<CAHg+QDfnK7tQxsEZox=kOkYfqANmL70mwn0N=eRrJxE1Z+1ygg@mail.gmail.com>
	<CALj2ACX_o+dKeAaK76mpAtG646UnDHpGUWziUkCvicVz8mz6=A@mail.gmail.com>
	<CAD21AoATM=Un8ejnbcDQ7q+CaCoxpkA7Ln9bacvQEoymqvjPug@mail.gmail.com>
	<CALj2ACUmUb=CLEyfsQrW0WAkF6Y9fiBfG6pnPjepfOM7A1XReA@mail.gmail.com>
	<CAD21AoAg6x57a8LoP2s+0vgizp9QGHcLGJL9bwh7kzEJq3arBg@mail.gmail.com>
	<CALj2ACWcaSkfMAQu3s6ZkTZuoFvVRD=DkxXbBwC33PL9+kzsqw@mail.gmail.com>
	<CAD21AoBEBqQONiZxvnUYOu814yB2tRPrmX=7KqvL+f3ae7250w@mail.gmail.com>
	<CALj2ACUenekLgjMr8x4DyuU=zKZ4eNqW9XF-1PovSctkY2VA0Q@mail.gmail.com>
	<CAFC+b6pO44=zGqwijzrcyGGTYCM51Y7zS5uQX0_nWjsxW9i3QQ@mail.gmail.com>
	<CALj2ACU=A31kqHELyaF-=2vnyed=cO2JNQt+vY92KtHLF-m8sg@mail.gmail.com>

On Sun, Mar 29, 2026 at 6:35 PM Bharath Rupireddy
<[email protected]> wrote:
>
> Hi,
>
> On Sun, Mar 29, 2026 at 1:16 PM Srinath Reddy Sadipiralla
> <[email protected]> wrote:
> >
> > Hello,
> >
> > Thanks for the v5 patch set, I have reviewed and did initial testing on
> > v5 patch set, and it LGTM, except these
>
> Thank you for reviewing and testing. I appreciate it.
>
> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index 286f0f46341..c2ff7e464f0 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1849,7 +1849,7 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> >                                 else
> >                                 {
> >                                         /* translator: %s is a GUC variable name */
> > -                                       appendStringInfo(&err_detail, _("The slot's xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
> > +                                       appendStringInfo(&err_detail, _("The slot's catalog_xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
> >                                                                          catalog_xmin, (int32) (recentXid - catalog_xmin), "max_slot_xid_age", max_slot_xid_age);
> >                                 }
>
> Fixed the typo.
>
> > while testing the active slot XID age invalidation (SIGTERM path) , i
> > observed that slot got invalidated , walsender was killed because of
> > SIGTERM , then starts the infinite-retry-cycle problem where
> > walreceiver starts walsender and walsender will try to use an invalidated
> > slot and dies, will think more on this.
>
> I would like to clarify that once a slot is invalidated due to any of
> the reasons (ReplicationSlotInvalidationCause), it becomes unusable;
> the sender will error out if the receiver tries to use it. This is
> consistent with all existing slot invalidation mechanisms.
>
> Please find the attached v6 patches fixing the typo for further review.
>

I've reviewed the v6 patch. Here are some comments.

 bool
 vacuum_get_cutoffs(Relation rel, const VacuumParams params,
-                  struct VacuumCutoffs *cutoffs)
+                  struct VacuumCutoffs *cutoffs,
+                  TransactionId *slot_xmin,
+                  TransactionId *slot_catalog_xmin)

How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?

---
-   if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED |
RS_INVAL_IDLE_TIMEOUT,
+   possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
+       RS_INVAL_XID_AGE;
+
+   if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
                                           _logSegNo, InvalidOid,
+                                          InvalidTransactionId,
+                                          max_slot_xid_age > 0 ?
+                                          ReadNextTransactionId() :
                                           InvalidTransactionId))

It's odd to me that we specify RS_INVAL_XID_AGE while passing
InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
with a valid recentXId only when we'd like to check the slots based on
their XIDs.

---
+   /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
+   if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
+   {
+       TransactionId xidLimit;
+
+       Assert(TransactionIdIsValid(recentXid));
+
+       xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
+

I think we can avoid calculating xidLimit for every slot by
calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
DetermineSlotInvalidationCause().

---
  */
 TransactionId
 GetOldestNonRemovableTransactionId(Relation rel)
+{
+   return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
+}
+
+/*
+ * Same as GetOldestNonRemovableTransactionId(), but also returns the
+ * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
+ * call.  This avoids a separate ProcArrayLock acquisition when the caller
+ * needs both values.
+ */
+TransactionId
+GetOldestNonRemovableTransactionIdExt(Relation rel,
+                                     TransactionId *slot_xmin,
+                                     TransactionId *slot_catalog_xmin)
 {

I understand that the primary reason why the patch introduces another
variant of GetOldestNonRemovableTransactionId() is to avoid extra
ProcArrayLock acquision to get replication slot xmin and catalog_xmin.
While it's not very elegant, I find that it would not be bad because
otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
every table to vacuum. The ProcArrayLock is already known
high-contented lock it would be better to avoid taking it once more.
If others think differently, we can just call
ProcArrayGetReplicationSlotXmin() separately and compare them to the
limit of XID-age based slot invalidation.

Having said that, I personally don't want to add new instructions to
the existing GetOldestNonRemovableTransactionId(). I guess we might
want to make both the existing function and new function call a common
(inline) function that takes ComputeXidHorizonsResult and returns
appropriate transaction id based on the given relation .

---
+   # Do some work to advance xids
+   $node->safe_psql(
+       'postgres', qq[
+       do \$\$
+       begin
+       for i in 1..$nxids loop
+           -- use an exception block so that each iteration eats an XID
+           begin
+           insert into $table_name values (i);
+           exception
+           when division_by_zero then null;
+           end;
+       end loop;
+       end\$\$;
+   ]);

I think it's fater to use pg_current_xact_id() instead.

---
+   else
+   {
+       $node->safe_psql('postgres', "VACUUM");
+   }

We don't need to vacuum all tables here.

---
+# Configure primary with XID age settings. Set autovacuum_naptime high so
+# that the checkpointer (not vacuum) triggers the invalidation.
+my $max_slot_xid_age = 500;
+$primary5->append_conf(
+   'postgresql.conf', qq{
+max_slot_xid_age = $max_slot_xid_age
+autovacuum_naptime = '1h'
+});

I think that it's better to disable autovacuum than setting a large number.

---
+# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
+# GUC (via checkpoint).

I think that we can say "physical slot" instead of standby's slot to
avoid confusion as I thought standby's slot is a slot created on the
standby at the first glance.

---
Do we have tests for invalidating slots on the standbys?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com





view thread (31+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Introduce XID age based replication slot invalidation
  In-Reply-To: <CAD21AoB4MsTpG5JEkifght_tQ91VHJO_8kpsDCrG-z9qkkmN5g@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox