Re: Adding REPACK [concurrently]

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Amit Kapila <[email protected]>
To: Antonin Houska <[email protected]>
Cc: Alvaro Herrera <[email protected]>
Cc: Mihail Nikalayeu <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Srinath Reddy Sadipiralla <[email protected]>
Cc: Matthias van de Meent <[email protected]>
Cc: Pg Hackers <[email protected]>
Cc: Robert Treat <[email protected]>
Subject: Re: Adding REPACK [concurrently]
Date: Fri, 8 May 2026 17:55:12 +0530
Message-ID: <CAA4eK1LygCDP3FiFzXY9iVNFcHxhf7TT_DFf7tryTu2oipmfpA@mail.gmail.com> (raw)
In-Reply-To: <77611.1778055944@localhost>
References: <[email protected]>
	<77611.1778055944@localhost>

On Wed, May 6, 2026 at 1:55 PM Antonin Houska <[email protected]> wrote:
>
> Alvaro Herrera <[email protected]> wrote:
>
> > On 2026-May-05, Antonin Houska wrote:
> >
> > > However, I failed to notice that COMMIT record of
> > > a transaction listed in the xl_running_xacts WAL record is not guaranteed to
> > > follow the xl_running_xacts record in WAL. In other words, even if
> > > xl_running_xacts is created before a COMMIT record of the contained
> > > transaction, it may end up at higher LSN in WAL. So the cleanup I relied on
> > > might not take place.
> >
> > That's pretty bad news.
> >
> > > I've got no good idea how to fix that.
>
> One idea occurred to me yet, effectively it's just a cleanup. Part of it was
> already proposed [1].
>

Some issues/inefficiencies regarding this fix and base code related to
db-specific snapshots built during decoding:
*
After fix, whenever a db-specific decoder sees a cluster-wide
xl_running_xacts record, it unconditionally calls
LogStandbySnapshot(MyDatabaseId) and returns. This triggers for every
cluster-wide record the decoder encounters (including post snapbuild's
CONSISTENT state) , for the entire duration of the decoding session.
LogStandbySnapshot acquires ProcArrayLock + XidGenLock, calls
GetRunningTransactionData, and writes WAL. With N active db-specific
decoding sessions, each cluster-wide record now triggers N additional
WAL writes.

Additionally, LogStandbySnapshot also logs AccessExclusiveLocks before
the running_xacts record. Physical standbys skip db-specific
XLOG_RUNNING_XACTS records via standby_redo(), but they do process the
preceding XLOG_STANDBY_LOCK records. The same locks may already have
been logged in the most recent cluster-wide snapshot. Physical
standbys could end up processing these lock records twice which may
not be harmful because I think we avoid re-acquiring the lock but
still it is a new overhead in the system.

*
When a cluster-wide running_xacts record arrives:
SnapBuildProcessRunningXacts calls LogStandbySnapshot and returns
early. ReorderBufferAbortOld is called, but with the cluster-wide
oldestRunningXid, which could lag far behind the db-specific value
(due to a long-running transaction in another database).
When a db-specific record arrives: SnapBuildProcessRunningXacts
processes it and advances builder->xmin with the db-specific (more
current) oldestRunningXid. But ReorderBufferAbortOld is NOT called for
db-specific records. This means the reorder buffer is cleaned up using
a conservative, potentially very old, cluster-wide oldestRunningXid,
even though builder->xmin has already advanced much further. The
reorder buffer holds stale entries longer than necessary, increasing
memory pressure.

*
I also see a design level problem with plugins that have
need_shared_catalogs=false and use failover slots. IIUC, the
db-specific optimization was designed around a live decoding session
on the primary which can emit and immediately read its own db-specific
records in the WAL stream to reach consistent state. The
LogicalSlotAdvanceAndCheckSnapState path used by slotsync has a
bounded WAL window and cannot exploit the feedback loop, making the
two mechanisms fundamentally incompatible. I know the slot created by
pgrepack doesn't enable failover option but we have not even added any
guards or thought about db-specific snapbuilds with other parts of the
system that rely on cluster-wide running_xact records, so there could
be more problems which we don't see with the current set of tests.

-- 
With Regards,
Amit Kapila.

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Adding REPACK [concurrently]
  In-Reply-To: <CAA4eK1LygCDP3FiFzXY9iVNFcHxhf7TT_DFf7tryTu2oipmfpA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox