Re: Adding REPACK [concurrently]


From: Antonin Houska <[email protected]>
To: Amit Kapila <[email protected]>
Cc: Mihail Nikalayeu <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Alvaro Herrera <[email protected]>
Cc: Srinath Reddy Sadipiralla <[email protected]>
Cc: Matthias van de Meent <[email protected]>
Cc: Pg Hackers <[email protected]>
Cc: Robert Treat <[email protected]>
Subject: Re: Adding REPACK [concurrently]
Date: Tue, 12 May 2026 13:08:20 +0200
Message-ID: <40976.1778584100@localhost> (raw)
In-Reply-To: <CAA4eK1L1SCa7LjOVzq3zmmqpSV9o7z7VBkPjbE6=2iRLTSBvXw@mail.gmail.com>
References: <CAA4eK1Jg21ODQ7fS2fvN5W_S5kDRhAP5inj3XMRQaa=s-GbYhw@mail.gmail.com>
	<[email protected]>
	<cdgw4sbbfcgk6du3iv54r2dgiy4tfywoklbotlmj4irxavdcr3@glxfw5jj277q>
	<227677.1775576304@localhost>
	<pveffyxhnuurhb44uzqlwo3rkyzorkfh2rot7uwzlf2axhfvbp@7nrs2omysxkc>
	<CAA4eK1JhuT5fyTosWDZ+Pgs+j7xEjObTyRMn80uNKgi_ivqHbw@mail.gmail.com>
	<CADzfLwWnbKcb3v8sStdgNE=WNc3uUqx5SiS4zftX2UaEfNzG5w@mail.gmail.com>
	<85813.1777901089@localhost>
	<27869.1777985266@localhost>
	<CAA4eK1KC6CGN-N2bUffSign8Sw4q6=8d3L-Xh4t+50GCdQb6zw@mail.gmail.com>
	<70574.1778512672@localhost>
	<82942.1778527801@localhost>
	<CAA4eK1L1SCa7LjOVzq3zmmqpSV9o7z7VBkPjbE6=2iRLTSBvXw@mail.gmail.com>

Amit Kapila <[email protected]> wrote:

> On Tue, May 12, 2026 at 1:00 AM Antonin Houska <[email protected]> wrote:
> >
> > Antonin Houska <[email protected]> wrote:
> >
> > > Amit Kapila <[email protected]> wrote:
> > >
> > > > On Tue, May 5, 2026 at 6:17 PM Antonin Houska <[email protected]> wrote:
> > > > >
> > > > > Antonin Houska <[email protected]> wrote:
> > > > >
> > > > > I think the problem is that with database-specific snapshot,
> > > > > SnapBuildProcessRunningXacts() returns early, w/o adjusting builder->xmin
> > > > >
> > > > >         /*
> > > > >          * Database specific transaction info may exist to reach CONSISTENT state
> > > > >          * faster, however the code below makes no use of it. Moreover, such
> > > > >          * record might cause problems because the following normal (cluster-wide)
> > > > >          * record can have lower value of oldestRunningXid. In that case, let's
> > > > >          * wait with the cleanup for the next regular cluster-wide record.
> > > > >          */
> > > > >         if (OidIsValid(running->dbid))
> > > > >                 return;
> > > > >
> > > > > and thus some transactions whose XID is below running->oldestRunningXid may
> > > > > continue to be incorrectly considered running.
> > > > >
> > > > > I originally thought that this should not happen because such transactions
> > > > > will be added to the builder's array of committed transactions by
> > > > > SnapBuildCommitTxn() anyway. However, I failed to notice that COMMIT record of
> > > > > a transaction listed in the xl_running_xacts WAL record is not guaranteed to
> > > > > follow the xl_running_xacts record in WAL. In other words, even if
> > > > > xl_running_xacts is created before a COMMIT record of the contained
> > > > > transaction, it may end up at higher LSN in WAL. So the cleanup I relied on
> > > > > might not take place.
> > > > >
> > > >
> > > > BTW, is it possible to write a test by using injection_points or via
> > > > manual steps (by using debugger, etc) so that we can more clearly
> > > > understand this problem and proposed fix?
> > >
> > > So far I could observe the situation in WAL, but have no idea how it can
> > > happen. For example, transaction 49242 gets committed here
> > >
> > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 49242, lsn: 0/18BC28C8, prev 0/18BC2890, desc: COMMIT 2026-05-11 16:38:16.603265 CEST
> > >
> > > and then it appears in the 'xids' list of RUNNING_XACTS:
> > >
> > > rmgr: Standby len (rec/tot): 106/ 106, tx: 0, lsn: 0/18BC3140, prev 0/18BC3100, desc: RUNNING_XACTS nextXid 49255 latestCompletedXid 49241 oldestRunningXid 49242; 13 xacts: 49248 49249 49246 49243 49252 49251 49244 49245 49242 49250 49253 49254 49247; dbid:5
> > >
> > >
> > > I thought the situation is quite common (and therefore nothing of
> > > SnapBuildProcessRunningXacts() should be skipped), but when trying to
> > > reproduce the problem, I noticed that LogStandbySnapshot() shouldn't allow
> > > that ordering issue when logical decoding is enabled:
> > >
> > >       /*
> > >        * GetRunningTransactionData() acquired ProcArrayLock, we must release it.
> > >        * For Hot Standby this can be done before inserting the WAL record
> > >        * because ProcArrayApplyRecoveryInfo() rechecks the commit status using
> > >        * the clog. For logical decoding, though, the lock can't be released
> > >        * early because the clog might be "in the future" from the POV of the
> > >        * historic snapshot. This would allow for situations where we're waiting
> > >        * for the end of a transaction listed in the xl_running_xacts record
> > >        * which, according to the WAL, has committed before the xl_running_xacts
> > >        * record. Fortunately this routine isn't executed frequently, and it's
> > >        * only a shared lock.
> > >        */
> > >       if (!logical_decoding_enabled)
> > >               LWLockRelease(ProcArrayLock);
> > >
> > > So I don't have the answer right now.
> >
> > I think now that "waiting for the end of a transaction listed in the
> > xl_running_xacts record" in the comment is about transaction removal from
> > procarray. However, the COMMIT record can still be ahead of xl_running_xacts
> > because RecordTransactionCommit() is called before
> > ProcArrayEndTransaction().
> >
> 
> I see your point. Due to this, once the xmin regresses based on
> cluster-wide running_xact, some transaction could appear to be running
> when it should have appeared as committed.

The problem is that xmin does not advance when it should. Attached is a test
that reproduces the problem (it includes [1], to handle injection points in a
background worker). I hope the comments in the specification file are helpful.

It's not exactly the problem reported in the stress test, but IMO the core
issue is the same: the effects of some transactions are lost. In the stress
test a tuple deletion was lost, hence the error "could not create unique
index". Here I only demonstrate a lost INSERT.
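
To make the ordering issue easier to follow, here is a toy model in C. It is
only a sketch under simplifying assumptions, not the snapbuild.c code: all
names (ToyBuilder, toy_process_record, toy_replay_scenario) are hypothetical,
XIDs are compared as plain integers without wraparound handling, and the
second RUNNING_XACTS record with oldestRunningXid 49255 is invented for
illustration. Decoding starts at the RUNNING_XACTS record that lists 49242, so
the earlier COMMIT of 49242 is never seen by the builder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_COMMITTED 16

typedef uint32_t Xid;           /* toy XID, no wraparound handling */

typedef enum { REC_COMMIT, REC_RUNNING_XACTS } RecType;

typedef struct
{
    RecType type;
    Xid     xid;                /* REC_COMMIT: the committed XID */
    Xid     oldest_running;     /* REC_RUNNING_XACTS: oldestRunningXid */
    bool    db_specific;        /* REC_RUNNING_XACTS: record has a dbid */
} WalRec;

typedef struct
{
    Xid     xmin;               /* XIDs below this count as finished */
    Xid     committed[TOY_MAX_COMMITTED];
    size_t  ncommitted;         /* commits seen while decoding */
} ToyBuilder;

/* True if the toy builder still regards xid as running. */
static bool
toy_xid_seen_as_running(const ToyBuilder *b, Xid xid)
{
    if (xid < b->xmin)
        return false;
    for (size_t i = 0; i < b->ncommitted; i++)
        if (b->committed[i] == xid)
            return false;
    return true;
}

/* Apply one record; skip_db_specific mimics the early return. */
static void
toy_process_record(ToyBuilder *b, const WalRec *rec, bool skip_db_specific)
{
    if (rec->type == REC_COMMIT)
    {
        assert(b->ncommitted < TOY_MAX_COMMITTED);
        b->committed[b->ncommitted++] = rec->xid;
        return;
    }
    if (skip_db_specific && rec->db_specific)
        return;                 /* xmin is left untouched */
    if (rec->oldest_running > b->xmin)
        b->xmin = rec->oldest_running;
}

/* Replay the scenario; returns whether 49242 still looks running. */
static bool
toy_replay_scenario(bool skip_db_specific)
{
    const WalRec wal[] = {
        {REC_COMMIT, 49242, 0, false},          /* before the start point */
        {REC_RUNNING_XACTS, 0, 49242, true},    /* decoding starts here */
        {REC_RUNNING_XACTS, 0, 49255, true},    /* hypothetical later record */
    };
    const size_t start = 1;
    ToyBuilder  b = {.xmin = wal[start].oldest_running};

    /* The COMMIT at index 0 precedes the start point and is never decoded. */
    for (size_t i = start + 1; i < sizeof(wal) / sizeof(wal[0]); i++)
        toy_process_record(&b, &wal[i], skip_db_specific);

    return toy_xid_seen_as_running(&b, 49242);
}
```

In this model, toy_replay_scenario(true) reports 49242 as still running
because nothing ever moves xmin past it, while toy_replay_scenario(false) lets
the later record advance xmin so that 49242 counts as finished.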

> Assuming the problematic case is something
> like what I described, even then the fix of skipping cluster-wide
> running xacts and instead LOG db-specific running_xacts to help
> updating builder's xmin sounds inelegant and probably inefficient. For
> example, I think such a dependency means we can never enable
> db-specific snapshots on standby.

ok

[1] https://www.postgresql.org/message-id/4703.1774250534%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com