Re: Optimize SnapBuild by maintaining committed.xip in sorted order

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Xuneng Zhou <[email protected]>
To: pgsql-hackers <[email protected]>
Cc: Masahiko Sawada <[email protected]>
Subject: Re: Optimize SnapBuild by maintaining committed.xip in sorted order
Date: Tue, 16 Dec 2025 18:52:13 +0800
Message-ID: <CABPTF7Vty_fsJLpj6=Xn5MM5PXiV_n=XeN87S+7JhSmROYwCAA@mail.gmail.com> (raw)
In-Reply-To: <CABPTF7WcZ4_xvPeBh9Ws3_AJ3u67gHdHSAmE5Pp+0pDppY=ZvA@mail.gmail.com>
References: <CABPTF7XiwbA38OZBj5Y2F-q+fZ=03YFN9iFnb_406F4SfE-f4w@mail.gmail.com>
	<CABPTF7WcZ4_xvPeBh9Ws3_AJ3u67gHdHSAmE5Pp+0pDppY=ZvA@mail.gmail.com>

Hi,

On Tue, Dec 16, 2025 at 6:20 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi hackers,
> >
> > I'd like to propose an optimization for logical decoding's snapshot building
> > mechanism that eliminates repeated sorting overhead once a replication slot
> > reaches the CONSISTENT state.
> >
> > 1) Problem
> >
> > Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
> > every call using qsort(). Once a logical decoding slot reaches CONSISTENT
> > state, snapshots are built frequently—after every catalog-modifying transaction
> > commit. This repeated O(n log n) sorting becomes a performance bottleneck as
> > the committed transaction array grows.
> >
> > The existing code even has a TODO comment acknowledging this inefficiency:
> > "TODO: It's unclear whether that reasoning has much merit. Every time we add
> > something here after becoming consistent will also require distributing a
> > snapshot."
> >
> > 2) Proposed Solution
> >
> > Maintain sorted order via batch merge on each commit:
> >
> > 1. Collect all relevant XIDs (including subtransactions) into a batch array
> > 2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
> > 3. Merge into the global array: O(n + m) via reverse in-place merge
> >
> > While per-commit cost increases from O(m) to O(m log m + n), this is offset by
> > eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
> > builds snapshots after each catalog-modifying commit, the amortized cost is
> > significantly lower.
> >
> > 3) Performance Impact
> >
> > Expected improvements in CONSISTENT state:
> > - Eliminates O(n log n) qsort on every snapshot build
> > - Replaces with O(m log m + n) merge per commit
> > - For typical cases where m << n and snapshots are frequent, this is a win
> >
> >
> > The benchmark results for this patch are shown below.
> >
> > 4) Configuration
> >
> > shared_buffers = '4GB'
> > wal_level = logical
> > max_replication_slots = 10
> > max_wal_senders = 10
> > log_min_messages = warning
> > max_connections = 600
> > autovacuum = off
> > checkpoint_timeout = 15min
> > max_wal_size = 4GB
> >
> >
> > 5) Workloads
> >
> > # Workload 1: DDL - High frequency, small commits
> > create_ddl_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/ddl.sql" <<'SQL'
> > -- High-frequency small catalog commits
> > -- Each commit triggers SnapBuildBuildSnapshot
> > DO $$
> > DECLARE
> >   tbl text := format('temp_%s_%s',
> >                      current_setting('application_name', true),
> >                      floor(random()*1e9)::int);
> > BEGIN
> >   EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
> > DROP', tbl);
> >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> >   EXECUTE format('SELECT * FROM %I', tbl);
> > END$$;
> > SQL
> > }
> >
> > # Workload 2: MIXED - mix of DDL and DML, 50%-50%
> > create_mixed_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> > -- DDL workload (catalog changes)
> > DO $$
> > DECLARE
> >   tbl text := format('t_%s_%s',
> >                      current_setting('application_name', true),
> >                      floor(random()*1e9)::int);
> > BEGIN
> >   EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
> >   EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> > END$$;
> >
> > SQL
> >   cat >"$ROOT/mixed_dml.sql" <<'SQL'
> > -- DML workload (no catalog changes)
> > INSERT INTO app_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> > # Workload 3: CONTROL - Pure DML, no catalog changes
> > create_control_workload() {
> >   local ROOT="$1"
> >   cat >"$ROOT/control.sql" <<'SQL'
> >
> > -- Pure DML, no catalog changes
> > -- Should show no difference between baseline and patched
> > INSERT INTO control_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> >
> > 6) Test strategy
> >
> > Clients: 100 concurrent connections per workload
> > Duration: 40 seconds per run
> > Repetitions: 1 run per workload type
> > Replication:  Logical replication slot using `test_decoding` plugin
> > with `EXPORT_SNAPSHOT=true`
> >
> > Workload Types:
> > 1. DDL - Pure catalog churn (temp table create/drop)
> > 2. Mixed- 50% DDL + 50% DML (workload)
> > 3. Control - Pure DML (no catalog changes)
> >
> > Measurement:
> > - Metrics captured from pg_stat_replication_slots before/after each run
> > - Primary metrics: total_txns (transactions decoded) and total_bytes
> > (data volume)
> > - Compared baseline (vanilla PostgreSQL) vs patched (sorted
> > committed.xip optimization)
> >
> >
> > 7) Performance results
> >
> > DDL Workload: +235% Decoder Improvement
> > Decoder throughput:  713.76 → 2396.52 txns/sec  (+235%)
> > Throughput (MB/s):   672.67 → 1747.22 MB/s     (+159%)
> > Decode efficiency:   9.46% → 33.29%             (+23.83 points)
> >
> > Mixed Workload: No Change
> > Decoder throughput:  2751.10 → 2730.00 txns/sec  (0%)
> > Decode efficiency:   40.47% → 40.47%             (unchanged)
> >
> > We can see that the qsort overhead in SnapBuild has been eliminated in
> > the flamegraphs in the mixed workload. However, the performance
> > improvement was not observed as in the DDL workload. My guess is that,
> > for DML workloads, ReorderBufferApplyChange has become the new
> > hotspot.
> >
> > DML Workload: No Change
> > Decoder throughput:  3062.57 → 3066.37 txns/sec  (0%)
> > Decode efficiency:   49.97% → 49.97%             (unchanged)
> >
> > === Workload: ddl ===
> > Client commits/sec:
> >   Baseline:  7545.76 commits/sec
> >   Patched:   7198.21 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline:  713.76 txns/sec  (672.67 MB/s)
> >   Patched:   2396.52 txns/sec  (1747.22 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline:  309376 committed  →  29264 decoded  (9.46%)
> >   Patched:   302325 committed  →  100654 decoded  (33.29%)
> >
> > Total decoded (all reps):
> >   Baseline:  29264 txns  (27579.53 MB)
> >   Patched:   100654 txns  (73383.10 MB)
> >
> >   Decoder improvement: +235.00% (txns/sec)
> >   Decoder improvement: +159.00% (MB/s)
> >   Efficiency improvement: +23.83% points (more transactions decoded
> > per committed)
> >
> > === Workload: mixed ===
> > Client commits/sec:
> >   Baseline:  6797.50 commits/sec
> >   Patched:   6745.26 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline:  2751.10 txns/sec  (210.35 MB/s)
> >   Patched:   2730.00 txns/sec  (205.64 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline:  285495 committed  →  115546 decoded  (40.47%)
> >   Patched:   283301 committed  →  114660 decoded  (40.47%)
> >
> > Total decoded (all reps):
> >   Baseline:  115546 txns  (8834.71 MB)
> >   Patched:   114660 txns  (8636.96 MB)
> >
> >   ≈  Decoder unchanged: 0.00% (txns/sec)
> >
> > === Workload: DML ===
> > Client commits/sec:
> >   Baseline:  6129.24 commits/sec
> >   Patched:   6136.93 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> >   Baseline:  3062.57 txns/sec  (0.26 MB/s)
> >   Patched:   3066.37 txns/sec  (0.26 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> >   Baseline:  257428 committed  →  128628 decoded  (49.97%)
> >   Patched:   251614 committed  →  125721 decoded  (49.97%)
> >
> > Total decoded (all reps):
> >   Baseline:  128628 txns  (10.98 MB)
> >   Patched:   125721 txns  (10.69 MB)
> >
> >   ≈  Decoder unchanged: 0.00% (txns/sec)
> >
> > 8) Potential regression
> >
> > The potential regression point could be before the slot reaches the
> > CONSISTENT state, particularly when building_full_snapshot is set to
> > true. In this phase, all transactions including those that don’t
> > modify the catalog — must be added to the committed.xip array. These
> > XIDs don’t require later snapshot builds or sorting, so the
> > batch-insert logic increases the per-insert cost from O(1) to O(m + n)
> > without providing a direct benefit.
> >
> > However, the impact of this regression could be limited. The system
> > remains in the pre-CONSISTENT phase only briefly during initial
> > snapshot building, and the building_full_snapshot = true case is rare,
> > mainly used when creating replication slots with the EXPORT_SNAPSHOT
> > option.
> >
> > Once the slot becomes CONSISTENT, only catalog-modifying transactions
> > are tracked in committed.xip, and the patch reduces overall
> > snapshot-building overhead by eliminating repeated full-array sorts.
> >
> > We could also adopt a two-phase approach — keeping the current
> > behavior before reaching the CONSISTENT state and maintaining a sorted
> > array only after that point. This would preserve the performance
> > benefits while avoiding potential regressions. However, it would
> > introduce additional complexity and potential risks in handling the
> > state transitions.
> >
> > if (builder->state < SNAPBUILD_CONSISTENT)
> > {
> > /* ensure that only commits after this are getting replayed */
> > if (builder->start_decoding_at <= lsn)
> > builder->start_decoding_at = lsn + 1;
> >
> > /*
> > * If building an exportable snapshot, force xid to be tracked, even
> > * if the transaction didn't modify the catalog.
> > */
> > if (builder->building_full_snapshot)
> > {
> > needs_timetravel = true;
> > }
> > }
>
> After some consideration, a two-phase sorting strategy seems feasible
> to implement in a relatively straightforward manner. So it's done in
> v2. I also plan to run benchmarks to evaluate potential regressions of
> the original “sort-at-all-stages” approach of v1.
>
> > 9) Additional benefit
> > With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
> > two binary searchs rather than interating and copying to find the interval
> > to keep in the sorted commited.xip array. [1]
> >
> > Feedbacks welcomed.
> >
> > [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mai...
>
> V2 fixes several issues in v1, including a potential memory leak, type
> inconsistencies, and applies pgindent to the files.
>

V3 fixes a critical issue where the snapshot->xip array in
SnapBuildBuildSnapshot might not be sorted before reaching the
consistent state. Sorry for the noise here.

-- 
Best,
Xuneng


Attachments:

  [application/octet-stream] v3-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch (11.3K, 2-v3-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch)
  download | inline diff:
From a7c8d1da5d0496b2a7d05d52e37511a37075e13d Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 16 Dec 2025 15:15:14 +0800
Subject: [PATCH v3 1/2] Optimize SnapBuild by maintaining committed.xip in
 sorted order

Maintain committed.xip in xidComparator order using a two-phase approach
that balances performance during slot initialization with efficiency during
steady-state decoding:

Pre-CONSISTENT phase:
- Use fast O(1) append for unsorted XIDs
- Sort once at transition to CONSISTENT state
- Avoids merge overhead during initial snapshot building

Post-CONSISTENT phase:
- Use per-commit batch insertion with sorted merge
- Collect all relevant XIDs for each commit
- Sort the batch once: O(m log m)
- Reverse-merge into the global array: O(N + m)

With this change, snapshot builds can skip the qsort() step and simply
memcpy the sorted array. While per-commit work increases from O(m) (plain
append) to O(m log m + N) (sort-and-merge) after CONSISTENT, eliminating
repeated O(N log N) sorts at each snapshot build significantly reduces
overall steady-state cost once CONSISTENT is reached and snapshots are
built frequently.
---
 src/backend/replication/logical/snapbuild.c  | 164 +++++++++++++++++--
 src/include/replication/snapbuild_internal.h |  12 +-
 2 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d6ab1e017eb..e0d7d3e0ae8 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -407,8 +407,25 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 		   builder->committed.xip,
 		   builder->committed.xcnt * sizeof(TransactionId));
 
-	/* sort so we can bsearch() */
-	qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+	if (builder->state < SNAPBUILD_CONSISTENT)
+	{
+		/*
+		 * Pre-CONSISTENT: committed.xip is unsorted, sort the snapshot copy.
+		 */
+		if (snapshot->xcnt > 1)
+			qsort(snapshot->xip, snapshot->xcnt,
+				  sizeof(TransactionId), xidComparator);
+	}
+#ifdef USE_ASSERT_CHECKING
+	else
+	{
+		/*
+		 * Post-CONSISTENT: committed.xip should already be sorted, verify.
+		 */
+		for (int i = 1; i < snapshot->xcnt; i++)
+			Assert(snapshot->xip[i - 1] < snapshot->xip[i]);
+	}
+#endif
 
 	/*
 	 * Initially, subxip is empty, i.e. it's a snapshot to be used by
@@ -822,17 +839,33 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
 }
 
 /*
- * Keep track of a new catalog changing transaction that has committed.
+ * Keep track of new catalog changing transactions that have committed.
+ *
+ * Before reaching CONSISTENT state, we use fast O(1) append since the array
+ * doesn't need to be sorted yet.  After CONSISTENT, we maintain sorted order
+ * using a merge approach to avoid repeated full-array sorts at snapshot build.
  */
 static void
-SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
+SnapBuildAddCommittedTxns(SnapBuild *builder,
+						  const TransactionId *batch_xids,
+						  int batch_cnt)
 {
-	Assert(TransactionIdIsValid(xid));
+	TransactionId *committed_xids;
+	size_t		old_xcnt;
+
+	old_xcnt = builder->committed.xcnt;
 
-	if (builder->committed.xcnt == builder->committed.xcnt_space)
+	/* Ensure we have space for all elements */
+	if (old_xcnt + batch_cnt > builder->committed.xcnt_space)
 	{
 		builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
 
+		/*
+		 * Repeat if we need more than 2x current space.
+		 */
+		while (old_xcnt + batch_cnt > builder->committed.xcnt_space)
+			builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+
 		elog(DEBUG1, "increasing space for committed transactions to %u",
 			 (uint32) builder->committed.xcnt_space);
 
@@ -840,12 +873,57 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 										  builder->committed.xcnt_space * sizeof(TransactionId));
 	}
 
+	committed_xids = builder->committed.xip;
+
 	/*
-	 * TODO: It might make sense to keep the array sorted here instead of
-	 * doing it every time we build a new snapshot. On the other hand this
-	 * gets called repeatedly when a transaction with subtransactions commits.
+	 * Before CONSISTENT state, just append unsorted for O(1) performance. The
+	 * array will be sorted once when transitioning to CONSISTENT.
 	 */
-	builder->committed.xip[builder->committed.xcnt++] = xid;
+	if (builder->state < SNAPBUILD_CONSISTENT)
+	{
+		for (int i = 0; i < batch_cnt; i++)
+		{
+			Assert(TransactionIdIsValid(batch_xids[i]));
+			committed_xids[old_xcnt++] = batch_xids[i];
+		}
+		builder->committed.xcnt = old_xcnt;
+	}
+	else
+	{
+		/*
+		 * After CONSISTENT, maintain sorted order via merge.  Merge from the
+		 * end to avoid overwriting unread data.
+		 */
+		int			old_idx = old_xcnt - 1;
+		int			batch_idx = batch_cnt - 1;
+		int			write_idx = old_xcnt + batch_cnt - 1;
+
+		while (old_idx >= 0 && batch_idx >= 0)
+		{
+			Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+
+			if (committed_xids[old_idx] > batch_xids[batch_idx])
+			{
+				committed_xids[write_idx--] = committed_xids[old_idx--];
+			}
+			else
+			{
+				/* Duplicates should never occur */
+				Assert(committed_xids[old_idx] != batch_xids[batch_idx]);
+				committed_xids[write_idx--] = batch_xids[batch_idx--];
+			}
+		}
+
+		/* Copy any remaining batch elements */
+		while (batch_idx >= 0)
+		{
+			Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+			committed_xids[write_idx--] = batch_xids[batch_idx--];
+		}
+
+		/* Old elements that weren't moved are already in correct position */
+		builder->committed.xcnt = old_xcnt + batch_cnt;
+	}
 }
 
 /*
@@ -947,6 +1025,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 
 	TransactionId xmax = xid;
 
+	/*
+	 * Collect XIDs that need tracking into a batch.  We'll sort and merge
+	 * them into committed.xip in one pass at the end.
+	 */
+	TransactionId *batch_xids = NULL;
+	int			batch_cnt = 0;
+
 	/*
 	 * Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
 	 * will they be part of a snapshot.  So we don't need to record anything.
@@ -977,6 +1062,12 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		}
 	}
 
+	if (nsubxacts > 0 || builder->building_full_snapshot ||
+		SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
+	{
+		batch_xids = palloc((nsubxacts + 1) * sizeof(TransactionId));
+	}
+
 	for (nxact = 0; nxact < nsubxacts; nxact++)
 	{
 		TransactionId subxid = subxacts[nxact];
@@ -993,7 +1084,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			elog(DEBUG1, "found subtransaction %u:%u with catalog changes",
 				 xid, subxid);
 
-			SnapBuildAddCommittedTxn(builder, subxid);
+			batch_xids[batch_cnt++] = subxid;
 
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
@@ -1007,7 +1098,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 */
 		else if (needs_timetravel)
 		{
-			SnapBuildAddCommittedTxn(builder, subxid);
+			batch_xids[batch_cnt++] = subxid;
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
@@ -1020,7 +1111,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			 xid);
 		needs_snapshot = true;
 		needs_timetravel = true;
-		SnapBuildAddCommittedTxn(builder, xid);
+		batch_xids[batch_cnt++] = xid;
 	}
 	else if (sub_needs_timetravel)
 	{
@@ -1028,15 +1119,33 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
 			 xid);
 		needs_timetravel = true;
-		SnapBuildAddCommittedTxn(builder, xid);
+		batch_xids[batch_cnt++] = xid;
 	}
 	else if (needs_timetravel)
 	{
 		elog(DEBUG2, "forced transaction %u to do timetravel", xid);
 
-		SnapBuildAddCommittedTxn(builder, xid);
+		batch_xids[batch_cnt++] = xid;
+	}
+
+	/*
+	 * Sort and merge the batch into committed.xip.  This maintains the
+	 * invariant that committed.xip is globally sorted by raw uint32 order
+	 * (xidComparator).
+	 */
+	if (batch_cnt > 0)
+	{
+		/* Sort by raw uint32 order */
+		qsort(batch_xids, batch_cnt, sizeof(TransactionId), xidComparator);
+
+		/* Merge into global array */
+		SnapBuildAddCommittedTxns(builder, batch_xids, batch_cnt);
 	}
 
+	/* Free batch array if allocated (even if nothing was added to it) */
+	if (batch_xids != NULL)
+		pfree(batch_xids);
+
 	if (!needs_timetravel)
 	{
 		/* record that we cannot export a general snapshot anymore */
@@ -1305,6 +1414,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		Assert(TransactionIdIsNormal(builder->xmin));
 		Assert(TransactionIdIsNormal(builder->xmax));
 
+		/*
+		 * Sort committed.xip before transitioning to CONSISTENT.  During the
+		 * pre-CONSISTENT phase, XIDs were appended unsorted for performance.
+		 * Now we need sorted order for efficient snapshot building.
+		 */
+		if (builder->committed.xcnt > 1)
+			qsort(builder->committed.xip, builder->committed.xcnt,
+				  sizeof(TransactionId), xidComparator);
+
 		builder->state = SNAPBUILD_CONSISTENT;
 		builder->next_phase_at = InvalidTransactionId;
 
@@ -1402,6 +1520,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			 TransactionIdPrecedesOrEquals(builder->next_phase_at,
 										   running->oldestRunningXid))
 	{
+		/*
+		 * Sort committed.xip before transitioning to CONSISTENT.  During the
+		 * pre-CONSISTENT phase, XIDs were appended unsorted for performance.
+		 * Now we need sorted order for efficient snapshot building.
+		 */
+		if (builder->committed.xcnt > 1)
+			qsort(builder->committed.xip, builder->committed.xcnt,
+				  sizeof(TransactionId), xidComparator);
+
 		builder->state = SNAPBUILD_CONSISTENT;
 		builder->next_phase_at = InvalidTransactionId;
 
@@ -1889,6 +2016,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 		pfree(builder->committed.xip);
 		builder->committed.xcnt_space = ondisk.builder.committed.xcnt;
 		builder->committed.xip = ondisk.builder.committed.xip;
+
+		/*
+		 * Sort the restored array to ensure it's in xidComparator order. Old
+		 * snapshot files may have been written with unsorted arrays.
+		 */
+		qsort(builder->committed.xip, builder->committed.xcnt,
+			  sizeof(TransactionId), xidComparator);
 	}
 	ondisk.builder.committed.xip = NULL;
 
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..164160e5e5e 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -115,16 +115,8 @@ struct SnapBuild
 		/*
 		 * Array of committed transactions that have modified the catalog.
 		 *
-		 * As this array is frequently modified we do *not* keep it in
-		 * xidComparator order. Instead we sort the array when building &
-		 * distributing a snapshot.
-		 *
-		 * TODO: It's unclear whether that reasoning has much merit. Every
-		 * time we add something here after becoming consistent will also
-		 * require distributing a snapshot. Storing them sorted would
-		 * potentially also make it easier to purge (but more complicated wrt
-		 * wraparound?). Should be improved if sorting while building the
-		 * snapshot shows up in profiles.
+		 * Maintained in sorted order (by raw uint32 value) to allow efficient
+		 * snapshot building without repeated sorting overhead.
 		 */
 		TransactionId *xip;
 	}			committed;
-- 
2.51.0

view thread (5+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Optimize SnapBuild by maintaining committed.xip in sorted order
  In-Reply-To: <CABPTF7Vty_fsJLpj6=Xn5MM5PXiV_n=XeN87S+7JhSmROYwCAA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox