public inbox for [email protected]  
help / color / mirror / Atom feed
From: Xuneng Zhou <[email protected]>
To: pgsql-hackers <[email protected]>
Cc: Masahiko Sawada <[email protected]>
Subject: Optimize SnapBuild by maintaining committed.xip in sorted order
Date: Fri, 7 Nov 2025 12:55:57 +0800
Message-ID: <CABPTF7XiwbA38OZBj5Y2F-q+fZ=03YFN9iFnb_406F4SfE-f4w@mail.gmail.com> (raw)

Hi hackers,

I'd like to propose an optimization for logical decoding's snapshot building
mechanism that eliminates repeated sorting overhead once a replication slot
reaches the CONSISTENT state.

1) Problem

Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
every call using qsort(). Once a logical decoding slot reaches CONSISTENT
state, snapshots are built frequently—after every catalog-modifying transaction
commit. This repeated O(n log n) sorting becomes a performance bottleneck as
the committed transaction array grows.

The existing code even has a TODO comment acknowledging this inefficiency:
"TODO: It's unclear whether that reasoning has much merit. Every time we add
something here after becoming consistent will also require distributing a
snapshot."

2) Proposed Solution

Maintain sorted order via batch merge on each commit:

1. Collect all relevant XIDs (including subtransactions) into a batch array
2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
3. Merge into the global array: O(n + m) via reverse in-place merge

While per-commit cost increases from O(m) to O(m log m + n), this is offset by
eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
builds snapshots after each catalog-modifying commit, the amortized cost is
significantly lower.

3) Performance Impact

Expected improvements in CONSISTENT state:
- Eliminates O(n log n) qsort on every snapshot build
- Replaces with O(m log m + n) merge per commit
- For typical cases where m << n and snapshots are frequent, this is a win


The benchmark results for this patch are shown below.

4) Configuration

shared_buffers = '4GB'
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
log_min_messages = warning
max_connections = 600
autovacuum = off
checkpoint_timeout = 15min
max_wal_size = 4GB


5) Workloads

# Workload 1: DDL - High frequency, small commits
create_ddl_workload() {
  local ROOT="$1"
  cat >"$ROOT/ddl.sql" <<'SQL'
-- High-frequency small catalog commits
-- Each commit triggers SnapBuildBuildSnapshot
DO $$
DECLARE
  tbl text := format('temp_%s_%s',
                     current_setting('application_name', true),
                     floor(random()*1e9)::int);
BEGIN
  EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
DROP', tbl);
  EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
  EXECUTE format('SELECT * FROM %I', tbl);
END$$;
SQL
}

# Workload 2: MIXED - mix of DDL and DML, 50%-50%
create_mixed_workload() {
  local ROOT="$1"
  cat >"$ROOT/mixed_ddl.sql" <<'SQL'
-- DDL workload (catalog changes)
DO $$
DECLARE
  tbl text := format('t_%s_%s',
                     current_setting('application_name', true),
                     floor(random()*1e9)::int);
BEGIN
  EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
  EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
END$$;

SQL
  cat >"$ROOT/mixed_dml.sql" <<'SQL'
-- DML workload (no catalog changes)
INSERT INTO app_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}

# Workload 3: CONTROL - Pure DML, no catalog changes
create_control_workload() {
  local ROOT="$1"
  cat >"$ROOT/control.sql" <<'SQL'

-- Pure DML, no catalog changes
-- Should show no difference between baseline and patched
INSERT INTO control_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}


6) Test strategy

Clients: 100 concurrent connections per workload
Duration: 40 seconds per run
Repetitions: 1 run per workload type
Replication:  Logical replication slot using `test_decoding` plugin
with `EXPORT_SNAPSHOT=true`

Workload Types:
1. DDL - Pure catalog churn (temp table create/drop)
2. Mixed- 50% DDL + 50% DML (workload)
3. Control - Pure DML (no catalog changes)

Measurement:
- Metrics captured from pg_stat_replication_slots before/after each run
- Primary metrics: total_txns (transactions decoded) and total_bytes
(data volume)
- Compared baseline (vanilla PostgreSQL) vs patched (sorted
committed.xip optimization)


7) Performance results

DDL Workload: +235% Decoder Improvement
Decoder throughput:  713.76 → 2396.52 txns/sec  (+235%)
Throughput (MB/s):   672.67 → 1747.22 MB/s     (+159%)
Decode efficiency:   9.46% → 33.29%             (+23.83 points)

Mixed Workload: No Change
Decoder throughput:  2751.10 → 2730.00 txns/sec  (0%)
Decode efficiency:   40.47% → 40.47%             (unchanged)

We can see that the qsort overhead in SnapBuild has been eliminated in
the flamegraphs in the mixed workload. However, the performance
improvement was not observed as in the DDL workload. My guess is that,
for DML workloads, ReorderBufferApplyChange has become the new
hotspot.

DML Workload: No Change
Decoder throughput:  3062.57 → 3066.37 txns/sec  (0%)
Decode efficiency:   49.97% → 49.97%             (unchanged)

=== Workload: ddl ===
Client commits/sec:
  Baseline:  7545.76 commits/sec
  Patched:   7198.21 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  713.76 txns/sec  (672.67 MB/s)
  Patched:   2396.52 txns/sec  (1747.22 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  309376 committed  →  29264 decoded  (9.46%)
  Patched:   302325 committed  →  100654 decoded  (33.29%)

Total decoded (all reps):
  Baseline:  29264 txns  (27579.53 MB)
  Patched:   100654 txns  (73383.10 MB)

  Decoder improvement: +235.00% (txns/sec)
  Decoder improvement: +159.00% (MB/s)
  Efficiency improvement: +23.83% points (more transactions decoded
per committed)

=== Workload: mixed ===
Client commits/sec:
  Baseline:  6797.50 commits/sec
  Patched:   6745.26 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  2751.10 txns/sec  (210.35 MB/s)
  Patched:   2730.00 txns/sec  (205.64 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  285495 committed  →  115546 decoded  (40.47%)
  Patched:   283301 committed  →  114660 decoded  (40.47%)

Total decoded (all reps):
  Baseline:  115546 txns  (8834.71 MB)
  Patched:   114660 txns  (8636.96 MB)

  ≈  Decoder unchanged: 0.00% (txns/sec)

=== Workload: DML ===
Client commits/sec:
  Baseline:  6129.24 commits/sec
  Patched:   6136.93 commits/sec

Decoder throughput (from pg_stat_replication_slots):
  Baseline:  3062.57 txns/sec  (0.26 MB/s)
  Patched:   3066.37 txns/sec  (0.26 MB/s)

Transaction efficiency (decoded vs committed):
  Baseline:  257428 committed  →  128628 decoded  (49.97%)
  Patched:   251614 committed  →  125721 decoded  (49.97%)

Total decoded (all reps):
  Baseline:  128628 txns  (10.98 MB)
  Patched:   125721 txns  (10.69 MB)

  ≈  Decoder unchanged: 0.00% (txns/sec)

8) Potential regression

The potential regression point could be before the slot reaches the
CONSISTENT state, particularly when building_full_snapshot is set to
true. In this phase, all transactions including those that don’t
modify the catalog — must be added to the committed.xip array. These
XIDs don’t require later snapshot builds or sorting, so the
batch-insert logic increases the per-insert cost from O(1) to O(m + n)
without providing a direct benefit.

However, the impact of this regression could be limited. The system
remains in the pre-CONSISTENT phase only briefly during initial
snapshot building, and the building_full_snapshot = true case is rare,
mainly used when creating replication slots with the EXPORT_SNAPSHOT
option.

Once the slot becomes CONSISTENT, only catalog-modifying transactions
are tracked in committed.xip, and the patch reduces overall
snapshot-building overhead by eliminating repeated full-array sorts.

We could also adopt a two-phase approach — keeping the current
behavior before reaching the CONSISTENT state and maintaining a sorted
array only after that point. This would preserve the performance
benefits while avoiding potential regressions. However, it would
introduce additional complexity and potential risks in handling the
state transitions.

if (builder->state < SNAPBUILD_CONSISTENT)
{
/* ensure that only commits after this are getting replayed */
if (builder->start_decoding_at <= lsn)
builder->start_decoding_at = lsn + 1;

/*
* If building an exportable snapshot, force xid to be tracked, even
* if the transaction didn't modify the catalog.
*/
if (builder->building_full_snapshot)
{
needs_timetravel = true;
}
}

9) Additional benefit
With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
two binary searchs rather than interating and copying to find the interval
to keep in the sorted commited.xip array. [1]

Feedbacks welcomed.

[1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mai...

Best,
Xuneng


Attachments:

  [application/octet-stream] v1-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch (7.9K, 2-v1-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch)
  download | inline diff:
From 14e2ee95a153726f88649f545c3606a85e848be3 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Thu, 30 Oct 2025 17:57:45 +0800
Subject: [PATCH v1] Optimize SnapBuild by maintaining committed.xip in sorted order
We now maintain committed.xip in xidComparator order using a per-commit batch insertion approach:

Collect all relevant XIDs for each commit
Sort the batch once: O(m log m)
Reverse-merge it into the global array: O(N + m)

With this change, snapshot builds can skip the qsort() step and simply memcpy the sorted array.
While per-commit work increases from O(m) (plain append) to O(m log m + N) (sort-and-merge), 
eliminating repeated O(N log N) sorts at each snapshot build significantly reduces overall steady-state 
cost once CONSISTENT is reached and snapshots are built frequently.
---
 src/backend/replication/logical/snapbuild.c  | 97 +++++++++++++++++---
 src/include/replication/snapbuild_internal.h | 12 +--
 2 files changed, 85 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 98ddee20929..500805f43dc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -407,8 +407,10 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 		   builder->committed.xip,
 		   builder->committed.xcnt * sizeof(TransactionId));
 
-	/* sort so we can bsearch() */
-	qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 1; i < snapshot->xcnt; i++)
+		Assert(snapshot->xip[i - 1] <= snapshot->xip[i]);
+#endif
 
 	/*
 	 * Initially, subxip is empty, i.e. it's a snapshot to be used by
@@ -826,14 +828,29 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
  * Keep track of a new catalog changing transaction that has committed.
  */
 static void
-SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
+SnapBuildAddCommittedTxns(SnapBuild *builder,
+						  const TransactionId *batch_xids,
+						  int batch_cnt)
 {
-	Assert(TransactionIdIsValid(xid));
+	TransactionId *committed_xids;
+	int			old_xcnt;
+	int			old_idx;		/* read position in old array */
+	int			batch_idx;		/* read position in batch */
+	int			write_idx;		/* write position (from end) */
 
-	if (builder->committed.xcnt == builder->committed.xcnt_space)
+	old_xcnt = builder->committed.xcnt;
+
+	/* Ensure we have space for all elements */
+	if (old_xcnt + batch_cnt > builder->committed.xcnt_space)
 	{
 		builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
 
+		/*
+		 * Repeat if we need more than 2x current space.
+		 */
+		while (old_xcnt + batch_cnt > builder->committed.xcnt_space)
+			builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+
 		elog(DEBUG1, "increasing space for committed transactions to %u",
 			 (uint32) builder->committed.xcnt_space);
 
@@ -841,12 +858,35 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
 										  builder->committed.xcnt_space * sizeof(TransactionId));
 	}
 
+	committed_xids = builder->committed.xip;
+
 	/*
-	 * TODO: It might make sense to keep the array sorted here instead of
-	 * doing it every time we build a new snapshot. On the other hand this
-	 * gets called repeatedly when a transaction with subtransactions commits.
+	 * Merge from the end to avoid overwriting unread data.  We write the
+	 * merged result starting from position (old_xcnt + batch_cnt - 1) and
+	 * work backwards.
 	 */
-	builder->committed.xip[builder->committed.xcnt++] = xid;
+	old_idx = old_xcnt - 1;
+	batch_idx = batch_cnt - 1;
+	write_idx = old_xcnt + batch_cnt - 1;
+
+	while (old_idx >= 0 && batch_idx >= 0)
+	{
+		if (committed_xids[old_idx] > batch_xids[batch_idx])
+			committed_xids[write_idx--] = committed_xids[old_idx--];
+		else
+		{
+			Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+			committed_xids[write_idx--] = batch_xids[batch_idx--];
+		}
+	}
+
+	/* Copy any remaining batch elements */
+	while (batch_idx >= 0)
+		committed_xids[write_idx--] = batch_xids[batch_idx--];
+
+	/* Old elements that weren't moved are already in the correct position */
+
+	builder->committed.xcnt = old_xcnt + batch_cnt;
 }
 
 /*
@@ -948,6 +988,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 
 	TransactionId xmax = xid;
 
+	/*
+	 * Collect XIDs that need tracking into a batch.  We'll sort and merge
+	 * them into committed.xip in one pass at the end.
+	 */
+	 TransactionId *batch_xids = NULL;
+	 int		batch_cnt = 0;
+
 	/*
 	 * Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
 	 * will they be part of a snapshot.  So we don't need to record anything.
@@ -978,6 +1025,12 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		}
 	}
 
+	if (nsubxacts > 0 || builder->building_full_snapshot ||
+		SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
+	{
+		batch_xids = palloc((nsubxacts + 1) * sizeof(TransactionId));
+	}
+
 	for (nxact = 0; nxact < nsubxacts; nxact++)
 	{
 		TransactionId subxid = subxacts[nxact];
@@ -994,7 +1047,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			elog(DEBUG1, "found subtransaction %u:%u with catalog changes",
 				 xid, subxid);
 
-			SnapBuildAddCommittedTxn(builder, subxid);
+			batch_xids[batch_cnt++] = subxid;
 
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
@@ -1008,7 +1061,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		 */
 		else if (needs_timetravel)
 		{
-			SnapBuildAddCommittedTxn(builder, subxid);
+			batch_xids[batch_cnt++] = subxid;
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
@@ -1021,7 +1074,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			 xid);
 		needs_snapshot = true;
 		needs_timetravel = true;
-		SnapBuildAddCommittedTxn(builder, xid);
+		batch_xids[batch_cnt++] = xid;
 	}
 	else if (sub_needs_timetravel)
 	{
@@ -1029,13 +1082,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
 			 xid);
 		needs_timetravel = true;
-		SnapBuildAddCommittedTxn(builder, xid);
+		batch_xids[batch_cnt++] = xid;
 	}
 	else if (needs_timetravel)
 	{
 		elog(DEBUG2, "forced transaction %u to do timetravel", xid);
 
-		SnapBuildAddCommittedTxn(builder, xid);
+		batch_xids[batch_cnt++] = xid;
+	}
+
+	/*
+	 * Sort and merge the batch into committed.xip.  This maintains the
+	 * invariant that committed.xip is globally sorted by raw uint32 order
+	 * (xidComparator).
+	 */
+	if (batch_cnt > 0)
+	{
+		/* Sort by raw uint32 order */
+		qsort(batch_xids, batch_cnt, sizeof(TransactionId), xidComparator);
+
+		/* Merge into global array */
+		SnapBuildAddCommittedTxns(builder, batch_xids, batch_cnt);
+
+		pfree(batch_xids);
 	}
 
 	if (!needs_timetravel)
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..164160e5e5e 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -115,16 +115,8 @@ struct SnapBuild
 		/*
 		 * Array of committed transactions that have modified the catalog.
 		 *
-		 * As this array is frequently modified we do *not* keep it in
-		 * xidComparator order. Instead we sort the array when building &
-		 * distributing a snapshot.
-		 *
-		 * TODO: It's unclear whether that reasoning has much merit. Every
-		 * time we add something here after becoming consistent will also
-		 * require distributing a snapshot. Storing them sorted would
-		 * potentially also make it easier to purge (but more complicated wrt
-		 * wraparound?). Should be improved if sorting while building the
-		 * snapshot shows up in profiles.
+		 * Maintained in sorted order (by raw uint32 value) to allow efficient
+		 * snapshot building without repeated sorting overhead.
 		 */
 		TransactionId *xip;
 	}			committed;
-- 
2.51.0



  [application/zip] flamegraphs.zip (217.1K, 3-flamegraphs.zip)
  download

view thread (5+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Optimize SnapBuild by maintaining committed.xip in sorted order
  In-Reply-To: <CABPTF7XiwbA38OZBj5Y2F-q+fZ=03YFN9iFnb_406F4SfE-f4w@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox