public inbox for [email protected]
help / color / mirror / Atom feedOptimize SnapBuild by maintaining committed.xip in sorted order
5+ messages / 2 participants
[nested] [flat]
* Optimize SnapBuild by maintaining committed.xip in sorted order
@ 2025-11-07 04:55 Xuneng Zhou <[email protected]>
0 siblings, 1 reply; 5+ messages in thread
From: Xuneng Zhou @ 2025-11-07 04:55 UTC (permalink / raw)
To: pgsql-hackers <[email protected]>; +Cc: Masahiko Sawada <[email protected]>
Hi hackers,
I'd like to propose an optimization for logical decoding's snapshot building
mechanism that eliminates repeated sorting overhead once a replication slot
reaches the CONSISTENT state.
1) Problem
Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
every call using qsort(). Once a logical decoding slot reaches CONSISTENT
state, snapshots are built frequently—after every catalog-modifying transaction
commit. This repeated O(n log n) sorting becomes a performance bottleneck as
the committed transaction array grows.
The existing code even has a TODO comment acknowledging this inefficiency:
"TODO: It's unclear whether that reasoning has much merit. Every time we add
something here after becoming consistent will also require distributing a
snapshot."
2) Proposed Solution
Maintain sorted order via batch merge on each commit:
1. Collect all relevant XIDs (including subtransactions) into a batch array
2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
3. Merge into the global array: O(n + m) via reverse in-place merge
While per-commit cost increases from O(m) to O(m log m + n), this is offset by
eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
builds snapshots after each catalog-modifying commit, the amortized cost is
significantly lower.
3) Performance Impact
Expected improvements in CONSISTENT state:
- Eliminates O(n log n) qsort on every snapshot build
- Replaces with O(m log m + n) merge per commit
- For typical cases where m << n and snapshots are frequent, this is a win
The benchmark results for this patch are shown below.
4) Configuration
shared_buffers = '4GB'
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
log_min_messages = warning
max_connections = 600
autovacuum = off
checkpoint_timeout = 15min
max_wal_size = 4GB
5) Workloads
# Workload 1: DDL - High frequency, small commits
create_ddl_workload() {
local ROOT="$1"
cat >"$ROOT/ddl.sql" <<'SQL'
-- High-frequency small catalog commits
-- Each commit triggers SnapBuildBuildSnapshot
DO $$
DECLARE
tbl text := format('temp_%s_%s',
current_setting('application_name', true),
floor(random()*1e9)::int);
BEGIN
EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
DROP', tbl);
EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
EXECUTE format('SELECT * FROM %I', tbl);
END$$;
SQL
}
# Workload 2: MIXED - mix of DDL and DML, 50%-50%
create_mixed_workload() {
local ROOT="$1"
cat >"$ROOT/mixed_ddl.sql" <<'SQL'
-- DDL workload (catalog changes)
DO $$
DECLARE
tbl text := format('t_%s_%s',
current_setting('application_name', true),
floor(random()*1e9)::int);
BEGIN
EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
END$$;
SQL
cat >"$ROOT/mixed_dml.sql" <<'SQL'
-- DML workload (no catalog changes)
INSERT INTO app_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}
# Workload 3: CONTROL - Pure DML, no catalog changes
create_control_workload() {
local ROOT="$1"
cat >"$ROOT/control.sql" <<'SQL'
-- Pure DML, no catalog changes
-- Should show no difference between baseline and patched
INSERT INTO control_data (id, data)
VALUES (floor(random()*1e6)::int, repeat('x', 100))
ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
SQL
}
6) Test strategy
Clients: 100 concurrent connections per workload
Duration: 40 seconds per run
Repetitions: 1 run per workload type
Replication: Logical replication slot using `test_decoding` plugin
with `EXPORT_SNAPSHOT=true`
Workload Types:
1. DDL - Pure catalog churn (temp table create/drop)
2. Mixed- 50% DDL + 50% DML (workload)
3. Control - Pure DML (no catalog changes)
Measurement:
- Metrics captured from pg_stat_replication_slots before/after each run
- Primary metrics: total_txns (transactions decoded) and total_bytes
(data volume)
- Compared baseline (vanilla PostgreSQL) vs patched (sorted
committed.xip optimization)
7) Performance results
DDL Workload: +235% Decoder Improvement
Decoder throughput: 713.76 → 2396.52 txns/sec (+235%)
Throughput (MB/s): 672.67 → 1747.22 MB/s (+159%)
Decode efficiency: 9.46% → 33.29% (+23.83 points)
Mixed Workload: No Change
Decoder throughput: 2751.10 → 2730.00 txns/sec (0%)
Decode efficiency: 40.47% → 40.47% (unchanged)
We can see that the qsort overhead in SnapBuild has been eliminated in
the flamegraphs in the mixed workload. However, the performance
improvement was not observed as in the DDL workload. My guess is that,
for DML workloads, ReorderBufferApplyChange has become the new
hotspot.
DML Workload: No Change
Decoder throughput: 3062.57 → 3066.37 txns/sec (0%)
Decode efficiency: 49.97% → 49.97% (unchanged)
=== Workload: ddl ===
Client commits/sec:
Baseline: 7545.76 commits/sec
Patched: 7198.21 commits/sec
Decoder throughput (from pg_stat_replication_slots):
Baseline: 713.76 txns/sec (672.67 MB/s)
Patched: 2396.52 txns/sec (1747.22 MB/s)
Transaction efficiency (decoded vs committed):
Baseline: 309376 committed → 29264 decoded (9.46%)
Patched: 302325 committed → 100654 decoded (33.29%)
Total decoded (all reps):
Baseline: 29264 txns (27579.53 MB)
Patched: 100654 txns (73383.10 MB)
Decoder improvement: +235.00% (txns/sec)
Decoder improvement: +159.00% (MB/s)
Efficiency improvement: +23.83% points (more transactions decoded
per committed)
=== Workload: mixed ===
Client commits/sec:
Baseline: 6797.50 commits/sec
Patched: 6745.26 commits/sec
Decoder throughput (from pg_stat_replication_slots):
Baseline: 2751.10 txns/sec (210.35 MB/s)
Patched: 2730.00 txns/sec (205.64 MB/s)
Transaction efficiency (decoded vs committed):
Baseline: 285495 committed → 115546 decoded (40.47%)
Patched: 283301 committed → 114660 decoded (40.47%)
Total decoded (all reps):
Baseline: 115546 txns (8834.71 MB)
Patched: 114660 txns (8636.96 MB)
≈ Decoder unchanged: 0.00% (txns/sec)
=== Workload: DML ===
Client commits/sec:
Baseline: 6129.24 commits/sec
Patched: 6136.93 commits/sec
Decoder throughput (from pg_stat_replication_slots):
Baseline: 3062.57 txns/sec (0.26 MB/s)
Patched: 3066.37 txns/sec (0.26 MB/s)
Transaction efficiency (decoded vs committed):
Baseline: 257428 committed → 128628 decoded (49.97%)
Patched: 251614 committed → 125721 decoded (49.97%)
Total decoded (all reps):
Baseline: 128628 txns (10.98 MB)
Patched: 125721 txns (10.69 MB)
≈ Decoder unchanged: 0.00% (txns/sec)
8) Potential regression
The potential regression point could be before the slot reaches the
CONSISTENT state, particularly when building_full_snapshot is set to
true. In this phase, all transactions including those that don’t
modify the catalog — must be added to the committed.xip array. These
XIDs don’t require later snapshot builds or sorting, so the
batch-insert logic increases the per-insert cost from O(1) to O(m + n)
without providing a direct benefit.
However, the impact of this regression could be limited. The system
remains in the pre-CONSISTENT phase only briefly during initial
snapshot building, and the building_full_snapshot = true case is rare,
mainly used when creating replication slots with the EXPORT_SNAPSHOT
option.
Once the slot becomes CONSISTENT, only catalog-modifying transactions
are tracked in committed.xip, and the patch reduces overall
snapshot-building overhead by eliminating repeated full-array sorts.
We could also adopt a two-phase approach — keeping the current
behavior before reaching the CONSISTENT state and maintaining a sorted
array only after that point. This would preserve the performance
benefits while avoiding potential regressions. However, it would
introduce additional complexity and potential risks in handling the
state transitions.
if (builder->state < SNAPBUILD_CONSISTENT)
{
/* ensure that only commits after this are getting replayed */
if (builder->start_decoding_at <= lsn)
builder->start_decoding_at = lsn + 1;
/*
* If building an exportable snapshot, force xid to be tracked, even
* if the transaction didn't modify the catalog.
*/
if (builder->building_full_snapshot)
{
needs_timetravel = true;
}
}
9) Additional benefit
With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
two binary searchs rather than interating and copying to find the interval
to keep in the sorted commited.xip array. [1]
Feedbacks welcomed.
[1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mai...
Best,
Xuneng
Attachments:
[application/octet-stream] v1-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch (7.9K, 2-v1-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch)
download | inline diff:
From 14e2ee95a153726f88649f545c3606a85e848be3 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Thu, 30 Oct 2025 17:57:45 +0800
Subject: [PATCH v1] Optimize SnapBuild by maintaining committed.xip in sorted order
We now maintain committed.xip in xidComparator order using a per-commit batch insertion approach:
Collect all relevant XIDs for each commit
Sort the batch once: O(m log m)
Reverse-merge it into the global array: O(N + m)
With this change, snapshot builds can skip the qsort() step and simply memcpy the sorted array.
While per-commit work increases from O(m) (plain append) to O(m log m + N) (sort-and-merge),
eliminating repeated O(N log N) sorts at each snapshot build significantly reduces overall steady-state
cost once CONSISTENT is reached and snapshots are built frequently.
---
src/backend/replication/logical/snapbuild.c | 97 +++++++++++++++++---
src/include/replication/snapbuild_internal.h | 12 +--
2 files changed, 85 insertions(+), 24 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 98ddee20929..500805f43dc 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -407,8 +407,10 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
builder->committed.xip,
builder->committed.xcnt * sizeof(TransactionId));
- /* sort so we can bsearch() */
- qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+#ifdef USE_ASSERT_CHECKING
+ for (int i = 1; i < snapshot->xcnt; i++)
+ Assert(snapshot->xip[i - 1] <= snapshot->xip[i]);
+#endif
/*
* Initially, subxip is empty, i.e. it's a snapshot to be used by
@@ -826,14 +828,29 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
* Keep track of a new catalog changing transaction that has committed.
*/
static void
-SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
+SnapBuildAddCommittedTxns(SnapBuild *builder,
+ const TransactionId *batch_xids,
+ int batch_cnt)
{
- Assert(TransactionIdIsValid(xid));
+ TransactionId *committed_xids;
+ int old_xcnt;
+ int old_idx; /* read position in old array */
+ int batch_idx; /* read position in batch */
+ int write_idx; /* write position (from end) */
- if (builder->committed.xcnt == builder->committed.xcnt_space)
+ old_xcnt = builder->committed.xcnt;
+
+ /* Ensure we have space for all elements */
+ if (old_xcnt + batch_cnt > builder->committed.xcnt_space)
{
builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+ /*
+ * Repeat if we need more than 2x current space.
+ */
+ while (old_xcnt + batch_cnt > builder->committed.xcnt_space)
+ builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+
elog(DEBUG1, "increasing space for committed transactions to %u",
(uint32) builder->committed.xcnt_space);
@@ -841,12 +858,35 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
builder->committed.xcnt_space * sizeof(TransactionId));
}
+ committed_xids = builder->committed.xip;
+
/*
- * TODO: It might make sense to keep the array sorted here instead of
- * doing it every time we build a new snapshot. On the other hand this
- * gets called repeatedly when a transaction with subtransactions commits.
+ * Merge from the end to avoid overwriting unread data. We write the
+ * merged result starting from position (old_xcnt + batch_cnt - 1) and
+ * work backwards.
*/
- builder->committed.xip[builder->committed.xcnt++] = xid;
+ old_idx = old_xcnt - 1;
+ batch_idx = batch_cnt - 1;
+ write_idx = old_xcnt + batch_cnt - 1;
+
+ while (old_idx >= 0 && batch_idx >= 0)
+ {
+ if (committed_xids[old_idx] > batch_xids[batch_idx])
+ committed_xids[write_idx--] = committed_xids[old_idx--];
+ else
+ {
+ Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+ committed_xids[write_idx--] = batch_xids[batch_idx--];
+ }
+ }
+
+ /* Copy any remaining batch elements */
+ while (batch_idx >= 0)
+ committed_xids[write_idx--] = batch_xids[batch_idx--];
+
+ /* Old elements that weren't moved are already in the correct position */
+
+ builder->committed.xcnt = old_xcnt + batch_cnt;
}
/*
@@ -948,6 +988,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
TransactionId xmax = xid;
+ /*
+ * Collect XIDs that need tracking into a batch. We'll sort and merge
+ * them into committed.xip in one pass at the end.
+ */
+ TransactionId *batch_xids = NULL;
+ int batch_cnt = 0;
+
/*
* Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
* will they be part of a snapshot. So we don't need to record anything.
@@ -978,6 +1025,12 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+ if (nsubxacts > 0 || builder->building_full_snapshot ||
+ SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
+ {
+ batch_xids = palloc((nsubxacts + 1) * sizeof(TransactionId));
+ }
+
for (nxact = 0; nxact < nsubxacts; nxact++)
{
TransactionId subxid = subxacts[nxact];
@@ -994,7 +1047,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
elog(DEBUG1, "found subtransaction %u:%u with catalog changes",
xid, subxid);
- SnapBuildAddCommittedTxn(builder, subxid);
+ batch_xids[batch_cnt++] = subxid;
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
@@ -1008,7 +1061,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
*/
else if (needs_timetravel)
{
- SnapBuildAddCommittedTxn(builder, subxid);
+ batch_xids[batch_cnt++] = subxid;
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
}
@@ -1021,7 +1074,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
xid);
needs_snapshot = true;
needs_timetravel = true;
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
else if (sub_needs_timetravel)
{
@@ -1029,13 +1082,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
xid);
needs_timetravel = true;
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
else if (needs_timetravel)
{
elog(DEBUG2, "forced transaction %u to do timetravel", xid);
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
+ }
+
+ /*
+ * Sort and merge the batch into committed.xip. This maintains the
+ * invariant that committed.xip is globally sorted by raw uint32 order
+ * (xidComparator).
+ */
+ if (batch_cnt > 0)
+ {
+ /* Sort by raw uint32 order */
+ qsort(batch_xids, batch_cnt, sizeof(TransactionId), xidComparator);
+
+ /* Merge into global array */
+ SnapBuildAddCommittedTxns(builder, batch_xids, batch_cnt);
+
+ pfree(batch_xids);
}
if (!needs_timetravel)
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..164160e5e5e 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -115,16 +115,8 @@ struct SnapBuild
/*
* Array of committed transactions that have modified the catalog.
*
- * As this array is frequently modified we do *not* keep it in
- * xidComparator order. Instead we sort the array when building &
- * distributing a snapshot.
- *
- * TODO: It's unclear whether that reasoning has much merit. Every
- * time we add something here after becoming consistent will also
- * require distributing a snapshot. Storing them sorted would
- * potentially also make it easier to purge (but more complicated wrt
- * wraparound?). Should be improved if sorting while building the
- * snapshot shows up in profiles.
+ * Maintained in sorted order (by raw uint32 value) to allow efficient
+ * snapshot building without repeated sorting overhead.
*/
TransactionId *xip;
} committed;
--
2.51.0
[application/zip] flamegraphs.zip (217.1K, 3-flamegraphs.zip)
download
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Optimize SnapBuild by maintaining committed.xip in sorted order
@ 2025-12-16 10:20 Xuneng Zhou <[email protected]>
parent: Xuneng Zhou <[email protected]>
0 siblings, 1 reply; 5+ messages in thread
From: Xuneng Zhou @ 2025-12-16 10:20 UTC (permalink / raw)
To: pgsql-hackers <[email protected]>; +Cc: Masahiko Sawada <[email protected]>
Hi,
On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi hackers,
>
> I'd like to propose an optimization for logical decoding's snapshot building
> mechanism that eliminates repeated sorting overhead once a replication slot
> reaches the CONSISTENT state.
>
> 1) Problem
>
> Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
> every call using qsort(). Once a logical decoding slot reaches CONSISTENT
> state, snapshots are built frequently—after every catalog-modifying transaction
> commit. This repeated O(n log n) sorting becomes a performance bottleneck as
> the committed transaction array grows.
>
> The existing code even has a TODO comment acknowledging this inefficiency:
> "TODO: It's unclear whether that reasoning has much merit. Every time we add
> something here after becoming consistent will also require distributing a
> snapshot."
>
> 2) Proposed Solution
>
> Maintain sorted order via batch merge on each commit:
>
> 1. Collect all relevant XIDs (including subtransactions) into a batch array
> 2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
> 3. Merge into the global array: O(n + m) via reverse in-place merge
>
> While per-commit cost increases from O(m) to O(m log m + n), this is offset by
> eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
> builds snapshots after each catalog-modifying commit, the amortized cost is
> significantly lower.
>
> 3) Performance Impact
>
> Expected improvements in CONSISTENT state:
> - Eliminates O(n log n) qsort on every snapshot build
> - Replaces with O(m log m + n) merge per commit
> - For typical cases where m << n and snapshots are frequent, this is a win
>
>
> The benchmark results for this patch are shown below.
>
> 4) Configuration
>
> shared_buffers = '4GB'
> wal_level = logical
> max_replication_slots = 10
> max_wal_senders = 10
> log_min_messages = warning
> max_connections = 600
> autovacuum = off
> checkpoint_timeout = 15min
> max_wal_size = 4GB
>
>
> 5) Workloads
>
> # Workload 1: DDL - High frequency, small commits
> create_ddl_workload() {
> local ROOT="$1"
> cat >"$ROOT/ddl.sql" <<'SQL'
> -- High-frequency small catalog commits
> -- Each commit triggers SnapBuildBuildSnapshot
> DO $$
> DECLARE
> tbl text := format('temp_%s_%s',
> current_setting('application_name', true),
> floor(random()*1e9)::int);
> BEGIN
> EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
> DROP', tbl);
> EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> EXECUTE format('SELECT * FROM %I', tbl);
> END$$;
> SQL
> }
>
> # Workload 2: MIXED - mix of DDL and DML, 50%-50%
> create_mixed_workload() {
> local ROOT="$1"
> cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> -- DDL workload (catalog changes)
> DO $$
> DECLARE
> tbl text := format('t_%s_%s',
> current_setting('application_name', true),
> floor(random()*1e9)::int);
> BEGIN
> EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
> EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> END$$;
>
> SQL
> cat >"$ROOT/mixed_dml.sql" <<'SQL'
> -- DML workload (no catalog changes)
> INSERT INTO app_data (id, data)
> VALUES (floor(random()*1e6)::int, repeat('x', 100))
> ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> SQL
> }
>
> # Workload 3: CONTROL - Pure DML, no catalog changes
> create_control_workload() {
> local ROOT="$1"
> cat >"$ROOT/control.sql" <<'SQL'
>
> -- Pure DML, no catalog changes
> -- Should show no difference between baseline and patched
> INSERT INTO control_data (id, data)
> VALUES (floor(random()*1e6)::int, repeat('x', 100))
> ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> SQL
> }
>
>
> 6) Test strategy
>
> Clients: 100 concurrent connections per workload
> Duration: 40 seconds per run
> Repetitions: 1 run per workload type
> Replication: Logical replication slot using `test_decoding` plugin
> with `EXPORT_SNAPSHOT=true`
>
> Workload Types:
> 1. DDL - Pure catalog churn (temp table create/drop)
> 2. Mixed- 50% DDL + 50% DML (workload)
> 3. Control - Pure DML (no catalog changes)
>
> Measurement:
> - Metrics captured from pg_stat_replication_slots before/after each run
> - Primary metrics: total_txns (transactions decoded) and total_bytes
> (data volume)
> - Compared baseline (vanilla PostgreSQL) vs patched (sorted
> committed.xip optimization)
>
>
> 7) Performance results
>
> DDL Workload: +235% Decoder Improvement
> Decoder throughput: 713.76 → 2396.52 txns/sec (+235%)
> Throughput (MB/s): 672.67 → 1747.22 MB/s (+159%)
> Decode efficiency: 9.46% → 33.29% (+23.83 points)
>
> Mixed Workload: No Change
> Decoder throughput: 2751.10 → 2730.00 txns/sec (0%)
> Decode efficiency: 40.47% → 40.47% (unchanged)
>
> We can see that the qsort overhead in SnapBuild has been eliminated in
> the flamegraphs in the mixed workload. However, the performance
> improvement was not observed as in the DDL workload. My guess is that,
> for DML workloads, ReorderBufferApplyChange has become the new
> hotspot.
>
> DML Workload: No Change
> Decoder throughput: 3062.57 → 3066.37 txns/sec (0%)
> Decode efficiency: 49.97% → 49.97% (unchanged)
>
> === Workload: ddl ===
> Client commits/sec:
> Baseline: 7545.76 commits/sec
> Patched: 7198.21 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
> Baseline: 713.76 txns/sec (672.67 MB/s)
> Patched: 2396.52 txns/sec (1747.22 MB/s)
>
> Transaction efficiency (decoded vs committed):
> Baseline: 309376 committed → 29264 decoded (9.46%)
> Patched: 302325 committed → 100654 decoded (33.29%)
>
> Total decoded (all reps):
> Baseline: 29264 txns (27579.53 MB)
> Patched: 100654 txns (73383.10 MB)
>
> Decoder improvement: +235.00% (txns/sec)
> Decoder improvement: +159.00% (MB/s)
> Efficiency improvement: +23.83% points (more transactions decoded
> per committed)
>
> === Workload: mixed ===
> Client commits/sec:
> Baseline: 6797.50 commits/sec
> Patched: 6745.26 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
> Baseline: 2751.10 txns/sec (210.35 MB/s)
> Patched: 2730.00 txns/sec (205.64 MB/s)
>
> Transaction efficiency (decoded vs committed):
> Baseline: 285495 committed → 115546 decoded (40.47%)
> Patched: 283301 committed → 114660 decoded (40.47%)
>
> Total decoded (all reps):
> Baseline: 115546 txns (8834.71 MB)
> Patched: 114660 txns (8636.96 MB)
>
> ≈ Decoder unchanged: 0.00% (txns/sec)
>
> === Workload: DML ===
> Client commits/sec:
> Baseline: 6129.24 commits/sec
> Patched: 6136.93 commits/sec
>
> Decoder throughput (from pg_stat_replication_slots):
> Baseline: 3062.57 txns/sec (0.26 MB/s)
> Patched: 3066.37 txns/sec (0.26 MB/s)
>
> Transaction efficiency (decoded vs committed):
> Baseline: 257428 committed → 128628 decoded (49.97%)
> Patched: 251614 committed → 125721 decoded (49.97%)
>
> Total decoded (all reps):
> Baseline: 128628 txns (10.98 MB)
> Patched: 125721 txns (10.69 MB)
>
> ≈ Decoder unchanged: 0.00% (txns/sec)
>
> 8) Potential regression
>
> The potential regression point could be before the slot reaches the
> CONSISTENT state, particularly when building_full_snapshot is set to
> true. In this phase, all transactions including those that don’t
> modify the catalog — must be added to the committed.xip array. These
> XIDs don’t require later snapshot builds or sorting, so the
> batch-insert logic increases the per-insert cost from O(1) to O(m + n)
> without providing a direct benefit.
>
> However, the impact of this regression could be limited. The system
> remains in the pre-CONSISTENT phase only briefly during initial
> snapshot building, and the building_full_snapshot = true case is rare,
> mainly used when creating replication slots with the EXPORT_SNAPSHOT
> option.
>
> Once the slot becomes CONSISTENT, only catalog-modifying transactions
> are tracked in committed.xip, and the patch reduces overall
> snapshot-building overhead by eliminating repeated full-array sorts.
>
> We could also adopt a two-phase approach — keeping the current
> behavior before reaching the CONSISTENT state and maintaining a sorted
> array only after that point. This would preserve the performance
> benefits while avoiding potential regressions. However, it would
> introduce additional complexity and potential risks in handling the
> state transitions.
>
> if (builder->state < SNAPBUILD_CONSISTENT)
> {
> /* ensure that only commits after this are getting replayed */
> if (builder->start_decoding_at <= lsn)
> builder->start_decoding_at = lsn + 1;
>
> /*
> * If building an exportable snapshot, force xid to be tracked, even
> * if the transaction didn't modify the catalog.
> */
> if (builder->building_full_snapshot)
> {
> needs_timetravel = true;
> }
> }
After some consideration, a two-phase sorting strategy seems feasible
to implement in a relatively straightforward manner. So it's done in
v2. I also plan to run benchmarks to evaluate potential regressions of
the original “sort-at-all-stages” approach of v1.
> 9) Additional benefit
> With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
> two binary searchs rather than interating and copying to find the interval
> to keep in the sorted commited.xip array. [1]
>
> Feedbacks welcomed.
>
> [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mai...
V2 fixes several issues in v1, including a potential memory leak, type
inconsistencies, and applies pgindent to the files.
--
Best,
Xuneng
Attachments:
[application/octet-stream] v2-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch (11.0K, 2-v2-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch)
download | inline diff:
From 1a543fc11e981712044d577458ca8b53300e520d Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 16 Dec 2025 15:15:14 +0800
Subject: [PATCH v2 1/2] Optimize SnapBuild by maintaining committed.xip in
sorted order
Maintain committed.xip in xidComparator order using a two-phase approach
that balances performance during slot initialization with efficiency during
steady-state decoding:
Pre-CONSISTENT phase:
- Use fast O(1) append for unsorted XIDs
- Sort once at transition to CONSISTENT state
- Avoids merge overhead during initial snapshot building
Post-CONSISTENT phase:
- Use per-commit batch insertion with sorted merge
- Collect all relevant XIDs for each commit
- Sort the batch once: O(m log m)
- Reverse-merge into the global array: O(N + m)
With this change, snapshot builds can skip the qsort() step and simply
memcpy the sorted array. While per-commit work increases from O(m) (plain
append) to O(m log m + N) (sort-and-merge) after CONSISTENT, eliminating
repeated O(N log N) sorts at each snapshot build significantly reduces
overall steady-state cost once CONSISTENT is reached and snapshots are
built frequently.
---
src/backend/replication/logical/snapbuild.c | 150 +++++++++++++++++--
src/include/replication/snapbuild_internal.h | 12 +-
2 files changed, 137 insertions(+), 25 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d6ab1e017eb..2031ef911bf 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -407,8 +407,11 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
builder->committed.xip,
builder->committed.xcnt * sizeof(TransactionId));
- /* sort so we can bsearch() */
- qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+#ifdef USE_ASSERT_CHECKING
+ /* Verify array is strictly sorted */
+ for (int i = 1; i < snapshot->xcnt; i++)
+ Assert(snapshot->xip[i - 1] < snapshot->xip[i]);
+#endif
/*
* Initially, subxip is empty, i.e. it's a snapshot to be used by
@@ -822,17 +825,33 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
}
/*
- * Keep track of a new catalog changing transaction that has committed.
+ * Keep track of new catalog changing transactions that have committed.
+ *
+ * Before reaching CONSISTENT state, we use fast O(1) append since the array
+ * doesn't need to be sorted yet. After CONSISTENT, we maintain sorted order
+ * using a merge approach to avoid repeated full-array sorts at snapshot build.
*/
static void
-SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
+SnapBuildAddCommittedTxns(SnapBuild *builder,
+ const TransactionId *batch_xids,
+ int batch_cnt)
{
- Assert(TransactionIdIsValid(xid));
+ TransactionId *committed_xids;
+ size_t old_xcnt;
+
+ old_xcnt = builder->committed.xcnt;
- if (builder->committed.xcnt == builder->committed.xcnt_space)
+ /* Ensure we have space for all elements */
+ if (old_xcnt + batch_cnt > builder->committed.xcnt_space)
{
builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+ /*
+ * Repeat if we need more than 2x current space.
+ */
+ while (old_xcnt + batch_cnt > builder->committed.xcnt_space)
+ builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+
elog(DEBUG1, "increasing space for committed transactions to %u",
(uint32) builder->committed.xcnt_space);
@@ -840,12 +859,57 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
builder->committed.xcnt_space * sizeof(TransactionId));
}
+ committed_xids = builder->committed.xip;
+
/*
- * TODO: It might make sense to keep the array sorted here instead of
- * doing it every time we build a new snapshot. On the other hand this
- * gets called repeatedly when a transaction with subtransactions commits.
+ * Before CONSISTENT state, just append unsorted for O(1) performance. The
+ * array will be sorted once when transitioning to CONSISTENT.
*/
- builder->committed.xip[builder->committed.xcnt++] = xid;
+ if (builder->state < SNAPBUILD_CONSISTENT)
+ {
+ for (int i = 0; i < batch_cnt; i++)
+ {
+ Assert(TransactionIdIsValid(batch_xids[i]));
+ committed_xids[old_xcnt++] = batch_xids[i];
+ }
+ builder->committed.xcnt = old_xcnt;
+ }
+ else
+ {
+ /*
+ * After CONSISTENT, maintain sorted order via merge. Merge from the
+ * end to avoid overwriting unread data.
+ */
+ int old_idx = old_xcnt - 1;
+ int batch_idx = batch_cnt - 1;
+ int write_idx = old_xcnt + batch_cnt - 1;
+
+ while (old_idx >= 0 && batch_idx >= 0)
+ {
+ Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+
+ if (committed_xids[old_idx] > batch_xids[batch_idx])
+ {
+ committed_xids[write_idx--] = committed_xids[old_idx--];
+ }
+ else
+ {
+ /* Duplicates should never occur */
+ Assert(committed_xids[old_idx] != batch_xids[batch_idx]);
+ committed_xids[write_idx--] = batch_xids[batch_idx--];
+ }
+ }
+
+ /* Copy any remaining batch elements */
+ while (batch_idx >= 0)
+ {
+ Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+ committed_xids[write_idx--] = batch_xids[batch_idx--];
+ }
+
+ /* Old elements that weren't moved are already in correct position */
+ builder->committed.xcnt = old_xcnt + batch_cnt;
+ }
}
/*
@@ -947,6 +1011,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
TransactionId xmax = xid;
+ /*
+ * Collect XIDs that need tracking into a batch. We'll sort and merge
+ * them into committed.xip in one pass at the end.
+ */
+ TransactionId *batch_xids = NULL;
+ int batch_cnt = 0;
+
/*
* Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
* will they be part of a snapshot. So we don't need to record anything.
@@ -977,6 +1048,12 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+ if (nsubxacts > 0 || builder->building_full_snapshot ||
+ SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
+ {
+ batch_xids = palloc((nsubxacts + 1) * sizeof(TransactionId));
+ }
+
for (nxact = 0; nxact < nsubxacts; nxact++)
{
TransactionId subxid = subxacts[nxact];
@@ -993,7 +1070,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
elog(DEBUG1, "found subtransaction %u:%u with catalog changes",
xid, subxid);
- SnapBuildAddCommittedTxn(builder, subxid);
+ batch_xids[batch_cnt++] = subxid;
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
@@ -1007,7 +1084,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
*/
else if (needs_timetravel)
{
- SnapBuildAddCommittedTxn(builder, subxid);
+ batch_xids[batch_cnt++] = subxid;
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
}
@@ -1020,7 +1097,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
xid);
needs_snapshot = true;
needs_timetravel = true;
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
else if (sub_needs_timetravel)
{
@@ -1028,15 +1105,33 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
xid);
needs_timetravel = true;
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
else if (needs_timetravel)
{
elog(DEBUG2, "forced transaction %u to do timetravel", xid);
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
+ /*
+ * Sort and merge the batch into committed.xip. This maintains the
+ * invariant that committed.xip is globally sorted by raw uint32 order
+ * (xidComparator).
+ */
+ if (batch_cnt > 0)
+ {
+ /* Sort by raw uint32 order */
+ qsort(batch_xids, batch_cnt, sizeof(TransactionId), xidComparator);
+
+ /* Merge into global array */
+ SnapBuildAddCommittedTxns(builder, batch_xids, batch_cnt);
+ }
+
+ /* Free batch array if allocated (even if nothing was added to it) */
+ if (batch_xids != NULL)
+ pfree(batch_xids);
+
if (!needs_timetravel)
{
/* record that we cannot export a general snapshot anymore */
@@ -1305,6 +1400,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
Assert(TransactionIdIsNormal(builder->xmin));
Assert(TransactionIdIsNormal(builder->xmax));
+ /*
+ * Sort committed.xip before transitioning to CONSISTENT. During the
+ * pre-CONSISTENT phase, XIDs were appended unsorted for performance.
+ * Now we need sorted order for efficient snapshot building.
+ */
+ if (builder->committed.xcnt > 1)
+ qsort(builder->committed.xip, builder->committed.xcnt,
+ sizeof(TransactionId), xidComparator);
+
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
@@ -1402,6 +1506,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
TransactionIdPrecedesOrEquals(builder->next_phase_at,
running->oldestRunningXid))
{
+ /*
+ * Sort committed.xip before transitioning to CONSISTENT. During the
+ * pre-CONSISTENT phase, XIDs were appended unsorted for performance.
+ * Now we need sorted order for efficient snapshot building.
+ */
+ if (builder->committed.xcnt > 1)
+ qsort(builder->committed.xip, builder->committed.xcnt,
+ sizeof(TransactionId), xidComparator);
+
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
@@ -1889,6 +2002,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
pfree(builder->committed.xip);
builder->committed.xcnt_space = ondisk.builder.committed.xcnt;
builder->committed.xip = ondisk.builder.committed.xip;
+
+ /*
+ * Sort the restored array to ensure it's in xidComparator order. Old
+ * snapshot files may have been written with unsorted arrays.
+ */
+ qsort(builder->committed.xip, builder->committed.xcnt,
+ sizeof(TransactionId), xidComparator);
}
ondisk.builder.committed.xip = NULL;
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..164160e5e5e 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -115,16 +115,8 @@ struct SnapBuild
/*
* Array of committed transactions that have modified the catalog.
*
- * As this array is frequently modified we do *not* keep it in
- * xidComparator order. Instead we sort the array when building &
- * distributing a snapshot.
- *
- * TODO: It's unclear whether that reasoning has much merit. Every
- * time we add something here after becoming consistent will also
- * require distributing a snapshot. Storing them sorted would
- * potentially also make it easier to purge (but more complicated wrt
- * wraparound?). Should be improved if sorting while building the
- * snapshot shows up in profiles.
+ * Maintained in sorted order (by raw uint32 value) to allow efficient
+ * snapshot building without repeated sorting overhead.
*/
TransactionId *xip;
} committed;
--
2.51.0
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Optimize SnapBuild by maintaining committed.xip in sorted order
@ 2025-12-16 10:52 Xuneng Zhou <[email protected]>
parent: Xuneng Zhou <[email protected]>
0 siblings, 1 reply; 5+ messages in thread
From: Xuneng Zhou @ 2025-12-16 10:52 UTC (permalink / raw)
To: pgsql-hackers <[email protected]>; +Cc: Masahiko Sawada <[email protected]>
Hi,
On Tue, Dec 16, 2025 at 6:20 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> On Fri, Nov 7, 2025 at 12:55 PM Xuneng Zhou <[email protected]> wrote:
> >
> > Hi hackers,
> >
> > I'd like to propose an optimization for logical decoding's snapshot building
> > mechanism that eliminates repeated sorting overhead once a replication slot
> > reaches the CONSISTENT state.
> >
> > 1) Problem
> >
> > Currently, SnapBuildBuildSnapshot() sorts the entire committed.xip array on
> > every call using qsort(). Once a logical decoding slot reaches CONSISTENT
> > state, snapshots are built frequently—after every catalog-modifying transaction
> > commit. This repeated O(n log n) sorting becomes a performance bottleneck as
> > the committed transaction array grows.
> >
> > The existing code even has a TODO comment acknowledging this inefficiency:
> > "TODO: It's unclear whether that reasoning has much merit. Every time we add
> > something here after becoming consistent will also require distributing a
> > snapshot."
> >
> > 2) Proposed Solution
> >
> > Maintain sorted order via batch merge on each commit:
> >
> > 1. Collect all relevant XIDs (including subtransactions) into a batch array
> > 2. Sort the batch: O(m log m) where m is typically small (1 + nsubxacts)
> > 3. Merge into the global array: O(n + m) via reverse in-place merge
> >
> > While per-commit cost increases from O(m) to O(m log m + n), this is offset by
> > eliminating O(n log n) sorts at each snapshot build. Since CONSISTENT state
> > builds snapshots after each catalog-modifying commit, the amortized cost is
> > significantly lower.
> >
> > 3) Performance Impact
> >
> > Expected improvements in CONSISTENT state:
> > - Eliminates O(n log n) qsort on every snapshot build
> > - Replaces with O(m log m + n) merge per commit
> > - For typical cases where m << n and snapshots are frequent, this is a win
> >
> >
> > The benchmark results for this patch are shown below.
> >
> > 4) Configuration
> >
> > shared_buffers = '4GB'
> > wal_level = logical
> > max_replication_slots = 10
> > max_wal_senders = 10
> > log_min_messages = warning
> > max_connections = 600
> > autovacuum = off
> > checkpoint_timeout = 15min
> > max_wal_size = 4GB
> >
> >
> > 5) Workloads
> >
> > # Workload 1: DDL - High frequency, small commits
> > create_ddl_workload() {
> > local ROOT="$1"
> > cat >"$ROOT/ddl.sql" <<'SQL'
> > -- High-frequency small catalog commits
> > -- Each commit triggers SnapBuildBuildSnapshot
> > DO $$
> > DECLARE
> > tbl text := format('temp_%s_%s',
> > current_setting('application_name', true),
> > floor(random()*1e9)::int);
> > BEGIN
> > EXECUTE format('CREATE TEMP TABLE %I (id int, data text) ON COMMIT
> > DROP', tbl);
> > EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> > EXECUTE format('SELECT * FROM %I', tbl);
> > END$$;
> > SQL
> > }
> >
> > # Workload 2: MIXED - mix of DDL and DML, 50%-50%
> > create_mixed_workload() {
> > local ROOT="$1"
> > cat >"$ROOT/mixed_ddl.sql" <<'SQL'
> > -- DDL workload (catalog changes)
> > DO $$
> > DECLARE
> > tbl text := format('t_%s_%s',
> > current_setting('application_name', true),
> > floor(random()*1e9)::int);
> > BEGIN
> > EXECUTE format('CREATE TABLE %I (id int, data text) ON COMMIT DROP', tbl);
> > EXECUTE format('INSERT INTO %I VALUES (1, ''x'')', tbl);
> > END$$;
> >
> > SQL
> > cat >"$ROOT/mixed_dml.sql" <<'SQL'
> > -- DML workload (no catalog changes)
> > INSERT INTO app_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> > # Workload 3: CONTROL - Pure DML, no catalog changes
> > create_control_workload() {
> > local ROOT="$1"
> > cat >"$ROOT/control.sql" <<'SQL'
> >
> > -- Pure DML, no catalog changes
> > -- Should show no difference between baseline and patched
> > INSERT INTO control_data (id, data)
> > VALUES (floor(random()*1e6)::int, repeat('x', 100))
> > ON CONFLICT (id) DO UPDATE SET data = repeat('y', 100);
> > SQL
> > }
> >
> >
> > 6) Test strategy
> >
> > Clients: 100 concurrent connections per workload
> > Duration: 40 seconds per run
> > Repetitions: 1 run per workload type
> > Replication: Logical replication slot using `test_decoding` plugin
> > with `EXPORT_SNAPSHOT=true`
> >
> > Workload Types:
> > 1. DDL - Pure catalog churn (temp table create/drop)
> > 2. Mixed- 50% DDL + 50% DML (workload)
> > 3. Control - Pure DML (no catalog changes)
> >
> > Measurement:
> > - Metrics captured from pg_stat_replication_slots before/after each run
> > - Primary metrics: total_txns (transactions decoded) and total_bytes
> > (data volume)
> > - Compared baseline (vanilla PostgreSQL) vs patched (sorted
> > committed.xip optimization)
> >
> >
> > 7) Performance results
> >
> > DDL Workload: +235% Decoder Improvement
> > Decoder throughput: 713.76 → 2396.52 txns/sec (+235%)
> > Throughput (MB/s): 672.67 → 1747.22 MB/s (+159%)
> > Decode efficiency: 9.46% → 33.29% (+23.83 points)
> >
> > Mixed Workload: No Change
> > Decoder throughput: 2751.10 → 2730.00 txns/sec (0%)
> > Decode efficiency: 40.47% → 40.47% (unchanged)
> >
> > We can see that the qsort overhead in SnapBuild has been eliminated in
> > the flamegraphs in the mixed workload. However, the performance
> > improvement was not observed as in the DDL workload. My guess is that,
> > for DML workloads, ReorderBufferApplyChange has become the new
> > hotspot.
> >
> > DML Workload: No Change
> > Decoder throughput: 3062.57 → 3066.37 txns/sec (0%)
> > Decode efficiency: 49.97% → 49.97% (unchanged)
> >
> > === Workload: ddl ===
> > Client commits/sec:
> > Baseline: 7545.76 commits/sec
> > Patched: 7198.21 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> > Baseline: 713.76 txns/sec (672.67 MB/s)
> > Patched: 2396.52 txns/sec (1747.22 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> > Baseline: 309376 committed → 29264 decoded (9.46%)
> > Patched: 302325 committed → 100654 decoded (33.29%)
> >
> > Total decoded (all reps):
> > Baseline: 29264 txns (27579.53 MB)
> > Patched: 100654 txns (73383.10 MB)
> >
> > Decoder improvement: +235.00% (txns/sec)
> > Decoder improvement: +159.00% (MB/s)
> > Efficiency improvement: +23.83% points (more transactions decoded
> > per committed)
> >
> > === Workload: mixed ===
> > Client commits/sec:
> > Baseline: 6797.50 commits/sec
> > Patched: 6745.26 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> > Baseline: 2751.10 txns/sec (210.35 MB/s)
> > Patched: 2730.00 txns/sec (205.64 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> > Baseline: 285495 committed → 115546 decoded (40.47%)
> > Patched: 283301 committed → 114660 decoded (40.47%)
> >
> > Total decoded (all reps):
> > Baseline: 115546 txns (8834.71 MB)
> > Patched: 114660 txns (8636.96 MB)
> >
> > ≈ Decoder unchanged: 0.00% (txns/sec)
> >
> > === Workload: DML ===
> > Client commits/sec:
> > Baseline: 6129.24 commits/sec
> > Patched: 6136.93 commits/sec
> >
> > Decoder throughput (from pg_stat_replication_slots):
> > Baseline: 3062.57 txns/sec (0.26 MB/s)
> > Patched: 3066.37 txns/sec (0.26 MB/s)
> >
> > Transaction efficiency (decoded vs committed):
> > Baseline: 257428 committed → 128628 decoded (49.97%)
> > Patched: 251614 committed → 125721 decoded (49.97%)
> >
> > Total decoded (all reps):
> > Baseline: 128628 txns (10.98 MB)
> > Patched: 125721 txns (10.69 MB)
> >
> > ≈ Decoder unchanged: 0.00% (txns/sec)
> >
> > 8) Potential regression
> >
> > The potential regression point could be before the slot reaches the
> > CONSISTENT state, particularly when building_full_snapshot is set to
> > true. In this phase, all transactions including those that don’t
> > modify the catalog — must be added to the committed.xip array. These
> > XIDs don’t require later snapshot builds or sorting, so the
> > batch-insert logic increases the per-insert cost from O(1) to O(m + n)
> > without providing a direct benefit.
> >
> > However, the impact of this regression could be limited. The system
> > remains in the pre-CONSISTENT phase only briefly during initial
> > snapshot building, and the building_full_snapshot = true case is rare,
> > mainly used when creating replication slots with the EXPORT_SNAPSHOT
> > option.
> >
> > Once the slot becomes CONSISTENT, only catalog-modifying transactions
> > are tracked in committed.xip, and the patch reduces overall
> > snapshot-building overhead by eliminating repeated full-array sorts.
> >
> > We could also adopt a two-phase approach — keeping the current
> > behavior before reaching the CONSISTENT state and maintaining a sorted
> > array only after that point. This would preserve the performance
> > benefits while avoiding potential regressions. However, it would
> > introduce additional complexity and potential risks in handling the
> > state transitions.
> >
> > if (builder->state < SNAPBUILD_CONSISTENT)
> > {
> > /* ensure that only commits after this are getting replayed */
> > if (builder->start_decoding_at <= lsn)
> > builder->start_decoding_at = lsn + 1;
> >
> > /*
> > * If building an exportable snapshot, force xid to be tracked, even
> > * if the transaction didn't modify the catalog.
> > */
> > if (builder->building_full_snapshot)
> > {
> > needs_timetravel = true;
> > }
> > }
>
> After some consideration, a two-phase sorting strategy seems feasible
> to implement in a relatively straightforward manner. So it's done in
> v2. I also plan to run benchmarks to evaluate potential regressions of
> the original “sort-at-all-stages” approach of v1.
>
> > 9) Additional benefit
> > With this patch applied, we can optimize the SnapBuildPurgeOlderTxn to use
> > two binary searchs rather than interating and copying to find the interval
> > to keep in the sorted commited.xip array. [1]
> >
> > Feedbacks welcomed.
> >
> > [1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mai...
>
> V2 fixes several issues in v1, including a potential memory leak, type
> inconsistencies, and applies pgindent to the files.
>
V3 fixes a critical issue where the snapshot->xip array in
SnapBuildBuildSnapshot might not be sorted before reaching the
consistent state. Sorry for the noise here.
--
Best,
Xuneng
Attachments:
[application/octet-stream] v3-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch (11.3K, 2-v3-0001-Optimize-SnapBuild-by-maintaining-committed.xip-i.patch)
download | inline diff:
From a7c8d1da5d0496b2a7d05d52e37511a37075e13d Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 16 Dec 2025 15:15:14 +0800
Subject: [PATCH v3 1/2] Optimize SnapBuild by maintaining committed.xip in
sorted order
Maintain committed.xip in xidComparator order using a two-phase approach
that balances performance during slot initialization with efficiency during
steady-state decoding:
Pre-CONSISTENT phase:
- Use fast O(1) append for unsorted XIDs
- Sort once at transition to CONSISTENT state
- Avoids merge overhead during initial snapshot building
Post-CONSISTENT phase:
- Use per-commit batch insertion with sorted merge
- Collect all relevant XIDs for each commit
- Sort the batch once: O(m log m)
- Reverse-merge into the global array: O(N + m)
With this change, snapshot builds can skip the qsort() step and simply
memcpy the sorted array. While per-commit work increases from O(m) (plain
append) to O(m log m + N) (sort-and-merge) after CONSISTENT, eliminating
repeated O(N log N) sorts at each snapshot build significantly reduces
overall steady-state cost once CONSISTENT is reached and snapshots are
built frequently.
---
src/backend/replication/logical/snapbuild.c | 164 +++++++++++++++++--
src/include/replication/snapbuild_internal.h | 12 +-
2 files changed, 151 insertions(+), 25 deletions(-)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d6ab1e017eb..e0d7d3e0ae8 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -407,8 +407,25 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
builder->committed.xip,
builder->committed.xcnt * sizeof(TransactionId));
- /* sort so we can bsearch() */
- qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
+ if (builder->state < SNAPBUILD_CONSISTENT)
+ {
+ /*
+ * Pre-CONSISTENT: committed.xip is unsorted, sort the snapshot copy.
+ */
+ if (snapshot->xcnt > 1)
+ qsort(snapshot->xip, snapshot->xcnt,
+ sizeof(TransactionId), xidComparator);
+ }
+#ifdef USE_ASSERT_CHECKING
+ else
+ {
+ /*
+ * Post-CONSISTENT: committed.xip should already be sorted, verify.
+ */
+ for (int i = 1; i < snapshot->xcnt; i++)
+ Assert(snapshot->xip[i - 1] < snapshot->xip[i]);
+ }
+#endif
/*
* Initially, subxip is empty, i.e. it's a snapshot to be used by
@@ -822,17 +839,33 @@ SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn, Transact
}
/*
- * Keep track of a new catalog changing transaction that has committed.
+ * Keep track of new catalog changing transactions that have committed.
+ *
+ * Before reaching CONSISTENT state, we use fast O(1) append since the array
+ * doesn't need to be sorted yet. After CONSISTENT, we maintain sorted order
+ * using a merge approach to avoid repeated full-array sorts at snapshot build.
*/
static void
-SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
+SnapBuildAddCommittedTxns(SnapBuild *builder,
+ const TransactionId *batch_xids,
+ int batch_cnt)
{
- Assert(TransactionIdIsValid(xid));
+ TransactionId *committed_xids;
+ size_t old_xcnt;
+
+ old_xcnt = builder->committed.xcnt;
- if (builder->committed.xcnt == builder->committed.xcnt_space)
+ /* Ensure we have space for all elements */
+ if (old_xcnt + batch_cnt > builder->committed.xcnt_space)
{
builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+ /*
+ * Repeat if we need more than 2x current space.
+ */
+ while (old_xcnt + batch_cnt > builder->committed.xcnt_space)
+ builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
+
elog(DEBUG1, "increasing space for committed transactions to %u",
(uint32) builder->committed.xcnt_space);
@@ -840,12 +873,57 @@ SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
builder->committed.xcnt_space * sizeof(TransactionId));
}
+ committed_xids = builder->committed.xip;
+
/*
- * TODO: It might make sense to keep the array sorted here instead of
- * doing it every time we build a new snapshot. On the other hand this
- * gets called repeatedly when a transaction with subtransactions commits.
+ * Before CONSISTENT state, just append unsorted for O(1) performance. The
+ * array will be sorted once when transitioning to CONSISTENT.
*/
- builder->committed.xip[builder->committed.xcnt++] = xid;
+ if (builder->state < SNAPBUILD_CONSISTENT)
+ {
+ for (int i = 0; i < batch_cnt; i++)
+ {
+ Assert(TransactionIdIsValid(batch_xids[i]));
+ committed_xids[old_xcnt++] = batch_xids[i];
+ }
+ builder->committed.xcnt = old_xcnt;
+ }
+ else
+ {
+ /*
+ * After CONSISTENT, maintain sorted order via merge. Merge from the
+ * end to avoid overwriting unread data.
+ */
+ int old_idx = old_xcnt - 1;
+ int batch_idx = batch_cnt - 1;
+ int write_idx = old_xcnt + batch_cnt - 1;
+
+ while (old_idx >= 0 && batch_idx >= 0)
+ {
+ Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+
+ if (committed_xids[old_idx] > batch_xids[batch_idx])
+ {
+ committed_xids[write_idx--] = committed_xids[old_idx--];
+ }
+ else
+ {
+ /* Duplicates should never occur */
+ Assert(committed_xids[old_idx] != batch_xids[batch_idx]);
+ committed_xids[write_idx--] = batch_xids[batch_idx--];
+ }
+ }
+
+ /* Copy any remaining batch elements */
+ while (batch_idx >= 0)
+ {
+ Assert(TransactionIdIsValid(batch_xids[batch_idx]));
+ committed_xids[write_idx--] = batch_xids[batch_idx--];
+ }
+
+ /* Old elements that weren't moved are already in correct position */
+ builder->committed.xcnt = old_xcnt + batch_cnt;
+ }
}
/*
@@ -947,6 +1025,13 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
TransactionId xmax = xid;
+ /*
+ * Collect XIDs that need tracking into a batch. We'll sort and merge
+ * them into committed.xip in one pass at the end.
+ */
+ TransactionId *batch_xids = NULL;
+ int batch_cnt = 0;
+
/*
* Transactions preceding BUILDING_SNAPSHOT will neither be decoded, nor
* will they be part of a snapshot. So we don't need to record anything.
@@ -977,6 +1062,12 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
}
}
+ if (nsubxacts > 0 || builder->building_full_snapshot ||
+ SnapBuildXidHasCatalogChanges(builder, xid, xinfo))
+ {
+ batch_xids = palloc((nsubxacts + 1) * sizeof(TransactionId));
+ }
+
for (nxact = 0; nxact < nsubxacts; nxact++)
{
TransactionId subxid = subxacts[nxact];
@@ -993,7 +1084,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
elog(DEBUG1, "found subtransaction %u:%u with catalog changes",
xid, subxid);
- SnapBuildAddCommittedTxn(builder, subxid);
+ batch_xids[batch_cnt++] = subxid;
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
@@ -1007,7 +1098,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
*/
else if (needs_timetravel)
{
- SnapBuildAddCommittedTxn(builder, subxid);
+ batch_xids[batch_cnt++] = subxid;
if (NormalTransactionIdFollows(subxid, xmax))
xmax = subxid;
}
@@ -1020,7 +1111,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
xid);
needs_snapshot = true;
needs_timetravel = true;
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
else if (sub_needs_timetravel)
{
@@ -1028,15 +1119,33 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
xid);
needs_timetravel = true;
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
}
else if (needs_timetravel)
{
elog(DEBUG2, "forced transaction %u to do timetravel", xid);
- SnapBuildAddCommittedTxn(builder, xid);
+ batch_xids[batch_cnt++] = xid;
+ }
+
+ /*
+ * Sort and merge the batch into committed.xip. This maintains the
+ * invariant that committed.xip is globally sorted by raw uint32 order
+ * (xidComparator).
+ */
+ if (batch_cnt > 0)
+ {
+ /* Sort by raw uint32 order */
+ qsort(batch_xids, batch_cnt, sizeof(TransactionId), xidComparator);
+
+ /* Merge into global array */
+ SnapBuildAddCommittedTxns(builder, batch_xids, batch_cnt);
}
+ /* Free batch array if allocated (even if nothing was added to it) */
+ if (batch_xids != NULL)
+ pfree(batch_xids);
+
if (!needs_timetravel)
{
/* record that we cannot export a general snapshot anymore */
@@ -1305,6 +1414,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
Assert(TransactionIdIsNormal(builder->xmin));
Assert(TransactionIdIsNormal(builder->xmax));
+ /*
+ * Sort committed.xip before transitioning to CONSISTENT. During the
+ * pre-CONSISTENT phase, XIDs were appended unsorted for performance.
+ * Now we need sorted order for efficient snapshot building.
+ */
+ if (builder->committed.xcnt > 1)
+ qsort(builder->committed.xip, builder->committed.xcnt,
+ sizeof(TransactionId), xidComparator);
+
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
@@ -1402,6 +1520,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
TransactionIdPrecedesOrEquals(builder->next_phase_at,
running->oldestRunningXid))
{
+ /*
+ * Sort committed.xip before transitioning to CONSISTENT. During the
+ * pre-CONSISTENT phase, XIDs were appended unsorted for performance.
+ * Now we need sorted order for efficient snapshot building.
+ */
+ if (builder->committed.xcnt > 1)
+ qsort(builder->committed.xip, builder->committed.xcnt,
+ sizeof(TransactionId), xidComparator);
+
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
@@ -1889,6 +2016,13 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
pfree(builder->committed.xip);
builder->committed.xcnt_space = ondisk.builder.committed.xcnt;
builder->committed.xip = ondisk.builder.committed.xip;
+
+ /*
+ * Sort the restored array to ensure it's in xidComparator order. Old
+ * snapshot files may have been written with unsorted arrays.
+ */
+ qsort(builder->committed.xip, builder->committed.xcnt,
+ sizeof(TransactionId), xidComparator);
}
ondisk.builder.committed.xip = NULL;
diff --git a/src/include/replication/snapbuild_internal.h b/src/include/replication/snapbuild_internal.h
index 3b915dc8793..164160e5e5e 100644
--- a/src/include/replication/snapbuild_internal.h
+++ b/src/include/replication/snapbuild_internal.h
@@ -115,16 +115,8 @@ struct SnapBuild
/*
* Array of committed transactions that have modified the catalog.
*
- * As this array is frequently modified we do *not* keep it in
- * xidComparator order. Instead we sort the array when building &
- * distributing a snapshot.
- *
- * TODO: It's unclear whether that reasoning has much merit. Every
- * time we add something here after becoming consistent will also
- * require distributing a snapshot. Storing them sorted would
- * potentially also make it easier to purge (but more complicated wrt
- * wraparound?). Should be improved if sorting while building the
- * snapshot shows up in profiles.
+ * Maintained in sorted order (by raw uint32 value) to allow efficient
+ * snapshot building without repeated sorting overhead.
*/
TransactionId *xip;
} committed;
--
2.51.0
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Optimize SnapBuild by maintaining committed.xip in sorted order
@ 2026-04-09 02:04 Yogesh Sharma <[email protected]>
parent: Xuneng Zhou <[email protected]>
0 siblings, 1 reply; 5+ messages in thread
From: Yogesh Sharma @ 2026-04-09 02:04 UTC (permalink / raw)
To: [email protected]; +Cc: Xuneng Zhou <[email protected]>
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
At a highlevel it looks good.
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Optimize SnapBuild by maintaining committed.xip in sorted order
@ 2026-04-09 07:33 Xuneng Zhou <[email protected]>
parent: Yogesh Sharma <[email protected]>
0 siblings, 0 replies; 5+ messages in thread
From: Xuneng Zhou @ 2026-04-09 07:33 UTC (permalink / raw)
To: Yogesh Sharma <[email protected]>; +Cc: [email protected]; Xuneng Zhou <[email protected]>
Hi Yogesh,
On Thu, Apr 9, 2026 at 10:05 AM Yogesh Sharma
<[email protected]> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world: not tested
> Implements feature: not tested
> Spec compliant: not tested
> Documentation: not tested
>
> At a highlevel it looks good.
Thanks for looking into this. It seems the community has reached a
consensus (maybe partially) on this and the related patch toward the
end of the thread [1]: they try to solve a relatively small problem in
a rather complex way. I’m therefore considering withdrawing both
patches from the current commitfest.
[1] https://www.postgresql.org/message-id/flat/CABPTF7V9gcpTLrOY0fG4YontoHjVg8YrbmiH4XB_5PT6K56xhg%40mai...
--
Best,
Xuneng
^ permalink raw reply [nested|flat] 5+ messages in thread
end of thread, other threads:[~2026-04-09 07:33 UTC | newest]
Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-11-07 04:55 Optimize SnapBuild by maintaining committed.xip in sorted order Xuneng Zhou <[email protected]>
2025-12-16 10:20 ` Xuneng Zhou <[email protected]>
2025-12-16 10:52 ` Xuneng Zhou <[email protected]>
2026-04-09 02:04 ` Yogesh Sharma <[email protected]>
2026-04-09 07:33 ` Xuneng Zhou <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox