Add logical_decoding_spill_limit to cap spill file disk usage per slot

public inbox for [email protected]  
help / color / mirror / Atom feed

Add logical_decoding_spill_limit to cap spill file disk usage per slot
6+ messages / 3 participants
[nested] [flat]

* Add logical_decoding_spill_limit to cap spill file disk usage per slot
@ 2026-03-23 13:19 shawn wang <[email protected]>
  2026-03-23 13:23 ` Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
  2026-03-23 14:24 ` Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot Bharath Rupireddy <[email protected]>
  2026-03-26 05:33 ` RE: Add logical_decoding_spill_limit to cap spill file disk usage per slot Hayato Kuroda (Fujitsu) <[email protected]>
  0 siblings, 3 replies; 6+ messages in thread

From: shawn wang @ 2026-03-23 13:19 UTC (permalink / raw)
  To: [email protected]

Hi hackers,

 == Motivation ==

We operate a fleet of PostgreSQL instances with logical replication. On
several occasions, we have experienced production incidents where logical
decoding spill files (pg_replslot/<slot>/xid-*.spill) grew uncontrollably —
consuming tens of gigabytes and eventually filling up the data disk. This
caused the entire instance to go read-only, impacting not just replication
but all write workloads.

The typical scenario is a large transaction (e.g. bulk data load or a
long-running DDL) combined with a subscriber that is either slow or
temporarily disconnected. The reorder buffer exceeds
logical_decoding_work_mem and starts spilling, but there is no upper bound
on how much can be spilled. The only backstop today is the OS returning
ENOSPC, at which point the damage is already done.

We looked for existing protections:

   - max_slot_wal_keep_size: limits WAL retention, but does not affect
   spill files at all.
   - logical_decoding_work_mem: controls *when* spilling starts, but not
   *how much* can be spilled.
   - There is no existing GUC, patch, or commitfest entry that addresses
   spill file disk quota.

The "Report reorder buffer size" patch (CF #6053, by Ashutosh Bapat)
improves observability of reorder buffer state, which is complementary —
but observability alone cannot prevent disk-full incidents.

== Proposed solution ==

The attached patch adds a new GUC:
logical_decoding_spill_limit (integer, unit kB, default 0)

When set to a positive value, it limits the total size of on-disk spill
files per replication slot. Key design points:

   1. Tracking: We add two new fields: - ReorderBuffer.spillBytesOnDisk —
   current total on-disk spill size for this slot (unlike spillBytes which is
   a cumulative statistic counter, this is a live gauge). -
   ReorderBufferTXN.serialized_size — per-transaction on-disk size, so we can
   accurately decrement the global counter during cleanup.
   2. Increment: In ReorderBufferSerializeChange(), after a successful
   write(), both counters are incremented by the size written.
   3. Decrement: In ReorderBufferRestoreCleanup(), when spill files are
   unlinked, the global counter is decremented by the transaction's
   serialized_size.
   4. Enforcement: In ReorderBufferCheckMemoryLimit(), before calling
   ReorderBufferSerializeTXN(), we check: if (spillBytesOnDisk + txn->size >
   spill_limit) ereport(ERROR, ...) This is only checked on the spill-to-disk
   path — not on the streaming path (which involves no disk I/O).
   5. Behavior on limit exceeded: An ERROR is raised with
   ERRCODE_CONFIGURATION_LIMIT_EXCEEDED. The walsender exits, but the slot's
   restart_lsn and confirmed_flush are preserved. The subscriber can reconnect
   after the DBA:
      1. increases logical_decoding_spill_limit, or
      2. increases logical_decoding_work_mem (to reduce spilling), or
      3. switches to a streaming-capable output plugin (which avoids
      spilling entirely).
   6. Default 0 means unlimited — fully backward compatible.

== Why per-slot, not global? ==

Each ReorderBuffer instance lives in a single walsender process and
corresponds to exactly one replication slot. A per-slot limit is:

   - Lock-free (no shared memory coordination needed)
   - Simple to reason about (each slot has its own budget)
   - Sufficient to protect against disk-full (the DBA sets the limit based
   on available disk / number of slots)

A global (cross-slot) limit could be layered on top later if needed, but
would require shared-memory counters with spinlock/atomic protection.

== Performance impact ==

   - Hot path (in-memory change queuing): zero overhead.
   - Spill path: one integer comparison before serialization, one integer
   addition after write() — negligible compared to the I/O cost.
   - Cleanup path: one integer subtraction after unlink() — negligible.

Looking forward to feedback.
Thanks,
Shawn.

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot
  2026-03-23 13:19 Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
@ 2026-03-23 13:23 ` shawn wang <[email protected]>
  2 siblings, 0 replies; 6+ messages in thread

From: shawn wang @ 2026-03-23 13:23 UTC (permalink / raw)
  To: [email protected]

This is the patch.

shawn wang <[email protected]> 于2026年3月23日周一 21:19写道：

> Hi hackers,
>
>  == Motivation ==
>
> We operate a fleet of PostgreSQL instances with logical replication. On
> several occasions, we have experienced production incidents where logical
> decoding spill files (pg_replslot/<slot>/xid-*.spill) grew uncontrollably —
> consuming tens of gigabytes and eventually filling up the data disk. This
> caused the entire instance to go read-only, impacting not just replication
> but all write workloads.
>
> The typical scenario is a large transaction (e.g. bulk data load or a
> long-running DDL) combined with a subscriber that is either slow or
> temporarily disconnected. The reorder buffer exceeds
> logical_decoding_work_mem and starts spilling, but there is no upper bound
> on how much can be spilled. The only backstop today is the OS returning
> ENOSPC, at which point the damage is already done.
>
> We looked for existing protections:
>
>    - max_slot_wal_keep_size: limits WAL retention, but does not affect
>    spill files at all.
>    - logical_decoding_work_mem: controls *when* spilling starts, but not
>    *how much* can be spilled.
>    - There is no existing GUC, patch, or commitfest entry that addresses
>    spill file disk quota.
>
>
> The "Report reorder buffer size" patch (CF #6053, by Ashutosh Bapat)
> improves observability of reorder buffer state, which is complementary —
> but observability alone cannot prevent disk-full incidents.
>
> == Proposed solution ==
>
> The attached patch adds a new GUC:
> logical_decoding_spill_limit (integer, unit kB, default 0)
>
> When set to a positive value, it limits the total size of on-disk spill
> files per replication slot. Key design points:
>
>    1. Tracking: We add two new fields: - ReorderBuffer.spillBytesOnDisk —
>    current total on-disk spill size for this slot (unlike spillBytes which is
>    a cumulative statistic counter, this is a live gauge). -
>    ReorderBufferTXN.serialized_size — per-transaction on-disk size, so we can
>    accurately decrement the global counter during cleanup.
>    2. Increment: In ReorderBufferSerializeChange(), after a successful
>    write(), both counters are incremented by the size written.
>    3. Decrement: In ReorderBufferRestoreCleanup(), when spill files are
>    unlinked, the global counter is decremented by the transaction's
>    serialized_size.
>    4. Enforcement: In ReorderBufferCheckMemoryLimit(), before calling
>    ReorderBufferSerializeTXN(), we check: if (spillBytesOnDisk + txn->size >
>    spill_limit) ereport(ERROR, ...) This is only checked on the spill-to-disk
>    path — not on the streaming path (which involves no disk I/O).
>    5. Behavior on limit exceeded: An ERROR is raised with
>    ERRCODE_CONFIGURATION_LIMIT_EXCEEDED. The walsender exits, but the slot's
>    restart_lsn and confirmed_flush are preserved. The subscriber can reconnect
>    after the DBA:
>       1. increases logical_decoding_spill_limit, or
>       2. increases logical_decoding_work_mem (to reduce spilling), or
>       3. switches to a streaming-capable output plugin (which avoids
>       spilling entirely).
>    6. Default 0 means unlimited — fully backward compatible.
>
> == Why per-slot, not global? ==
>
> Each ReorderBuffer instance lives in a single walsender process and
> corresponds to exactly one replication slot. A per-slot limit is:
>
>    - Lock-free (no shared memory coordination needed)
>    - Simple to reason about (each slot has its own budget)
>    - Sufficient to protect against disk-full (the DBA sets the limit
>    based on available disk / number of slots)
>
> A global (cross-slot) limit could be layered on top later if needed, but
> would require shared-memory counters with spinlock/atomic protection.
>
> == Performance impact ==
>
>    - Hot path (in-memory change queuing): zero overhead.
>    - Spill path: one integer comparison before serialization, one integer
>    addition after write() — negligible compared to the I/O cost.
>    - Cleanup path: one integer subtraction after unlink() — negligible.
>
>
> Looking forward to feedback.
> Thanks,
> Shawn.
>


Attachments:

  [application/octet-stream] 0001-Add-logical_decoding_spill_limit-GUC-to-cap-spill-file-limit.patch (11.4K, 3-0001-Add-logical_decoding_spill_limit-GUC-to-cap-spill-file-limit.patch)
  download | inline diff:
From d93e1b9be047f6ba484e1f105a94505b7d261c5b Mon Sep 17 00:00:00 2001
From: Shawn Wang <[email protected]>
Date: Mon, 23 Mar 2026 20:46:08 +0800
Subject: [PATCH] Add logical_decoding_spill_limit GUC to cap spill file disk
 usage.

When logical decoding exceeds logical_decoding_work_mem, changes are
spilled to disk under pg_replslot/<slot>/.  Currently there is no
limit on the total size of these spill files, which can lead to
uncontrolled disk space consumption for large or numerous concurrent
transactions.

Add a new GUC logical_decoding_spill_limit (integer, unit kB,
default 0 meaning unlimited) that limits the total on-disk spill file
size per replication slot.  The implementation tracks the current
on-disk footprint in ReorderBuffer.spillBytesOnDisk (incremented
after each successful write in ReorderBufferSerializeChange,
decremented when spill files are removed in
ReorderBufferRestoreCleanup) and per-transaction in
ReorderBufferTXN.serialized_size.

Before serializing a transaction to disk, ReorderBufferCheckMemoryLimit
checks whether the projected write would exceed the configured limit.
If so, it raises an ERROR.  The replication slot state (restart_lsn,
confirmed_flush) is preserved, so the subscriber can reconnect after
the administrator increases the limit, raises logical_decoding_work_mem,
or switches to a streaming-capable output plugin.

The check is only on the spill-to-disk path (not the streaming path,
which involves no disk I/O) and only when the limit is set (> 0), so
there is zero overhead on the normal hot path and full backward
compatibility with existing configurations.
---
 doc/src/sgml/config.sgml                      | 24 ++++++++
 .../replication/logical/reorderbuffer.c       | 56 +++++++++++++++++++
 src/backend/utils/misc/guc_parameters.dat     | 10 ++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/reorderbuffer.h       | 17 ++++++
 5 files changed, 109 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fbd3..e6fc9653fef 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2120,6 +2120,30 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-spill-limit" xreflabel="logical_decoding_spill_limit">
+      <term><varname>logical_decoding_spill_limit</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_spill_limit</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum total size of on-disk spill files that logical
+        decoding may use per replication slot. When changes exceed the memory
+        limit set by <varname>logical_decoding_work_mem</varname>, they are
+        spilled to disk. This parameter limits the total amount of disk space
+        used by those spill files. If the limit would be exceeded, the
+        walsender process raises an error and disconnects. The replication
+        slot's state is preserved, so the subscriber can reconnect after the
+        administrator increases this limit or switches to a streaming-capable
+        output plugin.
+        If this value is specified without units, it is taken as kilobytes.
+        The default value is <literal>0</literal>, which means there is no
+        limit on spill file disk usage.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-timestamp-buffers" xreflabel="commit_timestamp_buffers">
       <term><varname>commit_timestamp_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 682d13c9f22..d9f82c2ac48 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -72,6 +72,15 @@
  *	  counter is updated; however transactions with size 0 are not stored in
  *	  the heap, because they have no changes to evict.
  *
+ *	  To prevent spill files from consuming unbounded disk space, the
+ *	  logical_decoding_spill_limit GUC can be set to limit the total size
+ *	  of spill files per replication slot.  We track the current on-disk
+ *	  footprint in ReorderBuffer.spillBytesOnDisk (incremented on each
+ *	  successful write, decremented when spill files are cleaned up) and
+ *	  per-transaction in ReorderBufferTXN.serialized_size.  Before
+ *	  serializing a transaction, we check whether the projected write would
+ *	  exceed the configured limit and raise an ERROR if so.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -224,6 +233,7 @@ typedef struct ReorderBufferDiskChange
  * like.
  */
 int			logical_decoding_work_mem;
+int			logical_decoding_spill_limit;
 static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* GUC variable */
@@ -3885,6 +3895,10 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
  * If debug_logical_replication_streaming is set to "immediate", stream or
  * serialize the changes immediately.
  *
+ * When spilling to disk, if logical_decoding_spill_limit is set (> 0),
+ * we check whether the projected write would exceed the configured limit
+ * and raise an ERROR if so, to prevent unbounded disk usage.
+ *
  * XXX At this point we select the transactions until we reach under the memory
  * limit, but we might also adapt a more elaborate eviction strategy - for example
  * evicting enough transactions to free certain fraction (e.g. 50%) of the memory
@@ -3964,6 +3978,36 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
 				continue;
 
+			/*
+			 * Check the spill-to-disk size limit before actually serializing.
+			 * We use the transaction's in-memory size as an estimate of how
+			 * much will be written, which is a reasonable approximation.
+			 *
+			 * We only check when logical_decoding_spill_limit is set (> 0).
+			 * When the limit would be exceeded, raise an ERROR.  The
+			 * walsender will exit, but the replication slot's restart_lsn is
+			 * preserved so the subscriber can reconnect after the DBA
+			 * increases the limit or switches to a streaming-capable output
+			 * plugin.
+			 */
+			if (logical_decoding_spill_limit > 0 &&
+				rb->spillBytesOnDisk + txn->size >
+				(Size) logical_decoding_spill_limit * 1024)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+						 errmsg("logical decoding spill file size limit exceeded"),
+						 errdetail("Current on-disk spill size is %zu bytes, "
+								   "transaction to spill is %zu bytes, "
+								   "limit is %d kB.",
+								   rb->spillBytesOnDisk, txn->size,
+								   logical_decoding_spill_limit),
+						 errhint("Consider increasing %s, %s, "
+								 "or using a streaming-capable output plugin.",
+								 "logical_decoding_spill_limit",
+								 "logical_decoding_work_mem")));
+			}
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
@@ -4286,6 +4330,13 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * Update the on-disk spill size accounting for both the transaction and
+	 * the reorder buffer.  This is used to enforce logical_decoding_spill_limit.
+	 */
+	txn->serialized_size += ondisk->size;
+	rb->spillBytesOnDisk += ondisk->size;
+
 	/*
 	 * Keep the transaction's final_lsn up to date with each change we send to
 	 * disk, so that ReorderBufferRestoreCleanup works correctly.  (We used to
@@ -4872,6 +4923,11 @@ ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					(errcode_for_file_access(),
 					 errmsg("could not remove file \"%s\": %m", path)));
 	}
+
+	/* Update the on-disk spill size accounting. */
+	Assert(rb->spillBytesOnDisk >= txn->serialized_size);
+	rb->spillBytesOnDisk -= txn->serialized_size;
+	txn->serialized_size = 0;
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0c9854ad8fc..3168d23f387 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1849,6 +1849,16 @@
   boot_val => 'false',
 },
 
+{ name => 'logical_decoding_spill_limit', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the maximum total size of spill files for logical decoding.',
+  long_desc => 'Limits the total disk space used by spill files per replication slot during logical decoding. 0 means no limit.',
+  flags => 'GUC_UNIT_KB',
+  variable => 'logical_decoding_spill_limit',
+  boot_val => '0',
+  min => '0',
+  max => 'MAX_KILOBYTES',
+},
+
 { name => 'logical_decoding_work_mem', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
   short_desc => 'Sets the maximum memory to be used for logical decoding.',
   long_desc => 'This much memory can be used by each internal reorder buffer before spilling to disk.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4abe6c0077..04b4fb73737 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -149,6 +149,8 @@
 #maintenance_work_mem = 64MB            # min 64kB
 #autovacuum_work_mem = -1               # min 64kB, or -1 to use maintenance_work_mem
 #logical_decoding_work_mem = 64MB       # min 64kB
+#logical_decoding_spill_limit = 0       # limits spill file disk usage per slot;
+                                        # 0 means no limit
 #max_stack_depth = 2MB                  # min 100kB
 #shared_memory_type = mmap              # the default is the first option
                                         # supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 2d717a9e152..04249a99d82 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -25,6 +25,7 @@
 
 /* GUC variables */
 extern PGDLLIMPORT int logical_decoding_work_mem;
+extern PGDLLIMPORT int logical_decoding_spill_limit;
 extern PGDLLIMPORT int debug_logical_replication_streaming;
 
 /* possible values for debug_logical_replication_streaming */
@@ -461,6 +462,14 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
+	/*
+	 * Size of this transaction's changes currently serialized to disk (in
+	 * bytes).  Tracked per-transaction so that we can accurately update the
+	 * ReorderBuffer-level spillBytesOnDisk counter when spill files are
+	 * cleaned up.
+	 */
+	Size		serialized_size;
+
 	/*
 	 * Private data pointer of the output plugin.
 	 */
@@ -685,6 +694,14 @@ struct ReorderBuffer
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillBytes;		/* amount of data spilled to disk */
 
+	/*
+	 * Current total size of spill files on disk for this reorder buffer (in
+	 * bytes).  Unlike spillBytes which is a cumulative statistic counter, this
+	 * tracks the actual on-disk footprint right now and is decremented when
+	 * spill files are cleaned up.  Used to enforce logical_decoding_spill_limit.
+	 */
+	Size		spillBytesOnDisk;
+
 	/* Statistics about transactions streamed to the decoding output plugin */
 	int64		streamTxns;		/* number of transactions streamed */
 	int64		streamCount;	/* streaming invocation counter */
-- 
2.43.7



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot
  2026-03-23 13:19 Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
@ 2026-03-23 14:24 ` Bharath Rupireddy <[email protected]>
  2026-04-03 16:14   ` Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
  2 siblings, 1 reply; 6+ messages in thread

From: Bharath Rupireddy @ 2026-03-23 14:24 UTC (permalink / raw)
  To: shawn wang <[email protected]>; +Cc: [email protected]

Hi,

On Mon, Mar 23, 2026 at 6:20 AM shawn wang <[email protected]> wrote:
>
> Hi hackers,

Thank you for proposing this new feature.

>  == Motivation ==
>
> We operate a fleet of PostgreSQL instances with logical replication. On several occasions, we have experienced production incidents where logical decoding spill files (pg_replslot/<slot>/xid-*.spill) grew uncontrollably — consuming tens of gigabytes and eventually filling up the data disk. This caused the entire instance to go read-only, impacting not just replication but all write workloads.
>
> The typical scenario is a large transaction (e.g. bulk data load or a long-running DDL) combined with a subscriber that is either slow or temporarily disconnected. The reorder buffer exceeds logical_decoding_work_mem and starts spilling, but there is no upper bound on how much can be spilled. The only backstop today is the OS returning ENOSPC, at which point the damage is already done.

Having a lot of spill files also increases crash/recovery times.
However, files spilling to disk causing no-space-left-on-disk issues
leading to downtime applies to WAL files, historical catalog snapshot
files, subtransaction overflow files, CLOG (and all the subsystems
backed by SLRU data structure), etc. - basically any Postgres
subsystem writing files to disk. I'm a bit worried that we may end up
solving disk space issues, which IMHO are outside of the database
scope, in the database. Others may have different opinions though.

How common is this issue? Could you please add a test case to the
proposed patch that without this feature would otherwise hit the issue
described?

Having said that, were alternatives like disabling subscriptions when
seen occupying the disk space considered?

> We looked for existing protections:
>
> max_slot_wal_keep_size: limits WAL retention, but does not affect spill files at all.
> logical_decoding_work_mem: controls *when* spilling starts, but not *how much* can be spilled.
> There is no existing GUC, patch, or commitfest entry that addresses spill file disk quota.

Interesting!

> The "Report reorder buffer size" patch (CF #6053, by Ashutosh Bapat) improves observability of reorder buffer state, which is complementary — but observability alone cannot prevent disk-full incidents.

With the proposed reorder buffer stats above, would it be possible to
have a monitoring solution (an extension or a tool) to disable
subscriptions and notify the admin? Would something like this work?

> == Proposed solution ==
>
> The attached patch adds a new GUC:
> logical_decoding_spill_limit (integer, unit kB, default 0)
>
> When set to a positive value, it limits the total size of on-disk spill files per replication slot. Key design points:
>
> Tracking: We add two new fields: - ReorderBuffer.spillBytesOnDisk — current total on-disk spill size for this slot (unlike spillBytes which is a cumulative statistic counter, this is a live gauge). - ReorderBufferTXN.serialized_size — per-transaction on-disk size, so we can accurately decrement the global counter during cleanup.
> Increment: In ReorderBufferSerializeChange(), after a successful write(), both counters are incremented by the size written.
> Decrement: In ReorderBufferRestoreCleanup(), when spill files are unlinked, the global counter is decremented by the transaction's serialized_size.
> Enforcement: In ReorderBufferCheckMemoryLimit(), before calling ReorderBufferSerializeTXN(), we check: if (spillBytesOnDisk + txn->size > spill_limit) ereport(ERROR, ...) This is only checked on the spill-to-disk path — not on the streaming path (which involves no disk I/O).
> Behavior on limit exceeded: An ERROR is raised with ERRCODE_CONFIGURATION_LIMIT_EXCEEDED. The walsender exits, but the slot's restart_lsn and confirmed_flush are preserved. The subscriber can reconnect after the DBA:
>
> increases logical_decoding_spill_limit, or
> increases logical_decoding_work_mem (to reduce spilling), or
> switches to a streaming-capable output plugin (which avoids spilling entirely).

When the logical_decoding_spill_limit is exceeded, ERRORing out in the
walsender is even more problematic, right? The replication slot would
be inactive, causing bloat and preventing tuple freezing, WAL files
growth and eventually the system may hit disk-space issues - it is
like "we avoided disk space issues for one subsystem, but introduced
it for another". This looks a bit problematic IMHO. Others may have
different opinions though.

--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot
  2026-03-23 13:19 Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
  2026-03-23 14:24 ` Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot Bharath Rupireddy <[email protected]>
@ 2026-04-03 16:14   ` shawn wang <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: shawn wang @ 2026-04-03 16:14 UTC (permalink / raw)
  To: Bharath Rupireddy <[email protected]>; +Cc: [email protected]

Hi Bharath,

Thank you for the thorough and thoughtful review!


> Having a lot of spill files also increases crash/recovery times.
> However, files spilling to disk causing no-space-left-on-disk issues
> leading to downtime applies to WAL files, historical catalog snapshot
> files, subtransaction overflow files, CLOG (and all the subsystems
> backed by SLRU data structure), etc. - basically any Postgres
> subsystem writing files to disk. I'm a bit worried that we may end up
> solving disk space issues, which IMHO are outside of the database
> scope, in the database. Others may have different opinions though.

That's a fair philosophical point.  However, I'd argue that spill files
are uniquely unprotected compared to the other subsystems you mentioned:

  - WAL files → max_slot_wal_keep_size
  - Temp files → temp_file_limit
  - SLRU/CLOG → bounded by design (sized proportionally to active XIDs)

Spill files are the gap: they can grow proportionally to the *data
volume* of in-progress transactions, not to the number of transactions,
and there is no existing upper bound.  A single bulk-load transaction
replicated via logical decoding can generate tens of GBs of spill files
with no way for the DBA to cap it.

So this is not about solving a general OS-level disk problem in the
database; it is about providing the same kind of safety net that already
exists for WAL (max_slot_wal_keep_size) and temp files (temp_file_limit)
— a per-subsystem guardrail that the DBA can configure.



> How common is this issue? Could you please add a test case to the
> proposed patch that without this feature would otherwise hit the issue
> described?

Typically triggered by bulk data migrations or long-running DDL on large
tables,
combined with a subscriber that fell behind.  Each incident caused the
instance
to go read-only, affecting all workloads — not just replication.

We have also seen reports of this in pgsql-general and on community
forums, though the root cause is not always identified as spill files.

Absolutely.  I will add a TAP test that:
  1. Creates 10 logical replication slots with test_decoding.
  2. Starts a large transaction that generates spill files exceeding
     the configured logical_decoding_spill_limit.
  3. Verifies that the walsender ERRORs out with the expected
     ERRCODE_CONFIGURATION_LIMIT_EXCEEDED.
  4. Verifies that spill files are cleaned up after the error.

I will include the 4 in v2.

Test case:
1. build a table:

psql -p 5432 -d postgres -c "
CREATE TABLE spill_load (
    id      bigint,
    payload text
);
"

2. build some slots:
#base

for i in $(seq 1 10); do
    psql -p 5432 -d postgres -c \
        "SELECT pg_create_logical_replication_slot('spill_slot_${i}',
'test_decoding');"
done

psql -p 5432 -d postgres -c \
    "SELECT slot_name, plugin, active FROM pg_replication_slots ORDER BY
slot_name;"

3. start pg_recvlogical:
#bash
mkdir -p /tmp/spill_test/pids

for i in $(seq 1 10); do
    pg_recvlogical \
        --dbname="port=5432 dbname=postgres" \
        --slot="spill_slot_${i}" \
        --start \
        --no-loop \
        --status-interval=0 \
        --file=/dev/null &
    echo $! > /tmp/spill_test/pids/recv_${i}.pid
    echo "Started pg_recvlogical for spill_slot_${i}, PID=$!"
done

sleep 2

psql -p 5432 -d postgres -c \
    "SELECT slot_name, active, active_pid
     FROM pg_replication_slots
     WHERE slot_name LIKE 'spill_slot_%'
     ORDER BY slot_name;"

4. make a big transaction:

psql -p 5432 -d postgres -c "
BEGIN;
INSERT INTO spill_load
    SELECT i, repeat('X', 100000)
    FROM generate_series(1, 10000000) i;
COMMIT;

"

> Having said that, were alternatives like disabling subscriptions when
> seen occupying the disk space considered?

Yes, we considered monitoring-based approaches, but they have inherent
limitations:

  1. Reaction latency: Even with 1-second polling, a high-throughput
     bulk load can spill gigabytes between two polls.  The spill rate
     can easily exceed 100 MB/s, so a 5-second monitoring gap means
     500 MB of uncontrolled growth.

  2. External dependency: Relying on an external tool means the
     protection is not available by default.  Every deployment must
     set up and maintain the monitoring — and if it fails or is
     misconfigured, the disk-full scenario still occurs.

  3. No subscriber-side control: pg_recvlogical users and third-party
     CDC tools (Debezium, etc.) that consume directly from the
     publisher have no "subscription" to disable.

  4. Disabling a subscription does not immediately stop spilling:
     the walsender continues processing WAL records already in flight.

The proposed GUC acts as a synchronous, in-process safety valve — it
checks the limit at the exact point where the next spill write would
occur, with zero latency.  Monitoring and the GUC are complementary:
monitoring for alerting, the GUC as a hard backstop.


> When the logical_decoding_spill_limit is exceeded, ERRORing out in the
> walsender is even more problematic, right? The replication slot would
> be inactive, causing bloat and preventing tuple freezing, WAL files
> growth and eventually the system may hit disk-space issues - it is
> like "we avoided disk space issues for one subsystem, but introduced
> it for another". This looks a bit problematic IMHO. Others may have
> different opinions though.

This is a great point and deserves a careful answer.

First, the "inactive slot causes WAL growth" problem already has a
well-established solution: max_slot_wal_keep_size.  When this is
configured (which is recommended practice), an inactive slot will be
invalidated once WAL retention exceeds the limit, releasing both WAL
files and catalog bloat.  So the combination of:

logical_decoding_spill_limit + max_slot_wal_keep_size

provides defense in depth: the spill limit prevents the data disk from
filling up with spill files, while max_slot_wal_keep_size prevents WAL
disk from growing unboundedly due to an inactive slot.

Second, compare the two failure modes:

  (a) WITHOUT spill limit: disk fills up → the ENTIRE instance goes
      read-only or shutdown.  All workloads stop.  Recovery requires manual
      intervention (rm spill files, restart).  Data loss is possible
      if WAL writes also fail.

  (b) WITH spill limit: walsender ERRORs → only replication is
      interrupted.  The primary continues serving all read/write
      workloads normally.  The DBA receives a clear error message
      and can act: increase the limit, increase work_mem, enable
      streaming, or investigate the large transaction.

cenario (b) is strictly better than (a): it converts an
instance-wide outage into a replication-only interruption with a
clear, actionable error message.

Additionally, in the updated patch (v2), I have ensured that spill

files are properly cleaned up in the error path (via
WalSndErrorCleanup), so exceeding the limit does not leave orphaned
files on disk.

Best regards,
Shawn


Attachments:

  [application/x-patch] v2-0001-Add-logical_decoding_spill_limit-GUC-to-cap-spill-file-limit.patch (14.0K, 3-v2-0001-Add-logical_decoding_spill_limit-GUC-to-cap-spill-file-limit.patch)
  download | inline diff:
From 3693062b63acb189f657c819c1df0f55fdffa118 Mon Sep 17 00:00:00 2001
From: Shawn Wang <[email protected]>
Date: Mon, 23 Mar 2026 20:46:08 +0800
Subject: [PATCH v2] Add logical_decoding_spill_limit GUC to cap spill file
 disk usage.

When logical decoding exceeds logical_decoding_work_mem, changes are
spilled to disk under pg_replslot/<slot>/.  Currently there is no
limit on the total size of these spill files, which can lead to
uncontrolled disk space consumption for large or numerous concurrent
transactions.

Add a new GUC logical_decoding_spill_limit (integer, unit kB,
default 0 meaning unlimited) that limits the total on-disk spill file
size per replication slot.  The implementation tracks the current
on-disk footprint in ReorderBuffer.spillBytesOnDisk (incremented
after each successful write in ReorderBufferSerializeChange,
decremented when spill files are removed in
ReorderBufferRestoreCleanup) and per-transaction in
ReorderBufferTXN.serialized_size.

Before serializing a transaction to disk, ReorderBufferCheckMemoryLimit
checks whether the projected write would exceed the configured limit.
If so, it raises an ERROR.  The replication slot state (restart_lsn,
confirmed_flush) is preserved, so the subscriber can reconnect after
the administrator increases the limit, raises logical_decoding_work_mem,
or switches to a streaming-capable output plugin.

The check is only on the spill-to-disk path (not the streaming path,
which involves no disk I/O) and only when the limit is set (> 0), so
there is zero overhead on the normal hot path and full backward
compatibility with existing configurations.

Additionally, ensure spill files are cleaned up during error recovery.
Export ReorderBufferCleanupSerializedTXNs() and call it in
WalSndErrorCleanup() before releasing the replication slot, so that
spill files left behind by the ERROR are properly removed.  Without
this, the error path would release the slot (setting MyReplicationSlot
to NULL) before any chance to clean up, leaving orphaned spill files
on disk until the next walsender connects to the same slot or the
server restarts.
---
 doc/src/sgml/config.sgml                      | 24 ++++++++
 .../replication/logical/reorderbuffer.c       | 60 ++++++++++++++++++-
 src/backend/replication/walsender.c           | 12 ++++
 src/backend/utils/misc/guc_parameters.dat     | 10 ++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/reorderbuffer.h       | 18 ++++++
 6 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fbd3..e6fc9653fef 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2120,6 +2120,30 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-logical-decoding-spill-limit" xreflabel="logical_decoding_spill_limit">
+      <term><varname>logical_decoding_spill_limit</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>logical_decoding_spill_limit</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum total size of on-disk spill files that logical
+        decoding may use per replication slot. When changes exceed the memory
+        limit set by <varname>logical_decoding_work_mem</varname>, they are
+        spilled to disk. This parameter limits the total amount of disk space
+        used by those spill files. If the limit would be exceeded, the
+        walsender process raises an error and disconnects. The replication
+        slot's state is preserved, so the subscriber can reconnect after the
+        administrator increases this limit or switches to a streaming-capable
+        output plugin.
+        If this value is specified without units, it is taken as kilobytes.
+        The default value is <literal>0</literal>, which means there is no
+        limit on spill file disk usage.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-timestamp-buffers" xreflabel="commit_timestamp_buffers">
       <term><varname>commit_timestamp_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 682d13c9f22..95e5b0f0507 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -72,6 +72,15 @@
  *	  counter is updated; however transactions with size 0 are not stored in
  *	  the heap, because they have no changes to evict.
  *
+ *	  To prevent spill files from consuming unbounded disk space, the
+ *	  logical_decoding_spill_limit GUC can be set to limit the total size
+ *	  of spill files per replication slot.  We track the current on-disk
+ *	  footprint in ReorderBuffer.spillBytesOnDisk (incremented on each
+ *	  successful write, decremented when spill files are cleaned up) and
+ *	  per-transaction in ReorderBufferTXN.serialized_size.  Before
+ *	  serializing a transaction, we check whether the projected write would
+ *	  exceed the configured limit and raise an ERROR if so.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -224,6 +233,7 @@ typedef struct ReorderBufferDiskChange
  * like.
  */
 int			logical_decoding_work_mem;
+int			logical_decoding_spill_limit;
 static const Size max_changes_in_memory = 4096; /* XXX for restore only */
 
 /* GUC variable */
@@ -276,7 +286,7 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									 bool txn_prepared);
 static void ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static bool ReorderBufferCheckAndTruncateAbortedTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
+/* ReorderBufferCleanupSerializedTXNs is declared in reorderbuffer.h */
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
 static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
@@ -3885,6 +3895,10 @@ ReorderBufferLargestStreamableTopTXN(ReorderBuffer *rb)
  * If debug_logical_replication_streaming is set to "immediate", stream or
  * serialize the changes immediately.
  *
+ * When spilling to disk, if logical_decoding_spill_limit is set (> 0),
+ * we check whether the projected write would exceed the configured limit
+ * and raise an ERROR if so, to prevent unbounded disk usage.
+ *
  * XXX At this point we select the transactions until we reach under the memory
  * limit, but we might also adapt a more elaborate eviction strategy - for example
  * evicting enough transactions to free certain fraction (e.g. 50%) of the memory
@@ -3964,6 +3978,36 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 			if (ReorderBufferCheckAndTruncateAbortedTXN(rb, txn))
 				continue;
 
+			/*
+			 * Check the spill-to-disk size limit before actually serializing.
+			 * We use the transaction's in-memory size as an estimate of how
+			 * much will be written, which is a reasonable approximation.
+			 *
+			 * We only check when logical_decoding_spill_limit is set (> 0).
+			 * When the limit would be exceeded, raise an ERROR.  The
+			 * walsender will exit, but the replication slot's restart_lsn is
+			 * preserved so the subscriber can reconnect after the DBA
+			 * increases the limit or switches to a streaming-capable output
+			 * plugin.
+			 */
+			if (logical_decoding_spill_limit > 0 &&
+				rb->spillBytesOnDisk + txn->size >
+				(Size) logical_decoding_spill_limit * 1024)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+						 errmsg("logical decoding spill file size limit exceeded"),
+						 errdetail("Current on-disk spill size is %zu bytes, "
+								   "transaction to spill is %zu bytes, "
+								   "limit is %d kB.",
+								   rb->spillBytesOnDisk, txn->size,
+								   logical_decoding_spill_limit),
+						 errhint("Consider increasing %s, %s, "
+								 "or using a streaming-capable output plugin.",
+								 "logical_decoding_spill_limit",
+								 "logical_decoding_work_mem")));
+			}
+
 			ReorderBufferSerializeTXN(rb, txn);
 		}
 
@@ -4286,6 +4330,13 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * Update the on-disk spill size accounting for both the transaction and
+	 * the reorder buffer.  This is used to enforce logical_decoding_spill_limit.
+	 */
+	txn->serialized_size += ondisk->size;
+	rb->spillBytesOnDisk += ondisk->size;
+
 	/*
 	 * Keep the transaction's final_lsn up to date with each change we send to
 	 * disk, so that ReorderBufferRestoreCleanup works correctly.  (We used to
@@ -4872,13 +4923,18 @@ ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn)
 					(errcode_for_file_access(),
 					 errmsg("could not remove file \"%s\": %m", path)));
 	}
+
+	/* Update the on-disk spill size accounting. */
+	Assert(rb->spillBytesOnDisk >= txn->serialized_size);
+	rb->spillBytesOnDisk -= txn->serialized_size;
+	txn->serialized_size = 0;
 }
 
 /*
  * Remove any leftover serialized reorder buffers from a slot directory after a
  * prior crash or decoding session exit.
  */
-static void
+void
 ReorderBufferCleanupSerializedTXNs(const char *slotname)
 {
 	DIR		   *spill_dir;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 08253103cb3..2ca99b31c5b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -356,6 +356,18 @@ WalSndErrorCleanup(void)
 	if (xlogreader != NULL && xlogreader->seg.ws_file >= 0)
 		wal_segment_close(xlogreader);
 
+	/*
+	 * Clean up spill files before releasing the slot, because
+	 * ReorderBufferCleanupSerializedTXNs() needs the slot name to locate the
+	 * directory.  We use the static logical_decoding_ctx to determine whether
+	 * we are in a logical decoding session.
+	 */
+	if (logical_decoding_ctx != NULL && MyReplicationSlot != NULL)
+	{
+		ReorderBufferCleanupSerializedTXNs(NameStr(MyReplicationSlot->data.name));
+		logical_decoding_ctx = NULL;
+	}
+
 	if (MyReplicationSlot != NULL)
 		ReplicationSlotRelease();
 
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0c9854ad8fc..3168d23f387 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1849,6 +1849,16 @@
   boot_val => 'false',
 },
 
+{ name => 'logical_decoding_spill_limit', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
+  short_desc => 'Sets the maximum total size of spill files for logical decoding.',
+  long_desc => 'Limits the total disk space used by spill files per replication slot during logical decoding. 0 means no limit.',
+  flags => 'GUC_UNIT_KB',
+  variable => 'logical_decoding_spill_limit',
+  boot_val => '0',
+  min => '0',
+  max => 'MAX_KILOBYTES',
+},
+
 { name => 'logical_decoding_work_mem', type => 'int', context => 'PGC_USERSET', group => 'RESOURCES_MEM',
   short_desc => 'Sets the maximum memory to be used for logical decoding.',
   long_desc => 'This much memory can be used by each internal reorder buffer before spilling to disk.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4abe6c0077..04b4fb73737 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -149,6 +149,8 @@
 #maintenance_work_mem = 64MB            # min 64kB
 #autovacuum_work_mem = -1               # min 64kB, or -1 to use maintenance_work_mem
 #logical_decoding_work_mem = 64MB       # min 64kB
+#logical_decoding_spill_limit = 0       # limits spill file disk usage per slot;
+                                        # 0 means no limit
 #max_stack_depth = 2MB                  # min 100kB
 #shared_memory_type = mmap              # the default is the first option
                                         # supported by the operating system:
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 2d717a9e152..6158643571f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -25,6 +25,7 @@
 
 /* GUC variables */
 extern PGDLLIMPORT int logical_decoding_work_mem;
+extern PGDLLIMPORT int logical_decoding_spill_limit;
 extern PGDLLIMPORT int debug_logical_replication_streaming;
 
 /* possible values for debug_logical_replication_streaming */
@@ -461,6 +462,14 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
+	/*
+	 * Size of this transaction's changes currently serialized to disk (in
+	 * bytes).  Tracked per-transaction so that we can accurately update the
+	 * ReorderBuffer-level spillBytesOnDisk counter when spill files are
+	 * cleaned up.
+	 */
+	Size		serialized_size;
+
 	/*
 	 * Private data pointer of the output plugin.
 	 */
@@ -685,6 +694,14 @@ struct ReorderBuffer
 	int64		spillCount;		/* spill-to-disk invocation counter */
 	int64		spillBytes;		/* amount of data spilled to disk */
 
+	/*
+	 * Current total size of spill files on disk for this reorder buffer (in
+	 * bytes).  Unlike spillBytes which is a cumulative statistic counter, this
+	 * tracks the actual on-disk footprint right now and is decremented when
+	 * spill files are cleaned up.  Used to enforce logical_decoding_spill_limit.
+	 */
+	Size		spillBytesOnDisk;
+
 	/* Statistics about transactions streamed to the decoding output plugin */
 	int64		streamTxns;		/* number of transactions streamed */
 	int64		streamCount;	/* streaming invocation counter */
@@ -782,5 +799,6 @@ extern uint32 ReorderBufferGetInvalidations(ReorderBuffer *rb,
 											SharedInvalidationMessage **msgs);
 
 extern void StartupReorderBuffer(void);
+extern void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 
 #endif
-- 
2.43.7



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* RE: Add logical_decoding_spill_limit to cap spill file disk usage per slot
  2026-03-23 13:19 Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
@ 2026-03-26 05:33 ` Hayato Kuroda (Fujitsu) <[email protected]>
  2026-04-03 16:11   ` Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
  2 siblings, 1 reply; 6+ messages in thread

From: Hayato Kuroda (Fujitsu) @ 2026-03-26 05:33 UTC (permalink / raw)
  To: 'shawn wang' <[email protected]>; +Cc: [email protected] <[email protected]>

Dear Shawn,

> We operate a fleet of PostgreSQL instances with logical replication. On several
> occasions, we have experienced production incidents where logical decoding spill
> files (pg_replslot/<slot>/xid-*.spill) grew uncontrollably — consuming tens of
> gigabytes and eventually filling up the data disk. This caused the entire instance
> to go read-only, impacting not just replication but all write workloads.

We have provided the subscription option streaming=parallel since PG16. It
replicates on-going transactions and applies immediately. Does it avoid the
issue?

> 5. Behavior on limit exceeded: An ERROR is raised with ERRCODE_CONFIGURATION_LIMIT_EXCEEDED.
> The walsender exits, but the slot's restart_lsn and confirmed_flush are preserved.
> The subscriber can reconnect after the DBA:

Not sure, but doesn't it mean the error is repeating till the GUC is increased?
Also, is there any difference for the slots's behavior, with the normal walsender's
exit case?

Best regards,
Hayato Kuroda
FUJITSU LIMITED



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Add logical_decoding_spill_limit to cap spill file disk usage per slot
  2026-03-23 13:19 Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
  2026-03-26 05:33 ` RE: Add logical_decoding_spill_limit to cap spill file disk usage per slot Hayato Kuroda (Fujitsu) <[email protected]>
@ 2026-04-03 16:11   ` shawn wang <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: shawn wang @ 2026-04-03 16:11 UTC (permalink / raw)
  To: Hayato Kuroda (Fujitsu) <[email protected]>; +Cc: [email protected] <[email protected]>

Hi Kuroda,

Thank you for the review and the great questions!

> We have provided the subscription option streaming=parallel since PG16. It
> replicates on-going transactions and applies immediately. Does it avoid
the
> issue?

streaming=parallel does significantly reduce publisher-side spill files
in the common case — when enabled, the reorder buffer streams changes
directly instead of spilling to disk.

However, it cannot guarantee 100% avoidance of spilling.  There are
several fallback scenarios in the code where streaming is not possible
and the reorder buffer falls back to spill-to-disk even when
streaming=parallel is configured:

  1. Snapshot not yet consistent (snapbuild.c — SnapBuildCurrentState()
     < SNAPBUILD_CONSISTENT), e.g. right after slot creation.

  2. Transaction is being re-decoded after a restart
     (SnapBuildXactNeedsSkip() returns true).

  3. Transaction contains TOAST partial changes
     (rbtxn_has_partial_change), which cannot be streamed.

  4. Transaction contains speculative inserts (INSERT ... ON CONFLICT),
     also flagged as partial changes.

  5. Transaction has no streamable changes yet
     (!rbtxn_has_streamable_change).

  6. Output plugin does not support streaming callbacks
     (e.g. test_decoding without the streaming option).

  7. Parallel apply worker is busy for >10 seconds — the leader falls
     back to serializing changes to disk
     (applyparallelworker.c, SHM_SEND_TIMEOUT_MS).

  8. No parallel worker available — the leader serializes the entire
     streamed transaction to disk (worker.c,
     get_transaction_apply_action → TRANS_LEADER_SERIALIZE).

Additionally, streaming is a *subscription-level* parameter that only
applies to built-in logical replication.  Users of pg_recvlogical or
third-party CDC tools (Debezium, etc.) consume changes directly from
the publisher's walsender and have no subscription to configure.

So streaming=parallel and logical_decoding_spill_limit are
complementary: streaming reduces spilling in the common case, while
the spill limit provides a hard safety net for the cases where
spilling is unavoidable.

> Not sure, but doesn't it mean the error is repeating till the GUC is
increased?

Good question.  Yes, if the same large transaction is re-decoded
without any configuration change, the same ERROR will occur again.
This is intentional — the behavior is analogous to temp_file_limit:
once the limit is hit, the operation fails, and it will keep failing
until the DBA takes action.

The DBA has several options to resolve it:

  - Increase logical_decoding_spill_limit.
  - Increase logical_decoding_work_mem (so less data is spilled).
  - Enable streaming on the subscriber (streaming=on or
    streaming=parallel), which avoids spilling in most cases.
  - Investigate and address the root cause (e.g. break up the
    large transaction).

The ERROR message includes the current spill size and the configured
limit, making it straightforward to diagnose.

> Also, is there any difference for the slots's behavior, with the normal
walsender's
> exit case?

No, the slot behavior is the same as a normal walsender exit.
Specifically:

  - The slot remains valid (it is NOT invalidated).
  - restart_lsn and confirmed_flush are preserved.
  - The subscriber can reconnect and resume from where it left off.
  - In v2, spill files are properly cleaned up in the error path
    (via WalSndErrorCleanup), so no orphaned files are left behind.

The only difference is that the walsender's exit reason is logged as
an ERROR with ERRCODE_CONFIGURATION_LIMIT_EXCEEDED, rather than a
normal shutdown.  The slot itself is in exactly the same state as if
the walsender had exited normally or the connection was dropped.

Best regards,
Shawn

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-04-03 16:14 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-23 13:19 Add logical_decoding_spill_limit to cap spill file disk usage per slot shawn wang <[email protected]>
2026-03-23 13:23 ` shawn wang <[email protected]>
2026-03-23 14:24 ` Bharath Rupireddy <[email protected]>
2026-04-03 16:14   ` shawn wang <[email protected]>
2026-03-26 05:33 ` Hayato Kuroda (Fujitsu) <[email protected]>
2026-04-03 16:11   ` shawn wang <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox