public inbox for [email protected]
help / color / mirror / Atom feedRe: Streaming replication and WAL archive interactions
5+ messages / 4 participants
[nested] [flat]
* Re: Streaming replication and WAL archive interactions
@ 2026-02-12 06:56 Andrey Borodin <[email protected]>
2026-02-20 20:57 ` Re: Streaming replication and WAL archive interactions Harinath Kanchu <[email protected]>
2026-03-03 10:43 ` Re: Streaming replication and WAL archive interactions Jaroslav Novikov <[email protected]>
2026-03-03 12:06 ` Re: Streaming replication and WAL archive interactions Jaroslav Novikov <[email protected]>
2026-05-03 22:50 ` Re: Streaming replication and WAL archive interactions Grigory Smolkin <[email protected]>
0 siblings, 4 replies; 5+ messages in thread
From: Andrey Borodin @ 2026-02-12 06:56 UTC (permalink / raw)
To: [email protected]; +Cc: Michael Paquier <[email protected]>; Robert Haas <[email protected]>; Venkata Balaji N <[email protected]>; Andres Freund <[email protected]>; Fujii Masao <[email protected]>; Borodin Vladimir <[email protected]>; pgsql-hackers; [email protected]; Roman Khapov <[email protected]>; Kirill Reshke <[email protected]>; [email protected]
> On 11 May 2015, at 21:00, Heikki Linnakangas <[email protected]> wrote:
>
> Applied that part.
>
>> Now that we got this last-partial-segment problem out of the way, I'm
>> going to try fixing the problem you (Michael) pointed out about relying
>> on pgstat file. Meanwhile, I'd love to get more feedback on the rest of
>> the patch, and the documentation.
>
> And here is a new version of the patch. I kept the approach of using pgstat, but it now only polls pgstat every 10 seconds, and doesn't block to wait for updated stats.
Hi Heikki,
There’s a nearby thread [0] (about 10 years later) where I’m working on a problem your patch from this thread helps solve.
In datacenter large outages, 1–2% of clusters end up with gaps in their PITR timeline.
In HA setups, when the primary is lost, some WAL can be missing from the archive even though it was streamed to the standby. Many HA tools (PGConsul, Patroni, etc.) try to re-archive from the standby, but those WAL files may already have been removed.
Your “shared” archive mode addresses this: the standby keeps WAL until it’s archived. archive_mode=always plus an archive tool can work, but it’s expensive. In WAL-G, for example, the archive command does a GET on the standby’s WAL, then decrypts and compares. Switching to HEAD would reduce cost in some clouds but still adds cost.
Another option is coordinating archiving outside Postgres, but that would mean building distributed coordination into the archive tool.
Shared archive mode tackles this in Postgres itself.
I’ve retrofitted your patch, incorporated ideas from the Greenplum work [1], and made some improvements.
The patchset has three parts:
* Rebase + tests – Your original patch, rebased, with tests added.
* Timeline switching – Correct handling of timeline switches in archive status updates.
* Avoid directory scans – Skip scanning archive_status when possible, which was costly in WAL-G setups.
What do you think?
Best regards, Andrey Borodin.
Attachments:
[application/octet-stream] v4-0001-Add-archive_mode-shared-for-coordinated-WAL-archi.patch (34.5K, 2-v4-0001-Add-archive_mode-shared-for-coordinated-WAL-archi.patch)
download | inline diff:
From 334637b17d6ea1f93bb10966a064eb70ca472db9 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <[email protected]>
Date: Tue, 10 Feb 2026 12:47:32 +0500
Subject: [PATCH v4 1/3] Add archive_mode=shared for coordinated WAL archiving
Introduce a new archive_mode setting "shared" to prevent WAL history
loss during standby promotion in HA streaming replication setups.
In shared mode, the primary proactively sends archival status updates
to standbys via the replication protocol. The standby creates .ready
files for received WAL segments but defers marking them as .done until
the primary confirms archival. This prevents WAL from being recycled
before it's safely archived, addressing a critical gap in PITR continuity
during failover.
Key implementation details:
- Primary periodically sends last archived WAL segment via new
PqReplMsg_ArchiveStatusReport ('a') message
- Standby marks all segments <= reported segment as .done using
alphanumeric comparison on segment part (timeline-safe)
- Archiver skips during recovery in shared mode, activates on promotion
- Cascading replication: each standby coordinates with immediate upstream
- Startup check rejects archive_mode=on during recovery
This "push" design (primary sends status) is more efficient than "pull"
(standby queries per-segment), avoiding directory scans and stat() calls.
Based on Heikki Linnakangas's 2014 design and Greenplum's production
implementation, modernized for PostgreSQL 19.
Includes TAP tests covering basic synchronization, promotion,
cascading replication, and multiple standbys scenarios.
---
doc/src/sgml/config.sgml | 36 ++-
doc/src/sgml/high-availability.sgml | 72 ++++--
src/backend/access/transam/xlog.c | 1 +
src/backend/postmaster/pgarch.c | 17 +-
src/backend/replication/walreceiver.c | 146 +++++++++++-
src/backend/replication/walsender.c | 93 ++++++++
src/include/access/xlog.h | 1 +
src/include/libpq/protocol.h | 1 +
src/test/recovery/t/050_archive_shared.pl | 270 ++++++++++++++++++++++
9 files changed, 599 insertions(+), 38 deletions(-)
create mode 100644 src/test/recovery/t/050_archive_shared.pl
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 37342986969..5a6af71d2f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3845,14 +3845,36 @@ include_dir 'conf.d'
are sent to archive storage by setting
<xref linkend="guc-archive-command"/> or
<xref linkend="guc-archive-library"/>. In addition to <literal>off</literal>,
- to disable, there are two modes: <literal>on</literal>, and
- <literal>always</literal>. During normal operation, there is no
- difference between the two modes, but when set to <literal>always</literal>
- the WAL archiver is enabled also during archive recovery or standby
- mode. In <literal>always</literal> mode, all files restored from the archive
- or streamed with streaming physical replication will be archived (again). See
- <xref linkend="continuous-archiving-in-standby"/> for details.
+ to disable, there are three modes: <literal>on</literal>, <literal>shared</literal>,
+ and <literal>always</literal>. During normal operation as a primary, there is no
+ difference between the three modes, but they differ during archive recovery or
+ standby mode:
</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>on</literal>: Archives WAL only when running as a primary.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>shared</literal>: Coordinates archiving between primary and standby.
+ The standby defers WAL archival and deletion until the primary confirms
+ archival via streaming replication. This prevents WAL history loss during
+ standby promotion in high availability setups. Upon promotion, the standby
+ automatically starts archiving any remaining unarchived WAL. This mode works
+ with cascading replication, where each standby coordinates with its immediate
+ upstream server. See <xref linkend="continuous-archiving-in-standby"/> for details.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>always</literal>: Archives all WAL independently, even during recovery.
+ All files restored from the archive or streamed with streaming physical
+ replication will be archived (again), regardless of their source.
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
<varname>archive_mode</varname> is a separate setting from
<varname>archive_command</varname> and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c3f269e0364..8f1a4d6784c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1447,35 +1447,61 @@ postgres=# WAIT FOR LSN '0/306EE20';
</indexterm>
<para>
- When continuous WAL archiving is used in a standby, there are two
- different scenarios: the WAL archive can be shared between the primary
- and the standby, or the standby can have its own WAL archive. When
- the standby has its own WAL archive, set <varname>archive_mode</varname>
+ When continuous WAL archiving is used in a standby, there are three
+ different scenarios: the standby can have its own independent WAL archive,
+ the WAL archive can be shared between the primary and standby, or archiving
+ can be coordinated between them.
+ </para>
+
+ <para>
+ For an independent archive, set <varname>archive_mode</varname>
to <literal>always</literal>, and the standby will call the archive
command for every WAL segment it receives, whether it's by restoring
- from the archive or by streaming replication. The shared archive can
- be handled similarly, but the <varname>archive_command</varname> or <varname>archive_library</varname> must
- test if the file being archived exists already, and if the existing file
- has identical contents. This requires more care in the
- <varname>archive_command</varname> or <varname>archive_library</varname>, as it must
- be careful to not overwrite an existing file with different contents,
- but return success if the exactly same file is archived twice. And
- all that must be done free of race conditions, if two servers attempt
- to archive the same file at the same time.
+ from the archive or by streaming replication.
+ </para>
+
+ <para>
+ For a shared archive where both primary and standby can write, use
+ <literal>always</literal> mode as well, but the <varname>archive_command</varname>
+ or <varname>archive_library</varname> must test if the file being archived
+ exists already, and if the existing file has identical contents. This requires
+ more care in the <varname>archive_command</varname> or <varname>archive_library</varname>,
+ as it must be careful to not overwrite an existing file with different contents,
+ but return success if the exactly same file is archived twice. And all that must
+ be done free of race conditions, if two servers attempt to archive the same file
+ at the same time.
+ </para>
+
+ <para>
+ For coordinated archiving in high availability setups, use
+ <varname>archive_mode</varname>=<literal>shared</literal>. In this mode, only
+ the primary archives WAL segments. The standby creates <literal>.ready</literal>
+ files for received segments but defers actual archiving. The primary periodically
+ sends archival status updates to the standby via streaming replication, informing
+ it which segments have been archived. The standby then marks these as archived
+ and allows them to be recycled. Upon promotion, the standby automatically starts
+ archiving any remaining WAL segments that weren't confirmed as archived by the
+ former primary. This prevents WAL history loss during failover while avoiding
+ the complexity of coordinating concurrent archiving. This mode works with cascading
+ replication, where each standby coordinates with its immediate upstream server.
</para>
<para>
If <varname>archive_mode</varname> is set to <literal>on</literal>, the
- archiver is not enabled during recovery or standby mode. If the standby
- server is promoted, it will start archiving after the promotion, but
- will not archive any WAL or timeline history files that
- it did not generate itself. To get a complete
- series of WAL files in the archive, you must ensure that all WAL is
- archived, before it reaches the standby. This is inherently true with
- file-based log shipping, as the standby can only restore files that
- are found in the archive, but not if streaming replication is enabled.
- When a server is not in recovery mode, there is no difference between
- <literal>on</literal> and <literal>always</literal> modes.
+ archiver is not enabled during recovery or standby mode, and this setting
+ cannot be used on a standby. If a standby with <literal>archive_mode</literal>
+ set to <literal>on</literal> is promoted, it will start archiving after the
+ promotion, but will not archive any WAL or timeline history files that it did
+ not generate itself. To get a complete series of WAL files in the archive, you
+ must ensure that all WAL is archived before it reaches the standby. This is
+ inherently true with file-based log shipping, as the standby can only restore
+ files that are found in the archive, but not if streaming replication is enabled.
+ </para>
+
+ <para>
+ When a server is not in recovery mode, <literal>on</literal>,
+ <literal>shared</literal>, and <literal>always</literal> modes all behave
+ identically, archiving completed WAL segments.
</para>
</sect2>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13ec6225b85..a751950b7cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -195,6 +195,7 @@ const struct config_enum_entry archive_mode_options[] = {
{"always", ARCHIVE_MODE_ALWAYS, false},
{"on", ARCHIVE_MODE_ON, false},
{"off", ARCHIVE_MODE_OFF, false},
+ {"shared", ARCHIVE_MODE_SHARED, false},
{"true", ARCHIVE_MODE_ON, true},
{"false", ARCHIVE_MODE_OFF, true},
{"yes", ARCHIVE_MODE_ON, true},
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 82731e452fc..0433126150c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -385,6 +385,15 @@ pgarch_ArchiverCopyLoop(void)
{
char xlog[MAX_XFN_CHARS + 1];
+ /*
+ * In shared archive mode during recovery, the archiver doesn't archive
+ * files. The primary is responsible for archiving, and the walreceiver
+ * marks files as .done when the primary confirms archival. After
+ * promotion, the archiver starts working normally.
+ */
+ if (XLogArchiveMode == ARCHIVE_MODE_SHARED && RecoveryInProgress())
+ return;
+
/* force directory scan in the first call to pgarch_readyXlog() */
arch_files->arch_files_size = 0;
@@ -475,10 +484,10 @@ pgarch_ArchiverCopyLoop(void)
continue;
}
- if (pgarch_archiveXlog(xlog))
- {
- /* successful */
- pgarch_archiveDone(xlog);
+ if (pgarch_archiveXlog(xlog))
+ {
+ /* successful */
+ pgarch_archiveDone(xlog);
/*
* Tell the cumulative stats system about the WAL file that we
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 10e64a7d1f4..ed0edd258bb 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -132,6 +132,11 @@ static TimestampTz wakeup[NUM_WALRCV_WAKEUPS];
static StringInfoData reply_message;
+/* Last archived WAL segment file reported by the primary */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
+static TimeLineID primary_last_archived_tli = 0;
+static XLogSegNo primary_last_archived_segno = 0;
+
/* Prototypes for private functions */
static void WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last);
static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI);
@@ -145,6 +150,7 @@ static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
static void XLogWalRcvSendReply(bool force, bool requestReply);
static void XLogWalRcvSendHSFeedback(bool immed);
static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessArchivalReport(void);
static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -888,6 +894,30 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
XLogWalRcvSendReply(true, false);
break;
}
+ case PqReplMsg_ArchiveStatusReport:
+ {
+ /* Check that the filename looks valid */
+ if (len >= sizeof(primary_last_archived))
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg_internal("invalid archival report message with length %d",
+ (int) len)));
+
+ memcpy(primary_last_archived, buf, len);
+ primary_last_archived[len] = '\0';
+
+ /* Verify it contains only valid characters */
+ if (strspn(buf, VALID_XFN_CHARS) != len)
+ {
+ primary_last_archived[0] = '\0';
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg_internal("unexpected character in primary's last archived filename")));
+ }
+
+ ProcessArchivalReport();
+ break;
+ }
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1095,12 +1125,39 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
/*
* Create .done file forcibly to prevent the streamed segment from being
- * archived later.
+ * archived later, unless archive_mode is 'always' or 'shared'.
+ *
+ * In 'always' mode, the standby archives independently.
+ *
+ * In 'shared' mode, we optimize by checking if this segment is already
+ * covered by the last archival report from the primary. If so, create
+ * .done directly. Otherwise, create .ready and wait for the next report.
*/
- if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
- XLogArchiveForceDone(xlogfname);
- else
+ if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+ {
XLogArchiveNotify(xlogfname);
+ }
+ else if (XLogArchiveMode == ARCHIVE_MODE_SHARED)
+ {
+ /*
+ * In shared mode, check if this segment is already archived on primary.
+ * If we're on the same timeline and this segment is <= last archived,
+ * mark it .done immediately. Otherwise create .ready.
+ */
+ if (primary_last_archived_tli == recvFileTLI &&
+ recvSegNo <= primary_last_archived_segno)
+ {
+ XLogArchiveForceDone(xlogfname);
+ }
+ else
+ {
+ XLogArchiveNotify(xlogfname);
+ }
+ }
+ else
+ {
+ XLogArchiveForceDone(xlogfname);
+ }
recvFile = -1;
}
@@ -1277,6 +1334,87 @@ XLogWalRcvSendHSFeedback(bool immed)
primary_has_standby_xmin = false;
}
+/*
+ * Process archival report from primary.
+ *
+ * The primary sends us the last WAL segment it has archived. We scan the
+ * archive_status directory for .ready files and mark segments on the same
+ * timeline as .done if they're <= the reported segment.
+ */
+static void
+ProcessArchivalReport(void)
+{
+ TimeLineID reported_tli;
+ XLogSegNo reported_segno;
+ DIR *status_dir;
+ struct dirent *status_de;
+ char status_path[MAXPGPATH];
+
+ elog(DEBUG2, "received archival report from primary: %s",
+ primary_last_archived);
+
+ /* Parse the reported WAL filename */
+ if (!IsXLogFileName(primary_last_archived))
+ {
+ elog(DEBUG2, "invalid WAL filename in archival report: %s",
+ primary_last_archived);
+ return;
+ }
+
+ XLogFromFileName(primary_last_archived, &reported_tli, &reported_segno,
+ wal_segment_size);
+
+ /* Remember the last archived segment for XLogWalRcvClose() */
+ primary_last_archived_tli = reported_tli;
+ primary_last_archived_segno = reported_segno;
+
+ /* Scan archive_status directory for .ready files */
+ snprintf(status_path, MAXPGPATH, XLOGDIR "/archive_status");
+ status_dir = AllocateDir(status_path);
+ if (status_dir == NULL)
+ {
+ elog(DEBUG2, "could not open archive_status directory: %m");
+ return;
+ }
+
+ while ((status_de = ReadDir(status_dir, status_path)) != NULL)
+ {
+ char *ready_suffix;
+ char walfile[MAXPGPATH];
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+ /* Look for .ready files only */
+ ready_suffix = strstr(status_de->d_name, ".ready");
+ if (ready_suffix == NULL || ready_suffix[6] != '\0')
+ continue;
+
+ /* Extract WAL filename (remove .ready suffix) */
+ strlcpy(walfile, status_de->d_name, ready_suffix - status_de->d_name + 1);
+
+ /* Parse the WAL filename */
+ if (!IsXLogFileName(walfile))
+ continue;
+
+ XLogFromFileName(walfile, &file_tli, &file_segno, wal_segment_size);
+
+ /*
+ * Mark as .done if it's on the same timeline and not after the
+ * reported segment. We only process the reported timeline to avoid
+ * marking segments from parent or future timelines prematurely.
+ * XXX: Process possible TLI switches happened between status reports.
+ * For now, leave segments on previous TLIs to archive_command.
+ */
+ if (file_tli == reported_tli && file_segno <= reported_segno)
+ {
+ XLogArchiveForceDone(walfile);
+ elog(DEBUG3, "marked WAL segment %s as archived (primary archived up to %s)",
+ walfile, primary_last_archived);
+ }
+ }
+
+ FreeDir(status_dir);
+}
+
/*
* Update shared memory status upon receiving a message from primary.
*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2cde8ebc729..aa045a37d82 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -189,6 +189,17 @@ static TimestampTz last_reply_timestamp = 0;
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/*
+ * Last archived WAL file. This is fetched from pgstat periodically and sent
+ * to the standby. last_archival_report_timestamp tracks when we last sent
+ * the report to avoid excessive pgstat access.
+ */
+static char last_archived_wal[MAX_XFN_CHARS + 1];
+static TimestampTz last_archival_report_timestamp = 0;
+
+/* Interval for sending archival reports (10 seconds) */
+#define ARCHIVAL_REPORT_INTERVAL 10000
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -276,6 +287,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessStandbyPSRequestMessage(void);
+static void WalSndArchivalReport(void);
static void ProcessRepliesIfAny(void);
static void ProcessPendingWrites(void);
static void WalSndKeepalive(bool requestReply, XLogRecPtr writePtr);
@@ -2748,6 +2760,84 @@ ProcessStandbyHSFeedbackMessage(void)
}
}
+/*
+ * Send archival status report to standby.
+ *
+ * This is called periodically during physical replication to inform the
+ * standby about the last WAL segment archived by the primary. The standby
+ * can then mark segments up to that point as .done, allowing them to be
+ * recycled. This prevents WAL loss during standby promotion.
+ */
+static void
+WalSndArchivalReport(void)
+{
+ PgStat_ArchiverStats *archiver_stats;
+ TimestampTz now;
+ char *last_archived;
+
+ /* Only send reports when archive_mode=shared */
+ if (XLogArchiveMode != ARCHIVE_MODE_SHARED)
+ return;
+
+ /* Only send reports during physical streaming replication, not during backup */
+ if (MyWalSnd->kind != REPLICATION_KIND_PHYSICAL)
+ return;
+ if (MyWalSnd->state != WALSNDSTATE_CATCHUP &&
+ MyWalSnd->state != WALSNDSTATE_STREAMING)
+ return;
+
+ /*
+ * Don't send to temporary replication slots (used by pg_basebackup).
+ * Connections without slots (regular standbys) are OK.
+ */
+ if (MyReplicationSlot != NULL &&
+ MyReplicationSlot->data.persistency == RS_TEMPORARY)
+ return;
+
+ now = GetCurrentTimestamp();
+
+ /*
+ * Send report at most once per ARCHIVAL_REPORT_INTERVAL (10 seconds).
+ * This avoids excessive pgstat access.
+ */
+ if (now < TimestampTzPlusMilliseconds(last_archival_report_timestamp,
+ ARCHIVAL_REPORT_INTERVAL))
+ return;
+ last_archival_report_timestamp = now;
+ /*
+ * Get archiver statistics. We use non-blocking access to avoid delaying
+ * replication if stats collector is slow. If stats are unavailable or
+ * stale, we'll just try again at the next interval.
+ */
+ archiver_stats = pgstat_fetch_stat_archiver();
+ if (archiver_stats == NULL)
+ return;
+
+ last_archived = archiver_stats->last_archived_wal;
+ /*
+ * Only send a report if the last archived WAL has changed. This is both
+ * an optimization and ensures we don't send empty reports on startup.
+ */
+ if (strcmp(last_archived, last_archived_wal) == 0)
+ return;
+
+ /* Only send reports for WAL segments, not backup history files or other archived files */
+ if (!IsXLogFileName(last_archived))
+ return;
+
+ elog(DEBUG2, "sending archival report: %s", last_archived);
+
+ /* Remember what we sent */
+ strlcpy(last_archived_wal, last_archived, sizeof(last_archived_wal));
+
+ /* Construct the message... */
+ resetStringInfo(&output_message);
+ pq_sendbyte(&output_message, PqReplMsg_ArchiveStatusReport);
+ pq_sendbytes(&output_message, last_archived, strlen(last_archived));
+ /* ... and send it wrapped in CopyData */
+ pq_putmessage_noblock(PqMsg_CopyData, output_message.data, output_message.len);
+}
+
/*
* Process the request for a primary status update message.
*/
@@ -4227,6 +4317,9 @@ WalSndKeepaliveIfNecessary(void)
if (pq_flush_if_writable() != 0)
WalSndShutdown();
}
+
+ /* Send archival status report if needed */
+ WalSndArchivalReport();
}
/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index fdfb572467b..7b0caa5cbf6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -66,6 +66,7 @@ typedef enum ArchiveMode
ARCHIVE_MODE_OFF = 0, /* disabled */
ARCHIVE_MODE_ON, /* enabled while server is running normally */
ARCHIVE_MODE_ALWAYS, /* enabled always (even during recovery) */
+ ARCHIVE_MODE_SHARED, /* shared archive between primary and standby */
} ArchiveMode;
extern PGDLLIMPORT int XLogArchiveMode;
diff --git a/src/include/libpq/protocol.h b/src/include/libpq/protocol.h
index eae8f0e7238..d22aaf9e225 100644
--- a/src/include/libpq/protocol.h
+++ b/src/include/libpq/protocol.h
@@ -72,6 +72,7 @@
/* Replication codes sent by the primary (wrapped in CopyData messages). */
+#define PqReplMsg_ArchiveStatusReport 'a'
#define PqReplMsg_Keepalive 'k'
#define PqReplMsg_PrimaryStatusUpdate 's'
#define PqReplMsg_WALData 'w'
diff --git a/src/test/recovery/t/050_archive_shared.pl b/src/test/recovery/t/050_archive_shared.pl
new file mode 100644
index 00000000000..397b71ad79d
--- /dev/null
+++ b/src/test/recovery/t/050_archive_shared.pl
@@ -0,0 +1,270 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test archive_mode=shared for coordinated WAL archiving between primary and standby
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Path qw(rmtree);
+
+# Initialize primary node with archiving
+my $archive_dir = PostgreSQL::Test::Utils::tempdir();
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_keep_size = 128MB
+");
+$primary->start;
+
+# Create a test table and generate some WAL
+$primary->safe_psql('postgres', 'CREATE TABLE test_table (id int, data text);');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1, 500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(501, 1000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for archiver to archive segments
+$primary->poll_query_until('postgres',
+ "SELECT archived_count > 0 FROM pg_stat_archiver")
+ or die "Timed out waiting for archiver to start";
+
+my $archived_count = () = glob("$archive_dir/*");
+ok($archived_count > 0, "primary has archived WAL files to shared archive");
+note("Primary archived $archived_count files");
+
+# Take backup for standby
+my $backup_name = 'standby_backup';
+$primary->backup($backup_name);
+
+# Exclude possible race condition when backup WAL is last archived
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(501, 1000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Set up standby with archive_mode=shared
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($primary, $backup_name, has_streaming => 1);
+$standby->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby->start;
+
+# Wait for standby to catch up
+$primary->wait_for_catchup($standby);
+
+# Generate more WAL on primary (these are new segments not yet archived)
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1001, 1500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1501, 2000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for standby to receive the new WAL
+$primary->wait_for_catchup($standby);
+
+# Check that standby has .ready or .done files for the newly received segments.
+# Normally they should be .ready (not yet archived by primary), but in rare cases
+# the archiver could be very fast and an archive report sent immediately, creating
+# .done files instead. Both are correct behavior - the key is that files exist.
+my $standby_archive_status = $standby->data_dir . '/pg_wal/archive_status';
+my $status_count = 0;
+if (opendir(my $dh, $standby_archive_status))
+{
+ my @files = grep { /\.(ready|done)$/ } readdir($dh);
+ $status_count = scalar(@files);
+ my $ready_count = scalar(grep { /\.ready$/ } @files);
+ my $done_count = scalar(grep { /\.done$/ } @files);
+ note("Standby has $ready_count .ready files and $done_count .done files");
+ closedir($dh);
+}
+cmp_ok($status_count, '>', 0, "standby creates archive status files for received WAL");
+
+# Generate more WAL and wait for archiving on primary
+my $initial_archived = $primary->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'more-data' || i FROM generate_series(2001, 2500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'more-data2' || i FROM generate_series(2501, 3000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for primary to archive the new segments
+$primary->poll_query_until('postgres',
+ "SELECT archived_count > $initial_archived FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive new segments";
+
+# Wait for standby to catch up (archive status is sent during replication)
+$primary->wait_for_catchup($standby);
+
+# Wait for primary to send archival status updates and standby to process them
+# The standby should mark segments as .done after receiving archive status from primary
+my $done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $done_count = 0;
+ if (opendir(my $dh, $standby_archive_status))
+ {
+ $done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $done_count > 0;
+ sleep(1);
+}
+ok($done_count > 0, "standby marked segments as .done after primary's archival report");
+note("Standby has $done_count .done files");
+
+###############################################################################
+# Test 2: Standby promotion - verify archiver activates
+###############################################################################
+
+# Before promotion, verify archiver is not running on standby (shared mode during recovery)
+# In shared mode, the standby's archiver should not be archiving during recovery
+my $archived_before = $standby->safe_psql('postgres',
+ "SELECT archived_count FROM pg_stat_archiver");
+is($archived_before, '0',
+ "archiver not active on standby before promotion (archived_count=0)");
+
+# Verify standby is still in recovery before promoting
+my $in_recovery = $standby->safe_psql('postgres', "SELECT pg_is_in_recovery();");
+is($in_recovery, 't', "standby is in recovery before promotion");
+
+# Promote the standby
+$standby->promote;
+$standby->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery();");
+
+# Generate WAL on new primary (former standby)
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'post-promotion' || i FROM generate_series(2001, 2500) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for archiver to activate and archive the new WAL
+# Check pg_stat_archiver to verify archiving is happening
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > 0 FROM pg_stat_archiver")
+ or die "Timed out waiting for promoted standby to start archiving";
+pass("promoted standby started archiving");
+
+# Verify data integrity
+my $count = $standby->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($count >= 2500, "promoted standby has all data (got $count rows)");
+
+###############################################################################
+# Test 3: Cascading replication
+###############################################################################
+
+# Take a backup from the promoted standby (now the new primary)
+my $promoted_backup = 'promoted_backup';
+$standby->backup($promoted_backup);
+
+# Set up second-level standby (cascading from first standby, now promoted)
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup($standby, $promoted_backup, has_streaming => 1);
+$standby2->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby2->start;
+
+# Generate WAL on promoted standby (now primary for standby2)
+my $cascading_archived_before = $standby->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'cascading' || i FROM generate_series(2501, 3000) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for the promoted standby (acting as primary) to archive the new segment
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > $cascading_archived_before FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive segment in cascading test";
+
+# Wait for cascading standby to catch up
+$standby->wait_for_catchup($standby2);
+
+# Wait for cascading standby to receive archive status and mark segments as .done
+my $standby2_archive_status = $standby2->data_dir . '/pg_wal/archive_status';
+my $standby2_done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby2_done_count = 0;
+ if (opendir(my $dh, $standby2_archive_status))
+ {
+ $standby2_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby2_done_count > 0;
+ sleep(1);
+}
+ok($standby2_done_count > 0, "cascading standby marks segments as .done");
+note("Cascading standby has $standby2_done_count .done files");
+
+# Verify cascading standby has all data
+my $standby2_count = $standby2->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($standby2_count >= 3000, "cascading standby has all data (got $standby2_count rows)");
+
+###############################################################################
+# Test 4: Multiple standbys from same primary
+###############################################################################
+
+# Create third standby from promoted standby (current primary)
+my $standby3 = PostgreSQL::Test::Cluster->new('standby3');
+my $backup2 = 'multi_standby_backup';
+$standby->backup($backup2);
+$standby3->init_from_backup($standby, $backup2, has_streaming => 1);
+$standby3->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby3->start;
+
+# Generate WAL and ensure both standbys receive it
+my $standby_archived_before = $standby->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'multi' || i FROM generate_series(3001, 3500) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for the promoted standby (acting as primary) to archive the new segment
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > $standby_archived_before FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive segment in multi-standby test";
+
+$standby->wait_for_catchup($standby2);
+$standby->wait_for_catchup($standby3);
+
+# Verify both standbys eventually mark segments as .done
+my $standby3_archive_status = $standby3->data_dir . '/pg_wal/archive_status';
+
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby2_done_count = 0;
+ if (opendir(my $dh, $standby2_archive_status))
+ {
+ $standby2_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby2_done_count > 0;
+ sleep(1);
+}
+
+my $standby3_done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby3_done_count = 0;
+ if (opendir(my $dh, $standby3_archive_status))
+ {
+ $standby3_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby3_done_count > 0;
+ sleep(1);
+}
+
+ok($standby2_done_count > 0, "standby2 marks segments as .done");
+ok($standby3_done_count > 0, "standby3 marks segments as .done");
+note("standby2 has $standby2_done_count .done files, standby3 has $standby3_done_count .done files");
+
+# Verify both standbys have all data
+$standby2_count = $standby2->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+my $standby3_count = $standby3->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($standby2_count >= 3500, "standby2 has all data (got $standby2_count rows)");
+ok($standby3_count >= 3500, "standby3 has all data (got $standby3_count rows)");
+
+done_testing();
--
2.51.2
[application/octet-stream] v4-0003-Optimize-ProcessArchivalReport-to-avoid-directory.patch (9.7K, 3-v4-0003-Optimize-ProcessArchivalReport-to-avoid-directory.patch)
download | inline diff:
From 3478544f5c9603a3f7816f03d8d48390dd870be1 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <[email protected]>
Date: Wed, 11 Feb 2026 18:17:25 +0500
Subject: [PATCH v4 3/3] Optimize ProcessArchivalReport to avoid directory
scans
When archive status reports arrive sequentially on the same timeline,
directly generate expected WAL filenames and mark them as archived
instead of scanning the entire archive_status directory.
This optimization reduces overhead in the common case where the primary
continuously archives segments. Directory scan is still used when:
- Timeline changes (to handle ancestor timelines)
- First report received
- Non-sequential reports
XLogArchiveForceDone() handles all cases internally (checking if .done
exists, if .ready exists, or creating .done if neither exists), so no
pre-check is needed.
---
src/backend/replication/walreceiver.c | 196 +++++++++++++++++---------
1 file changed, 132 insertions(+), 64 deletions(-)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 1613a5f8752..cda6a5d2df2 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -137,6 +137,14 @@ static char primary_last_archived[MAX_XFN_CHARS + 1];
static TimeLineID primary_last_archived_tli = 0;
static XLogSegNo primary_last_archived_segno = 0;
+/*
+ * Last segment we successfully marked as .done. Used to optimize
+ * ProcessArchivalReport() by generating expected filenames instead
+ * of scanning the archive_status directory.
+ */
+static TimeLineID last_processed_tli = 0;
+static XLogSegNo last_processed_segno = 0;
+
/* Prototypes for private functions */
static void WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last);
static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI);
@@ -1351,10 +1359,9 @@ ProcessArchivalReport(void)
{
TimeLineID reported_tli;
XLogSegNo reported_segno;
- DIR *status_dir;
- struct dirent *status_de;
char status_path[MAXPGPATH];
- List *tli_history = NIL;
+ bool use_direct_check = false;
+ XLogSegNo start_segno;
elog(DEBUG2, "received archival report from primary: %s",
primary_last_archived);
@@ -1374,90 +1381,151 @@ ProcessArchivalReport(void)
primary_last_archived_tli = reported_tli;
primary_last_archived_segno = reported_segno;
- /* Scan archive_status directory for .ready files */
- snprintf(status_path, MAXPGPATH, XLOGDIR "/archive_status");
- status_dir = AllocateDir(status_path);
- if (status_dir == NULL)
+ /*
+ * Optimization: If the new report is on the same timeline as the last
+ * processed segment and moves forward, we can directly check for .ready
+ * files for segments between last_processed_segno and reported_segno
+ * instead of scanning the entire archive_status directory.
+ *
+ * Fall back to directory scan if:
+ * - Timeline changed (need to handle ancestor timelines)
+ * - This is the first report (last_processed_tli == 0)
+ * - Reported segment is not ahead (nothing new to process)
+ */
+ if (last_processed_tli == reported_tli &&
+ last_processed_tli != 0 &&
+ reported_segno > last_processed_segno)
{
- elog(DEBUG2, "could not open archive_status directory: %m");
- return;
+ use_direct_check = true;
+ start_segno = last_processed_segno + 1;
}
- while ((status_de = ReadDir(status_dir, status_path)) != NULL)
+ if (use_direct_check)
{
- char *ready_suffix;
- char walfile[MAXPGPATH];
- TimeLineID file_tli;
- XLogSegNo file_segno;
- /* Look for .ready files only */
- ready_suffix = strstr(status_de->d_name, ".ready");
- if (ready_suffix == NULL || ready_suffix[6] != '\0')
- continue;
-
- /* Extract WAL filename (remove .ready suffix) */
- strlcpy(walfile, status_de->d_name, ready_suffix - status_de->d_name + 1);
-
- /* Parse the WAL filename */
- if (!IsXLogFileName(walfile))
- continue;
-
- XLogFromFileName(walfile, &file_tli, &file_segno, wal_segment_size);
-
/*
- * Mark as .done if:
- * 1. Same timeline and segment <= reported segment, OR
- * 2. Ancestor timeline and segment is before the timeline switch point
- *
- * For ancestor timelines: if primary archived segment X on timeline T,
- * then all segments on ancestor timelines before the switch to T must
- * have been archived (they're required to reach timeline T).
+ * Direct check: generate filenames for expected segments.
+ * XLogArchiveForceDone() will handle the case where .ready doesn't
+ * exist or .done already exists, so no need to stat() first.
*/
- if (file_tli == reported_tli && file_segno <= reported_segno)
+ XLogSegNo segno;
+
+ for (segno = start_segno; segno <= reported_segno; segno++)
{
- /* Same timeline, segment already archived */
+ char walfile[MAXFNAMELEN];
+
+ /* Generate WAL filename and mark as archived */
+ XLogFileName(walfile, reported_tli, segno, wal_segment_size);
XLogArchiveForceDone(walfile);
elog(DEBUG3, "marked WAL segment %s as archived (primary archived up to %s)",
walfile, primary_last_archived);
+
+ /* Track the last segment we processed */
+ last_processed_tli = reported_tli;
+ last_processed_segno = segno;
+ }
+ }
+ else
+ {
+ /*
+ * Directory scan: needed when timeline changed or first report.
+ * This handles both same-timeline and ancestor-timeline cases.
+ */
+ DIR *status_dir;
+ struct dirent *status_de;
+ List *tli_history = NIL;
+
+ snprintf(status_path, MAXPGPATH, XLOGDIR "/archive_status");
+ status_dir = AllocateDir(status_path);
+ if (status_dir == NULL)
+ {
+ elog(DEBUG2, "could not open archive_status directory: %m");
+ return;
}
- else if (file_tli != reported_tli)
+
+ while ((status_de = ReadDir(status_dir, status_path)) != NULL)
{
+ char *ready_suffix;
+ char walfile[MAXPGPATH];
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Look for .ready files only */
+ ready_suffix = strstr(status_de->d_name, ".ready");
+ if (ready_suffix == NULL || ready_suffix[6] != '\0')
+ continue;
+
+ /* Extract WAL filename (remove .ready suffix) */
+ strlcpy(walfile, status_de->d_name, ready_suffix - status_de->d_name + 1);
+
+ /* Parse the WAL filename */
+ if (!IsXLogFileName(walfile))
+ continue;
+
+ XLogFromFileName(walfile, &file_tli, &file_segno, wal_segment_size);
+
/*
- * Different timeline - check if it's an ancestor and if this
- * segment is before the timeline switch point. Only read timeline
- * history if we haven't already (lazy loading).
+ * Mark as .done if:
+ * 1. Same timeline and segment <= reported segment, OR
+ * 2. Ancestor timeline and segment is before the timeline switch point
*
- * Note: Timelines form a tree structure, not a linear sequence,
- * so we can't use < or > to compare them.
+ * For ancestor timelines: if primary archived segment X on timeline T,
+ * then all segments on ancestor timelines before the switch to T must
+ * have been archived (they're required to reach timeline T).
*/
- if (tli_history == NIL)
- tli_history = readTimeLineHistory(reported_tli);
-
- if (tliInHistory(file_tli, tli_history))
+ if (file_tli == reported_tli && file_segno <= reported_segno)
+ {
+ /* Same timeline, segment already archived */
+ XLogArchiveForceDone(walfile);
+ elog(DEBUG3, "marked WAL segment %s as archived (primary archived up to %s)",
+ walfile, primary_last_archived);
+ }
+ else if (file_tli != reported_tli)
{
- XLogRecPtr switchpoint;
- XLogSegNo switchpoint_segno;
-
- /* Get the point where we switched away from this timeline */
- switchpoint = tliSwitchPoint(file_tli, tli_history, NULL);
-
/*
- * If the segment is at or before the switch point, it must have
- * been archived (it's required to reach the reported timeline).
- * The segment containing the switch point belongs to the old
- * timeline up to the switch point and should be archived.
+ * Different timeline - check if it's an ancestor and if this
+ * segment is before the timeline switch point. Only read timeline
+ * history if we haven't already (lazy loading).
+ *
+ * Note: Timelines form a tree structure, not a linear sequence,
+ * so we can't use < or > to compare them.
*/
- XLByteToSeg(switchpoint, switchpoint_segno, wal_segment_size);
- if (file_segno <= switchpoint_segno)
+ if (tli_history == NIL)
+ tli_history = readTimeLineHistory(reported_tli);
+
+ if (tliInHistory(file_tli, tli_history))
{
- XLogArchiveForceDone(walfile);
- elog(DEBUG3, "marked ancestor timeline segment %s as archived (before switch to timeline %u)",
- walfile, reported_tli);
+ XLogRecPtr switchpoint;
+ XLogSegNo switchpoint_segno;
+
+ /* Get the point where we switched away from this timeline */
+ switchpoint = tliSwitchPoint(file_tli, tli_history, NULL);
+
+ /*
+ * If the segment is at or before the switch point, it must have
+ * been archived (it's required to reach the reported timeline).
+ * The segment containing the switch point belongs to the old
+ * timeline up to the switch point and should be archived.
+ */
+ XLByteToSeg(switchpoint, switchpoint_segno, wal_segment_size);
+ if (file_segno <= switchpoint_segno)
+ {
+ XLogArchiveForceDone(walfile);
+ elog(DEBUG3, "marked ancestor timeline segment %s as archived (before switch to timeline %u)",
+ walfile, reported_tli);
+ }
}
}
}
- }
- FreeDir(status_dir);
+ FreeDir(status_dir);
+
+ /*
+ * After a full directory scan following a timeline change, update
+ * our tracking to the newly reported position for future optimizations.
+ */
+ last_processed_tli = reported_tli;
+ last_processed_segno = reported_segno;
+ }
}
/*
--
2.51.2
[application/octet-stream] v4-0002-Mark-ancestor-timeline-WAL-segments-as-archived.patch (4.0K, 4-v4-0002-Mark-ancestor-timeline-WAL-segments-as-archived.patch)
download | inline diff:
From b991c5785cffe44e2d42de3a607ddda8e64ca08d Mon Sep 17 00:00:00 2001
From: Andrey Borodin <[email protected]>
Date: Tue, 10 Feb 2026 16:45:10 +0500
Subject: [PATCH v4 2/3] Mark ancestor timeline WAL segments as archived
When standby receives archive status report, check if .ready files
belong to ancestor timelines before the switch point and mark them
as .done if already archived by primary.
---
src/backend/replication/walreceiver.c | 55 ++++++++++++++++++++++++---
1 file changed, 50 insertions(+), 5 deletions(-)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ed0edd258bb..1613a5f8752 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1143,6 +1143,11 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
* In shared mode, check if this segment is already archived on primary.
* If we're on the same timeline and this segment is <= last archived,
* mark it .done immediately. Otherwise create .ready.
+ *
+ * We don't check ancestor timeline cases here to avoid reading timeline
+ * history files on every segment close. ProcessArchivalReport() will
+ * handle marking ancestor timeline segments as .done when it scans
+ * the archive_status directory.
*/
if (primary_last_archived_tli == recvFileTLI &&
recvSegNo <= primary_last_archived_segno)
@@ -1349,6 +1354,7 @@ ProcessArchivalReport(void)
DIR *status_dir;
struct dirent *status_de;
char status_path[MAXPGPATH];
+ List *tli_history = NIL;
elog(DEBUG2, "received archival report from primary: %s",
primary_last_archived);
@@ -1398,18 +1404,57 @@ ProcessArchivalReport(void)
XLogFromFileName(walfile, &file_tli, &file_segno, wal_segment_size);
/*
- * Mark as .done if it's on the same timeline and not after the
- * reported segment. We only process the reported timeline to avoid
- * marking segments from parent or future timelines prematurely.
- * XXX: Process possible TLI switches happened between status reports.
- * For now, leave segments on previous TLIs to archive_command.
+ * Mark as .done if:
+ * 1. Same timeline and segment <= reported segment, OR
+ * 2. Ancestor timeline and segment is before the timeline switch point
+ *
+ * For ancestor timelines: if primary archived segment X on timeline T,
+ * then all segments on ancestor timelines before the switch to T must
+ * have been archived (they're required to reach timeline T).
*/
if (file_tli == reported_tli && file_segno <= reported_segno)
{
+ /* Same timeline, segment already archived */
XLogArchiveForceDone(walfile);
elog(DEBUG3, "marked WAL segment %s as archived (primary archived up to %s)",
walfile, primary_last_archived);
}
+ else if (file_tli != reported_tli)
+ {
+ /*
+ * Different timeline - check if it's an ancestor and if this
+ * segment is before the timeline switch point. Only read timeline
+ * history if we haven't already (lazy loading).
+ *
+ * Note: Timelines form a tree structure, not a linear sequence,
+ * so we can't use < or > to compare them.
+ */
+ if (tli_history == NIL)
+ tli_history = readTimeLineHistory(reported_tli);
+
+ if (tliInHistory(file_tli, tli_history))
+ {
+ XLogRecPtr switchpoint;
+ XLogSegNo switchpoint_segno;
+
+ /* Get the point where we switched away from this timeline */
+ switchpoint = tliSwitchPoint(file_tli, tli_history, NULL);
+
+ /*
+ * If the segment is at or before the switch point, it must have
+ * been archived (it's required to reach the reported timeline).
+ * The segment containing the switch point belongs to the old
+ * timeline up to the switch point and should be archived.
+ */
+ XLByteToSeg(switchpoint, switchpoint_segno, wal_segment_size);
+ if (file_segno <= switchpoint_segno)
+ {
+ XLogArchiveForceDone(walfile);
+ elog(DEBUG3, "marked ancestor timeline segment %s as archived (before switch to timeline %u)",
+ walfile, reported_tli);
+ }
+ }
+ }
}
FreeDir(status_dir);
--
2.51.2
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Streaming replication and WAL archive interactions
2026-02-12 06:56 Re: Streaming replication and WAL archive interactions Andrey Borodin <[email protected]>
@ 2026-02-20 20:57 ` Harinath Kanchu <[email protected]>
3 siblings, 0 replies; 5+ messages in thread
From: Harinath Kanchu @ 2026-02-20 20:57 UTC (permalink / raw)
To: pgsql-hackers; +Cc: [email protected]; Michael Paquier <[email protected]>; Robert Haas <[email protected]>; Venkata Balaji N <[email protected]>; Andres Freund <[email protected]>; Fujii Masao <[email protected]>; Borodin Vladimir <[email protected]>; pgsql-hackers; [email protected]; Roman Khapov <[email protected]>; Kirill Reshke <[email protected]>; [email protected]; [email protected]
Hi All,
I posted about this last year [1] -- sharing our experience
hitting WAL archival gaps in production and asking for last_archived_wal
to be surfaced on the standby via keep-alive messages.
Glad to see Andrey's patch reviving this.
If the full shared mode lands, great. But even if that's too much for
now, just exposing last_archived_wal on the standby -- in
pg_stat_wal_receiver or anywhere queryable -- would unblock everyone
solving this problem externally.
[1] https://www.postgresql.org/message-id/flat/CAO7WNRTBrPn0WU9GTimoK-0FHRynaUHa34%3DAp5puCzEipNymFg%40m...
Thanks,
Harinath
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Streaming replication and WAL archive interactions
2026-02-12 06:56 Re: Streaming replication and WAL archive interactions Andrey Borodin <[email protected]>
@ 2026-03-03 10:43 ` Jaroslav Novikov <[email protected]>
3 siblings, 0 replies; 5+ messages in thread
From: Jaroslav Novikov @ 2026-03-03 10:43 UTC (permalink / raw)
To: Andrey Borodin <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; Robert Haas <[email protected]>; Venkata Balaji N <[email protected]>; Andres Freund <[email protected]>; Fujii Masao <[email protected]>; Borodin Vladimir <[email protected]>; pgsql-hackers; [email protected]; Roman Khapov <[email protected]>; Kirill Reshke <[email protected]>; [email protected]
> On 12 Feb 2026, at 09:56, Andrey Borodin <[email protected]> wrote:
>
> Hi Heikki,
>
> There’s a nearby thread [0] (about 10 years later) where I’m working on a problem your patch from this thread helps solve.
>
> In datacenter large outages, 1–2% of clusters end up with gaps in their PITR timeline.
> In HA setups, when the primary is lost, some WAL can be missing from the archive even though it was streamed to the standby. Many HA tools (PGConsul, Patroni, etc.) try to re-archive from the standby, but those WAL files may already have been removed.
>
> Your “shared” archive mode addresses this: the standby keeps WAL until it’s archived. archive_mode=always plus an archive tool can work, but it’s expensive. In WAL-G, for example, the archive command does a GET on the standby’s WAL, then decrypts and compares. Switching to HEAD would reduce cost in some clouds but still adds cost.
>
> Another option is coordinating archiving outside Postgres, but that would mean building distributed coordination into the archive tool.
>
> Shared archive mode tackles this in Postgres itself.
>
> I’ve retrofitted your patch, incorporated ideas from the Greenplum work [1], and made some improvements.
>
> The patchset has three parts:
> * Rebase + tests – Your original patch, rebased, with tests added.
> * Timeline switching – Correct handling of timeline switches in archive status updates.
> * Avoid directory scans – Skip scanning archive_status when possible, which was costly in WAL-G setups.
>
> What do you think?
>
> Best regards, Andrey Borodin.
>
> <v4-0001-Add-archive_mode-shared-for-coordinated-WAL-archi.patch><v4-0003-Optimize-ProcessArchivalReport-to-avoid-directory.patch><v4-0002-Mark-ancestor-timeline-WAL-segments-as-archived.patch>
Hi Andrey,
Adding the missing references [0] and [1].
[0] https://www.postgresql.org/message-id/5550D20D.6090703%40iki.fi
[1] https://github.com/open-gpdb/gpdb/commit/4f2db1929df1b5eed28f33505955636096bb4e8b
Best, Jaroslav Novikov.
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Streaming replication and WAL archive interactions
2026-02-12 06:56 Re: Streaming replication and WAL archive interactions Andrey Borodin <[email protected]>
@ 2026-03-03 12:06 ` Jaroslav Novikov <[email protected]>
3 siblings, 0 replies; 5+ messages in thread
From: Jaroslav Novikov @ 2026-03-03 12:06 UTC (permalink / raw)
To: Andrey Borodin <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; Robert Haas <[email protected]>; Venkata Balaji N <[email protected]>; Andres Freund <[email protected]>; Fujii Masao <[email protected]>; Borodin Vladimir <[email protected]>; pgsql-hackers; [email protected]; Roman Khapov <[email protected]>; Kirill Reshke <[email protected]>; [email protected]
> On 12 Feb 2026, at 09:56, Andrey Borodin <[email protected]> wrote:
>
> Hi Heikki,
>
> There’s a nearby thread [0] (about 10 years later) where I’m working on a problem your patch from this thread helps solve.
>
> In datacenter large outages, 1–2% of clusters end up with gaps in their PITR timeline.
> In HA setups, when the primary is lost, some WAL can be missing from the archive even though it was streamed to the standby. Many HA tools (PGConsul, Patroni, etc.) try to re-archive from the standby, but those WAL files may already have been removed.
>
> Your “shared” archive mode addresses this: the standby keeps WAL until it’s archived. archive_mode=always plus an archive tool can work, but it’s expensive. In WAL-G, for example, the archive command does a GET on the standby’s WAL, then decrypts and compares. Switching to HEAD would reduce cost in some clouds but still adds cost.
>
> Another option is coordinating archiving outside Postgres, but that would mean building distributed coordination into the archive tool.
>
> Shared archive mode tackles this in Postgres itself.
>
> I’ve retrofitted your patch, incorporated ideas from the Greenplum work [1], and made some improvements.
>
> The patchset has three parts:
> * Rebase + tests – Your original patch, rebased, with tests added.
> * Timeline switching – Correct handling of timeline switches in archive status updates.
> * Avoid directory scans – Skip scanning archive_status when possible, which was costly in WAL-G setups.
>
> What do you think?
>
> Best regards, Andrey Borodin.
>
> <v4-0001-Add-archive_mode-shared-for-coordinated-WAL-archi.patch><v4-0003-Optimize-ProcessArchivalReport-to-avoid-directory.patch><v4-0002-Mark-ancestor-timeline-WAL-segments-as-archived.patch>
Hi Andrey,
Adding the missing references [0] and [1].
[0] https://www.postgresql.org/message-id/5550D20D.6090703%40iki.fi
[1] https://github.com/open-gpdb/gpdb/commit/4f2db1929df1b5eed28f33505955636096bb4e8b
Best, Jaroslav Novikov.
^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Streaming replication and WAL archive interactions
2026-02-12 06:56 Re: Streaming replication and WAL archive interactions Andrey Borodin <[email protected]>
@ 2026-05-03 22:50 ` Grigory Smolkin <[email protected]>
3 siblings, 0 replies; 5+ messages in thread
From: Grigory Smolkin @ 2026-05-03 22:50 UTC (permalink / raw)
To: Jaroslav Novikov <[email protected]>; Andrey Borodin <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; Robert Haas <[email protected]>; Venkata Balaji N <[email protected]>; Andres Freund <[email protected]>; Fujii Masao <[email protected]>; Borodin Vladimir <[email protected]>; pgsql-hackers; [email protected]; Roman Khapov <[email protected]>; Kirill Reshke <[email protected]>; [email protected]
Hello, hackers!
I would like to thank the community and all participants of this thread
for their interest in this problem.
In our production system with tens of thousands PostgreSQL clusters we
encounter exactly the same issue and are forced to synchronize upstreams
and downstreams via external means, which is quite suboptimal.
I`ve done some work on top of the proposed v4 version patch and would
like to present v5 version for a discussion.
There are a number of changes, such as sending just TLI and Segno
instead of full WAL filename, shifting some work into archiver and
adding shared memory for walreceiver/archiver synchronization.
There are a number of issues currently unresolved, which are worth a
discussion.
1. Should we update pg_stat_archiver on standby to support cascading
replication or should we just resend the report, received from upstream?
Personally I'm more inclined towards the pg_stat_archiver path, because
this way there will be less `if-else` programming and
archive_mode=shared behaviour will be more monitoring-friendly.
2. What should we do with *.backup.ready and *.partial.ready on standby?
Can we just XLogArchiveForceDone() them?
3. Should we keep the awkward part with switchpont calculation in
timeline switch case? I think all segments that are not in our server
history should just be stamped with XLogArchiveForceDone().
4. Currently XLogArchiveForceDone is forced either by walreceiver (on
receiving report from upstream) and archiver. Should we move this into
the archiver entirely?
Any feedback will be much appreciated.
Attachments:
[text/x-patch] v5_0001-Add-archive_mode-shared-for-coordinated-WAL-archiving.patch (39.9K, 3-v5_0001-Add-archive_mode-shared-for-coordinated-WAL-archiving.patch)
download | inline diff:
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 67da9a1de66..bab0d624cee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4046,14 +4046,36 @@ include_dir 'conf.d'
are sent to archive storage by setting
<xref linkend="guc-archive-command"/> or
<xref linkend="guc-archive-library"/>. In addition to <literal>off</literal>,
- to disable, there are two modes: <literal>on</literal>, and
- <literal>always</literal>. During normal operation, there is no
- difference between the two modes, but when set to <literal>always</literal>
- the WAL archiver is enabled also during archive recovery or standby
- mode. In <literal>always</literal> mode, all files restored from the archive
- or streamed with streaming physical replication will be archived (again). See
- <xref linkend="continuous-archiving-in-standby"/> for details.
+ to disable, there are three modes: <literal>on</literal>, <literal>shared</literal>,
+ and <literal>always</literal>. During normal operation as a primary, there is no
+ difference between the three modes, but they differ during archive recovery or
+ standby mode:
</para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>on</literal>: Archives WAL only when running as a primary.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>shared</literal>: Coordinates archiving between primary and standby.
+ The standby defers WAL archival and deletion until the primary confirms
+ archival via streaming replication. This prevents WAL history loss during
+ standby promotion in high availability setups. Upon promotion, the standby
+ automatically starts archiving any remaining unarchived WAL. This mode works
+ with cascading replication, where each standby coordinates with its immediate
+ upstream server. See <xref linkend="continuous-archiving-in-standby"/> for details.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>always</literal>: Archives all WAL independently, even during recovery.
+ All files restored from the archive or streamed with streaming physical
+ replication will be archived (again), regardless of their source.
+ </para>
+ </listitem>
+ </itemizedlist>
<para>
<varname>archive_mode</varname> is a separate setting from
<varname>archive_command</varname> and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index be8d3a5bfea..e96a60bfb4c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1449,35 +1449,61 @@ postgres=# WAIT FOR LSN '0/306EE20';
</indexterm>
<para>
- When continuous WAL archiving is used in a standby, there are two
- different scenarios: the WAL archive can be shared between the primary
- and the standby, or the standby can have its own WAL archive. When
- the standby has its own WAL archive, set <varname>archive_mode</varname>
+ When continuous WAL archiving is used in a standby, there are three
+ different scenarios: the standby can have its own independent WAL archive,
+ the WAL archive can be shared between the primary and standby, or archiving
+ can be coordinated between them.
+ </para>
+
+ <para>
+ For an independent archive, set <varname>archive_mode</varname>
to <literal>always</literal>, and the standby will call the archive
command for every WAL segment it receives, whether it's by restoring
- from the archive or by streaming replication. The shared archive can
- be handled similarly, but the <varname>archive_command</varname> or <varname>archive_library</varname> must
- test if the file being archived exists already, and if the existing file
- has identical contents. This requires more care in the
- <varname>archive_command</varname> or <varname>archive_library</varname>, as it must
- be careful to not overwrite an existing file with different contents,
- but return success if the exactly same file is archived twice. And
- all that must be done free of race conditions, if two servers attempt
- to archive the same file at the same time.
+ from the archive or by streaming replication.
+ </para>
+
+ <para>
+ For a shared archive where both primary and standby can write, use
+ <literal>always</literal> mode as well, but the <varname>archive_command</varname>
+ or <varname>archive_library</varname> must test if the file being archived
+ exists already, and if the existing file has identical contents. This requires
+ more care in the <varname>archive_command</varname> or <varname>archive_library</varname>,
+ as it must be careful to not overwrite an existing file with different contents,
+ but return success if the exactly same file is archived twice. And all that must
+ be done free of race conditions, if two servers attempt to archive the same file
+ at the same time.
+ </para>
+
+ <para>
+ For coordinated archiving in high availability setups, use
+ <varname>archive_mode</varname>=<literal>shared</literal>. In this mode, only
+ the primary archives WAL segments. The standby creates <literal>.ready</literal>
+ files for received segments but defers actual archiving. The primary periodically
+ sends archival status updates to the standby via streaming replication, informing
+ it which segments have been archived. The standby then marks these as archived
+ and allows them to be recycled. Upon promotion, the standby automatically starts
+ archiving any remaining WAL segments that weren't confirmed as archived by the
+ former primary. This prevents WAL history loss during failover while avoiding
+ the complexity of coordinating concurrent archiving. This mode works with cascading
+ replication, where each standby coordinates with its immediate upstream server.
</para>
<para>
If <varname>archive_mode</varname> is set to <literal>on</literal>, the
- archiver is not enabled during recovery or standby mode. If the standby
- server is promoted, it will start archiving after the promotion, but
- will not archive any WAL or timeline history files that
- it did not generate itself. To get a complete
- series of WAL files in the archive, you must ensure that all WAL is
- archived, before it reaches the standby. This is inherently true with
- file-based log shipping, as the standby can only restore files that
- are found in the archive, but not if streaming replication is enabled.
- When a server is not in recovery mode, there is no difference between
- <literal>on</literal> and <literal>always</literal> modes.
+ archiver is not enabled during recovery or standby mode, and this setting
+ cannot be used on a standby. If a standby with <literal>archive_mode</literal>
+ set to <literal>on</literal> is promoted, it will start archiving after the
+ promotion, but will not archive any WAL or timeline history files that it did
+ not generate itself. To get a complete series of WAL files in the archive, you
+ must ensure that all WAL is archived before it reaches the standby. This is
+ inherently true with file-based log shipping, as the standby can only restore
+ files that are found in the archive, but not if streaming replication is enabled.
+ </para>
+
+ <para>
+ When a server is not in recovery mode, <literal>on</literal>,
+ <literal>shared</literal>, and <literal>always</literal> modes all behave
+ identically, archiving completed WAL segments.
</para>
</sect2>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f85b5286086..e41c25e8c1e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -199,6 +199,7 @@ const struct config_enum_entry archive_mode_options[] = {
{"always", ARCHIVE_MODE_ALWAYS, false},
{"on", ARCHIVE_MODE_ON, false},
{"off", ARCHIVE_MODE_OFF, false},
+ {"shared", ARCHIVE_MODE_SHARED, false},
{"true", ARCHIVE_MODE_ON, true},
{"false", ARCHIVE_MODE_OFF, true},
{"yes", ARCHIVE_MODE_ON, true},
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 0f207ac0356..d5cbcdb45a8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -55,6 +55,9 @@
#include "utils/resowner.h"
#include "utils/timeout.h"
#include "utils/wait_event.h"
+#include "replication/walreceiver.h"
+#include "access/timeline.h"
+#include "access/xlogarchive.h"
/* ----------
@@ -108,6 +111,15 @@ static const ArchiveModuleCallbacks *ArchiveCallbacks;
static ArchiveModuleState *archive_module_state;
static MemoryContext archive_context;
+/*
+ * Last segment we successfully marked as .done. Used to optimize
+ * ProcessArchivalReport() by generating expected filenames instead
+ * of scanning the archive_status directory.
+ */
+static TimeLineID last_processed_tli = 0;
+static XLogSegNo last_processed_segno = 0;
+ArchivalReportData *ArchReport;
+static void ProcessArchivalReport(void);
/*
* Stuff for tracking multiple files to archive from each scan of
@@ -385,6 +397,18 @@ pgarch_ArchiverCopyLoop(void)
{
char xlog[MAX_XFN_CHARS + 1];
+ /*
+ * In shared archive mode during recovery, the archiver doesn't archive
+ * files. The primary is responsible for archiving, and the walreceiver
+ * marks files as .done when the primary confirms archival. After
+ * promotion, the archiver starts working normally.
+ */
+ if (XLogArchiveMode == ARCHIVE_MODE_SHARED && RecoveryInProgress())
+ {
+ ProcessArchivalReport();
+ return;
+ }
+
/* force directory scan in the first call to pgarch_readyXlog() */
arch_files->arch_files_size = 0;
@@ -960,3 +984,169 @@ pgarch_call_module_shutdown_cb(int code, Datum arg)
if (ArchiveCallbacks->shutdown_cb != NULL)
ArchiveCallbacks->shutdown_cb(archive_module_state);
}
+
+/*
+ * Process archival report from primary.
+ *
+ * The primary sends us the last WAL segment it has archived. We scan the
+ * archive_status directory for .ready files and mark segments on the same
+ * timeline as .done if they're <= the reported segment.
+ */
+static void
+ProcessArchivalReport()
+{
+ char walfile[MAX_XFN_CHARS + 1];
+ char primary_last_archived_fname[MAX_XFN_CHARS + 1];
+ char status_path[MAXPGPATH];
+// DIR *status_dir;
+// struct dirent *status_de;
+ List *tli_history = NIL;
+
+ if (ArchReport->tli == 0 || ArchReport->segno == 0)
+ {
+ ereport(DEBUG2,
+ (errmsg("archival report from upstream was not yes received")));
+ return;
+ }
+
+ XLogFileName(primary_last_archived_fname, ArchReport->tli, ArchReport->segno, wal_segment_size);
+
+ /*
+ * Optimization: If the new report is on the same timeline as the last
+ * processed segment and moves forward, we can directly check for .ready
+ * files for segments between last_processed_segno and reported_segno
+ * instead of scanning the entire archive_status directory.
+ *
+ * Fall back to directory scan if:
+ * - Timeline changed (need to handle ancestor timelines)
+ * - This is the first report (last_processed_tli == 0)
+ * - Reported segment is not ahead (nothing new to process)
+ */
+ if (last_processed_tli == ArchReport->tli &&
+ ArchReport->segno > last_processed_segno)
+ {
+ /*
+ * Direct check: generate filenames for expected segments.
+ * XLogArchiveForceDone() will handle the case where .ready doesn't
+ * exist or .done already exists, so no need to stat() first.
+ */
+ XLogSegNo segno;
+ XLogSegNo start_segno = last_processed_segno + 1;
+
+ for (segno = start_segno; segno <= ArchReport->segno; segno++)
+ {
+ char walfile[MAXFNAMELEN];
+
+ /* Generate WAL filename and mark as archived */
+ XLogFileName(walfile, ArchReport->tli, segno, wal_segment_size);
+ XLogArchiveForceDone(walfile);
+ ereport(DEBUG3,
+ (errmsg("marked WAL segment %s as archived (primary archived up to %s)",
+ walfile, primary_last_archived_fname)));
+
+ }
+ /* Track the last segment we processed */
+ last_processed_tli = ArchReport->tli;
+ last_processed_segno = segno;
+ return;
+ }
+
+ /*
+ * Directory scan: needed when timeline changed or first report.
+ * This handles both same-timeline and ancestor-timeline cases.
+ */
+ while (pgarch_readyXlog(walfile))
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Parse the WAL filename */
+ // TODO: we must handle somehow partial, .history and .backup files
+ if (!IsXLogFileName(walfile))
+ continue;
+
+ elog(WARNING, "found ready file for %s", walfile);
+
+ XLogFromFileName(walfile, &file_tli, &file_segno, wal_segment_size);
+
+ /*
+ * Mark as .done if:
+ * 1. Same timeline and segment <= reported segment, OR
+ * 2. Ancestor timeline and segment is before the timeline switch point
+ *
+ * For ancestor timelines: if primary archived segment X on timeline T,
+ * then all segments on ancestor timelines before the switch to T must
+ * have been archived (they're required to reach timeline T).
+ */
+ if (file_tli == ArchReport->tli)
+ {
+ // found walfile not yet archived by upstream, we should quit here
+ if (file_segno > ArchReport->segno)
+ {
+ elog(WARNING, "segment %s is not yet archived by upstream", walfile);
+ return;
+ }
+
+ /* Same timeline, segment already archived */
+ XLogArchiveForceDone(walfile);
+ ereport(DEBUG3,
+ (errmsg("marked WAL segment %s as archived (primary archived up to %s)",
+ walfile, primary_last_archived_fname)));
+ }
+ else
+ {
+ XLogRecPtr switchpoint;
+ XLogSegNo switchpoint_segno;
+ /*
+ * Different timeline - check if it's an ancestor and if this
+ * segment is before the timeline switch point. Only read timeline
+ * history if we haven't already (lazy loading).
+ *
+ * Note: Timelines form a tree structure, not a linear sequence,
+ * so we can't use < or > to compare them.
+ */
+
+ if (tli_history == NIL)
+ tli_history = readTimeLineHistory(ArchReport->tli);
+
+ // some garbage in archive_status, should error here
+ if (!tliInHistory(file_tli, tli_history))
+ {
+ ereport(ERROR,
+ (errmsg("walfile %s is not in this server's history", walfile)));
+ continue;
+ }
+
+ /* Get the point where we switched away from this timeline */
+ switchpoint = tliSwitchPoint(file_tli, tli_history, NULL);
+
+ /*
+ * If the segment is at or before the switch point, it must have
+ * been archived (it's required to reach the reported timeline).
+ * The segment containing the switch point belongs to the old
+ * timeline up to the switch point and should be archived.
+ */
+ XLByteToSeg(switchpoint, switchpoint_segno, wal_segment_size);
+ if (file_segno > switchpoint_segno)
+ {
+ ereport(ERROR,
+ (errmsg("walfile %s is not in this server's history", walfile)));
+ continue;
+ }
+
+ XLogArchiveForceDone(walfile);
+ ereport(DEBUG3,
+ (errmsg("marked ancestor timeline segment %s as archived (before switch to timeline %u)",
+ walfile, ArchReport->tli)));
+ }
+ /*
+ * Update our tracking to the newly reported position for future optimizations.
+ */
+ last_processed_tli = file_tli;
+ last_processed_segno = file_segno;
+ // TOOD: what if durable_rename failed in XLogArchiveForceDone ?
+ // we will erroneusly move last_processed_segno forward
+ }
+
+ //FreeDir(status_dir);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b6fd332f196..b574360056c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3401,7 +3401,9 @@ LaunchMissingBackgroundProcesses(void)
*/
if (PgArchPMChild == NULL &&
((XLogArchivingActive() && pmState == PM_RUN) ||
- (XLogArchivingAlways() && (pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY))) &&
+ (XLogArchivingAlways() && (pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY)) ||
+ (XLogArchivingShared())
+ ) &&
PgArchCanRestart())
PgArchPMChild = StartChildProcess(B_ARCHIVER);
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8185412a810..98ff3e4f03c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -636,7 +636,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
* Create .done file forcibly to prevent the streamed segment from
* being archived later.
*/
- if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+ if (XLogArchiveMode < ARCHIVE_MODE_ALWAYS)
XLogArchiveForceDone(xlogfname);
else
XLogArchiveNotify(xlogfname);
@@ -887,6 +887,19 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
XLogWalRcvSendReply(true, false, false);
break;
}
+ case PqReplMsg_ArchiveStatusReport:
+ {
+ TimeLineID received_tli;
+ XLogSegNo received_segno;
+ StringInfoData incoming_message;
+
+ /* initialize a StringInfo with the given buffer */
+ initReadOnlyStringInfo(&incoming_message, buf, sizeof(int64) + sizeof(int64));
+ received_tli = (TimeLineID) pq_getmsgint64(&incoming_message);
+ received_segno = (XLogSegNo) pq_getmsgint64(&incoming_message);
+ StoreArchivalReport(received_tli, received_segno);
+ break;
+ }
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -1094,12 +1107,39 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
/*
* Create .done file forcibly to prevent the streamed segment from being
- * archived later.
+ * archived later, unless archive_mode is 'always' or 'shared'.
+ *
+ * In 'always' mode, the standby archives independently.
+ *
+ * In 'shared' mode, we optimize by checking if this segment is already
+ * covered by the last archival report from the primary. If so, create
+ * .done directly. Otherwise, create .ready and wait for the next report.
*/
- if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
- XLogArchiveForceDone(xlogfname);
- else
+ if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+ {
XLogArchiveNotify(xlogfname);
+ }
+ else if (XLogArchiveMode == ARCHIVE_MODE_SHARED)
+ {
+ /*
+ * In shared mode, check if this segment is already archived on primary.
+ * If we're on the same timeline and this segment is <= last archived,
+ * mark it .done immediately. Otherwise create .ready.
+ *
+ * We don't check ancestor timeline cases here to avoid reading timeline
+ * history files on every segment close. ProcessArchivalReport() will
+ * handle marking ancestor timeline segments as .done when it scans
+ * the archive_status directory.
+ */
+ if (IsSegnoArchivedByUpstream(recvFileTLI, recvSegNo))
+ XLogArchiveForceDone(xlogfname);
+ else
+ XLogArchiveNotify(xlogfname);
+ }
+ else
+ {
+ XLogArchiveForceDone(xlogfname);
+ }
recvFile = -1;
}
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index bd5d47be964..c8c6de51619 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -34,6 +34,7 @@
#include "utils/wait_event.h"
WalRcvData *WalRcv = NULL;
+//ArchivalReportData *ArchReport;
static void WalRcvShmemRequest(void *arg);
static void WalRcvShmemInit(void *arg);
@@ -57,6 +58,11 @@ WalRcvShmemRequest(void *arg)
.size = sizeof(WalRcvData),
.ptr = (void **) &WalRcv,
);
+
+ ShmemRequestStruct(.name = "Archival Report Data",
+ .size = sizeof(ArchivalReportData),
+ .ptr = (void **) &ArchReport,
+ );
}
/* Initialize walreceiver-related shared memory */
@@ -69,6 +75,8 @@ WalRcvShmemInit(void *arg)
SpinLockInit(&WalRcv->mutex);
pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
WalRcv->procno = INVALID_PROC_NUMBER;
+
+ MemSet(ArchReport, 0, sizeof(ArchivalReportData));
}
/* Is walreceiver running (or starting up)? */
@@ -422,3 +430,47 @@ GetReplicationTransferLatency(void)
return TimestampDifferenceMilliseconds(lastMsgSendTime,
lastMsgReceiptTime);
}
+
+/*
+ * Is segment was already archived by master?
+ * Can be used only if archive_mode=shared.
+ */
+bool
+IsSegnoArchivedByUpstream(TimeLineID tli, XLogSegNo segno)
+{
+ elog(WARNING, "received tli: %d, segno: %ld", tli, segno);
+ elog(WARNING, "ArchReport->tli: %d, ArchReport->segno: %ld", ArchReport->tli, ArchReport->segno);
+ return tli == ArchReport->tli && segno <= ArchReport->segno;
+}
+
+/*
+ * Store archival report data in shared memory for later use by
+ * walreceiver and archiver.
+ */
+void
+StoreArchivalReport(TimeLineID tli, XLogSegNo segno)
+{
+ ereport(WARNING,
+ (errmsg("received archival report from primary: tli %u, segno %ld",
+ tli, segno)));
+
+ if (tli == 0 || segno == 0)
+ {
+ ereport(WARNING,
+ (errmsg("invalid values in archival report: tli %d, segno %ld",
+ tli, segno)));
+ return;
+ }
+
+ /* nothing is changed */
+ if (tli == ArchReport->tli && segno <= ArchReport->segno)
+ return;
+
+ /* Remember the last archived segment for XLogWalRcvClose() */
+ ArchReport->tli = tli;
+ ArchReport->segno = segno;
+
+ /* Notify archiver that it's got something to do */
+ if (IsUnderPostmaster)
+ PgArchWakeup();
+}
\ No newline at end of file
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3d4ab929f91..09a8aeb33a4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -183,6 +183,18 @@ static TimeLineID sendTimeLineNextTLI = 0;
static bool sendTimeLineIsHistoric = false;
static XLogRecPtr sendTimeLineValidUpto = InvalidXLogRecPtr;
+/*
+ * Last archived WAL file. This is fetched from pgstat periodically and sent
+ * to the standby. last_archival_report_timestamp tracks when we last sent
+ * the report to avoid excessive pgstat access.
+ */
+static TimeLineID last_archived_tli = 0;
+static XLogSegNo last_archived_segno = 0;
+static TimestampTz last_archival_report_timestamp = 0;
+
+/* Interval for sending archival reports (10 seconds) */
+#define ARCHIVAL_REPORT_INTERVAL 10000
+
/*
* How far have we sent WAL already? This is also advertised in
* MyWalSnd->sentPtr. (Actually, this is the next WAL location to send.)
@@ -297,6 +309,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessStandbyPSRequestMessage(void);
+static void WalSndArchivalReport(void);
static void ProcessRepliesIfAny(void);
static void ProcessPendingWrites(void);
static void WalSndKeepalive(bool requestReply, XLogRecPtr writePtr);
@@ -2799,6 +2812,91 @@ ProcessStandbyHSFeedbackMessage(void)
}
}
+/*
+ * Send archival status report to standby.
+ *
+ * This is called periodically during physical replication to inform the
+ * standby about the last WAL segment archived by the primary. The standby
+ * can then mark segments up to that point as .done, allowing them to be
+ * recycled. This prevents WAL loss during standby promotion.
+ */
+static void
+WalSndArchivalReport(void)
+{
+ PgStat_ArchiverStats *archiver_stats;
+ TimestampTz now;
+
+ /* Only send reports when archive_mode=shared */
+ if (XLogArchiveMode != ARCHIVE_MODE_SHARED)
+ return;
+
+ /* Only send reports during physical streaming replication, not during backup */
+ if (MyWalSnd->kind != REPLICATION_KIND_PHYSICAL)
+ return;
+ if (MyWalSnd->state != WALSNDSTATE_CATCHUP &&
+ MyWalSnd->state != WALSNDSTATE_STREAMING)
+ return;
+
+ /*
+ * Don't send to temporary replication slots (used by pg_basebackup).
+ * Connections without slots (regular standbys) are OK.
+ */
+ if (MyReplicationSlot != NULL &&
+ MyReplicationSlot->data.persistency == RS_TEMPORARY)
+ return;
+
+ now = GetCurrentTimestamp();
+
+ /*
+ * Send report at most once per ARCHIVAL_REPORT_INTERVAL (10 seconds).
+ * This avoids excessive pgstat access.
+ */
+ if (now < TimestampTzPlusMilliseconds(last_archival_report_timestamp,
+ ARCHIVAL_REPORT_INTERVAL))
+ return;
+ last_archival_report_timestamp = now;
+ /*
+ * Get archiver statistics. We use non-blocking access to avoid delaying
+ * replication if stats collector is slow. If stats are unavailable or
+ * stale, we'll just try again at the next interval.
+ */
+ pgstat_clear_snapshot();
+ archiver_stats = pgstat_fetch_stat_archiver();
+ if (archiver_stats == NULL)
+ return;
+
+ /* Only send reports for WAL segments, not backup history files or other archived files */
+ if (!IsXLogFileName(archiver_stats->last_archived_wal))
+ return;
+
+ /*
+ * Only send a report if the last archived WAL has changed. This is both
+ * an optimization and ensures we don't send empty reports on startup.
+ */
+ {
+ TimeLineID tli;
+ XLogSegNo segno;
+ XLogFromFileName(archiver_stats->last_archived_wal, &tli, &segno, wal_segment_size);
+ if (last_archived_tli == tli && last_archived_segno == segno)
+ return;
+
+ /* Remember what we are about to sent */
+ last_archived_tli = tli;
+ last_archived_segno = segno;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("sending archival report: tli %d, segno %ld", last_archived_segno)));
+
+ /* Construct the message... */
+ resetStringInfo(&output_message);
+ pq_sendbyte(&output_message, PqReplMsg_ArchiveStatusReport);
+ pq_sendint64(&output_message, (int64) last_archived_tli);
+ pq_sendint64(&output_message, (int64) last_archived_segno);
+ /* ... and send it wrapped in CopyData */
+ pq_putmessage_noblock(PqMsg_CopyData, output_message.data, output_message.len);
+}
+
/*
* Process the request for a primary status update message.
*/
@@ -4350,6 +4448,9 @@ WalSndKeepaliveIfNecessary(void)
{
TimestampTz ping_time;
+ /* Send archival status report if needed */
+ WalSndArchivalReport();
+
/*
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 437b4f32349..98a94785ebf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -67,6 +67,7 @@ typedef enum ArchiveMode
ARCHIVE_MODE_OFF = 0, /* disabled */
ARCHIVE_MODE_ON, /* enabled while server is running normally */
ARCHIVE_MODE_ALWAYS, /* enabled always (even during recovery) */
+ ARCHIVE_MODE_SHARED, /* shared archive between primary and standby */
} ArchiveMode;
extern PGDLLIMPORT int XLogArchiveMode;
@@ -104,6 +105,9 @@ extern PGDLLIMPORT bool XLogLogicalInfo;
/* Is WAL archiving enabled always (even during recovery)? */
#define XLogArchivingAlways() \
(AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+#define XLogArchivingShared() \
+ (AssertMacro(XLogArchiveMode == ARCHIVE_MODE_OFF || wal_level >= WAL_LEVEL_REPLICA), XLogArchiveMode == ARCHIVE_MODE_SHARED)
+
/*
* Is WAL-logging necessary for archival or log-shipping, or can we skip
diff --git a/src/include/libpq/protocol.h b/src/include/libpq/protocol.h
index eae8f0e7238..d22aaf9e225 100644
--- a/src/include/libpq/protocol.h
+++ b/src/include/libpq/protocol.h
@@ -72,6 +72,7 @@
/* Replication codes sent by the primary (wrapped in CopyData messages). */
+#define PqReplMsg_ArchiveStatusReport 'a'
#define PqReplMsg_Keepalive 'k'
#define PqReplMsg_PrimaryStatusUpdate 's'
#define PqReplMsg_WALData 'w'
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 47c07574d4d..2888dfb55ac 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -195,6 +195,14 @@ typedef struct
struct WalReceiverConn;
typedef struct WalReceiverConn WalReceiverConn;
+typedef struct ArchivalReportData
+{
+ TimeLineID tli;
+ XLogSegNo segno;
+} ArchivalReportData;
+/* Last archived WAL segment file reported by the primary */
+extern ArchivalReportData *ArchReport;
+
/*
* Status of walreceiver query execution.
*
@@ -502,5 +510,7 @@ extern XLogRecPtr GetWalRcvFlushRecPtr(XLogRecPtr *latestChunkStart, TimeLineID
extern XLogRecPtr GetWalRcvWriteRecPtr(void);
extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
+extern bool IsSegnoArchivedByUpstream(TimeLineID tli, XLogSegNo segno);
+extern void StoreArchivalReport(TimeLineID tli, XLogSegNo segno);
#endif /* _WALRECEIVER_H */
diff --git a/src/test/recovery/t/053_archive_shared.pl b/src/test/recovery/t/053_archive_shared.pl
new file mode 100644
index 00000000000..397b71ad79d
--- /dev/null
+++ b/src/test/recovery/t/053_archive_shared.pl
@@ -0,0 +1,270 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test archive_mode=shared for coordinated WAL archiving between primary and standby
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Path qw(rmtree);
+
+# Initialize primary node with archiving
+my $archive_dir = PostgreSQL::Test::Utils::tempdir();
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_keep_size = 128MB
+");
+$primary->start;
+
+# Create a test table and generate some WAL
+$primary->safe_psql('postgres', 'CREATE TABLE test_table (id int, data text);');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1, 500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(501, 1000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for archiver to archive segments
+$primary->poll_query_until('postgres',
+ "SELECT archived_count > 0 FROM pg_stat_archiver")
+ or die "Timed out waiting for archiver to start";
+
+my $archived_count = () = glob("$archive_dir/*");
+ok($archived_count > 0, "primary has archived WAL files to shared archive");
+note("Primary archived $archived_count files");
+
+# Take backup for standby
+my $backup_name = 'standby_backup';
+$primary->backup($backup_name);
+
+# Exclude possible race condition when backup WAL is last archived
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(501, 1000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Set up standby with archive_mode=shared
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($primary, $backup_name, has_streaming => 1);
+$standby->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby->start;
+
+# Wait for standby to catch up
+$primary->wait_for_catchup($standby);
+
+# Generate more WAL on primary (these are new segments not yet archived)
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1001, 1500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'data' || i FROM generate_series(1501, 2000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for standby to receive the new WAL
+$primary->wait_for_catchup($standby);
+
+# Check that standby has .ready or .done files for the newly received segments.
+# Normally they should be .ready (not yet archived by primary), but in rare cases
+# the archiver could be very fast and an archive report sent immediately, creating
+# .done files instead. Both are correct behavior - the key is that files exist.
+my $standby_archive_status = $standby->data_dir . '/pg_wal/archive_status';
+my $status_count = 0;
+if (opendir(my $dh, $standby_archive_status))
+{
+ my @files = grep { /\.(ready|done)$/ } readdir($dh);
+ $status_count = scalar(@files);
+ my $ready_count = scalar(grep { /\.ready$/ } @files);
+ my $done_count = scalar(grep { /\.done$/ } @files);
+ note("Standby has $ready_count .ready files and $done_count .done files");
+ closedir($dh);
+}
+cmp_ok($status_count, '>', 0, "standby creates archive status files for received WAL");
+
+# Generate more WAL and wait for archiving on primary
+my $initial_archived = $primary->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'more-data' || i FROM generate_series(2001, 2500) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$primary->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'more-data2' || i FROM generate_series(2501, 3000) i;");
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for primary to archive the new segments
+$primary->poll_query_until('postgres',
+ "SELECT archived_count > $initial_archived FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive new segments";
+
+# Wait for standby to catch up (archive status is sent during replication)
+$primary->wait_for_catchup($standby);
+
+# Wait for primary to send archival status updates and standby to process them
+# The standby should mark segments as .done after receiving archive status from primary
+my $done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $done_count = 0;
+ if (opendir(my $dh, $standby_archive_status))
+ {
+ $done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $done_count > 0;
+ sleep(1);
+}
+ok($done_count > 0, "standby marked segments as .done after primary's archival report");
+note("Standby has $done_count .done files");
+
+###############################################################################
+# Test 2: Standby promotion - verify archiver activates
+###############################################################################
+
+# Before promotion, verify archiver is not running on standby (shared mode during recovery)
+# In shared mode, the standby's archiver should not be archiving during recovery
+my $archived_before = $standby->safe_psql('postgres',
+ "SELECT archived_count FROM pg_stat_archiver");
+is($archived_before, '0',
+ "archiver not active on standby before promotion (archived_count=0)");
+
+# Verify standby is still in recovery before promoting
+my $in_recovery = $standby->safe_psql('postgres', "SELECT pg_is_in_recovery();");
+is($in_recovery, 't', "standby is in recovery before promotion");
+
+# Promote the standby
+$standby->promote;
+$standby->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery();");
+
+# Generate WAL on new primary (former standby)
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'post-promotion' || i FROM generate_series(2001, 2500) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for archiver to activate and archive the new WAL
+# Check pg_stat_archiver to verify archiving is happening
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > 0 FROM pg_stat_archiver")
+ or die "Timed out waiting for promoted standby to start archiving";
+pass("promoted standby started archiving");
+
+# Verify data integrity
+my $count = $standby->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($count >= 2500, "promoted standby has all data (got $count rows)");
+
+###############################################################################
+# Test 3: Cascading replication
+###############################################################################
+
+# Take a backup from the promoted standby (now the new primary)
+my $promoted_backup = 'promoted_backup';
+$standby->backup($promoted_backup);
+
+# Set up second-level standby (cascading from first standby, now promoted)
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup($standby, $promoted_backup, has_streaming => 1);
+$standby2->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby2->start;
+
+# Generate WAL on promoted standby (now primary for standby2)
+my $cascading_archived_before = $standby->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'cascading' || i FROM generate_series(2501, 3000) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for the promoted standby (acting as primary) to archive the new segment
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > $cascading_archived_before FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive segment in cascading test";
+
+# Wait for cascading standby to catch up
+$standby->wait_for_catchup($standby2);
+
+# Wait for cascading standby to receive archive status and mark segments as .done
+my $standby2_archive_status = $standby2->data_dir . '/pg_wal/archive_status';
+my $standby2_done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby2_done_count = 0;
+ if (opendir(my $dh, $standby2_archive_status))
+ {
+ $standby2_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby2_done_count > 0;
+ sleep(1);
+}
+ok($standby2_done_count > 0, "cascading standby marks segments as .done");
+note("Cascading standby has $standby2_done_count .done files");
+
+# Verify cascading standby has all data
+my $standby2_count = $standby2->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($standby2_count >= 3000, "cascading standby has all data (got $standby2_count rows)");
+
+###############################################################################
+# Test 4: Multiple standbys from same primary
+###############################################################################
+
+# Create third standby from promoted standby (current primary)
+my $standby3 = PostgreSQL::Test::Cluster->new('standby3');
+my $backup2 = 'multi_standby_backup';
+$standby->backup($backup2);
+$standby3->init_from_backup($standby, $backup2, has_streaming => 1);
+$standby3->append_conf('postgresql.conf', "
+archive_mode = shared
+archive_command = 'cp %p \"$archive_dir\"/%f'
+wal_receiver_status_interval = 1s
+");
+$standby3->start;
+
+# Generate WAL and ensure both standbys receive it
+my $standby_archived_before = $standby->safe_psql('postgres', 'SELECT archived_count FROM pg_stat_archiver');
+$standby->safe_psql('postgres', "INSERT INTO test_table SELECT i, 'multi' || i FROM generate_series(3001, 3500) i;");
+$standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
+
+# Wait for the promoted standby (acting as primary) to archive the new segment
+$standby->poll_query_until('postgres',
+ "SELECT archived_count > $standby_archived_before FROM pg_stat_archiver")
+ or die "Timed out waiting for primary to archive segment in multi-standby test";
+
+$standby->wait_for_catchup($standby2);
+$standby->wait_for_catchup($standby3);
+
+# Verify both standbys eventually mark segments as .done
+my $standby3_archive_status = $standby3->data_dir . '/pg_wal/archive_status';
+
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby2_done_count = 0;
+ if (opendir(my $dh, $standby2_archive_status))
+ {
+ $standby2_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby2_done_count > 0;
+ sleep(1);
+}
+
+my $standby3_done_count = 0;
+for (my $i = 0; $i < $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+ $standby3_done_count = 0;
+ if (opendir(my $dh, $standby3_archive_status))
+ {
+ $standby3_done_count = scalar(grep { /\.done$/ } readdir($dh));
+ closedir($dh);
+ }
+ last if $standby3_done_count > 0;
+ sleep(1);
+}
+
+ok($standby2_done_count > 0, "standby2 marks segments as .done");
+ok($standby3_done_count > 0, "standby3 marks segments as .done");
+note("standby2 has $standby2_done_count .done files, standby3 has $standby3_done_count .done files");
+
+# Verify both standbys have all data
+$standby2_count = $standby2->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+my $standby3_count = $standby3->safe_psql('postgres', 'SELECT COUNT(*) FROM test_table;');
+ok($standby2_count >= 3500, "standby2 has all data (got $standby2_count rows)");
+ok($standby3_count >= 3500, "standby3 has all data (got $standby3_count rows)");
+
+done_testing();
^ permalink raw reply [nested|flat] 5+ messages in thread
end of thread, other threads:[~2026-05-03 22:50 UTC | newest]
Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-12 06:56 Re: Streaming replication and WAL archive interactions Andrey Borodin <[email protected]>
2026-02-20 20:57 ` Harinath Kanchu <[email protected]>
2026-03-03 10:43 ` Jaroslav Novikov <[email protected]>
2026-03-03 12:06 ` Jaroslav Novikov <[email protected]>
2026-05-03 22:50 ` Grigory Smolkin <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox