120+ messages / 8 participants
* Optimize LISTEN/NOTIFY
@ 2025-07-12 22:35 Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-07-12 22:35 UTC (permalink / raw)
To: pgsql-hackers
Hi hackers,
The current LISTEN/NOTIFY implementation is well-suited for use-cases like
cache invalidation where many backends listen on the same channel. However,
its scalability is limited when many backends listen on distinct
channels. The root of the problem is that Async_Notify must signal every
listening backend in the database, as it lacks central knowledge of which
backend is interested in which channel. This results in O(N)
kill(pid, SIGUSR1) syscalls per NOTIFY, where N is the number of listening
backends.
The attached proof-of-concept patch proposes a straightforward
optimization for the single-listener case. It introduces a shared-memory
hash table mapping (dboid, channelname) to the ProcNumber of a single
listener. When NOTIFY is issued, we first check this table. If a single
listener is found, we signal only that backend. Otherwise, we fall back to
the existing broadcast behavior.
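The lookup-then-fallback decision described above can be sketched as a toy model. This is not the patch's code: a fixed-size array stands in for the shared-memory hash table, there is no locking, and every name (ToyChannelEntry, toy_listen, toy_notify_target, NO_LISTENER) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHANNELS 64
#define NAME_LEN     64
#define NO_LISTENER  (-1)       /* hypothetical "broadcast instead" marker */

/* Toy stand-in for a hash entry keyed by (dboid, channel name). */
typedef struct ToyChannelEntry
{
    bool     in_use;
    unsigned dboid;
    char     channel[NAME_LEN];
    int      listener;               /* meaningful only for a single listener */
    bool     has_multiple_listeners;
} ToyChannelEntry;

static ToyChannelEntry toy_table[MAX_CHANNELS];

static ToyChannelEntry *
toy_lookup(unsigned dboid, const char *channel)
{
    for (int i = 0; i < MAX_CHANNELS; i++)
    {
        if (toy_table[i].in_use && toy_table[i].dboid == dboid &&
            strcmp(toy_table[i].channel, channel) == 0)
            return &toy_table[i];
    }
    return NULL;
}

/* Register a listener. Returns false when the table is full, in which
 * case the channel stays untracked and NOTIFY has to broadcast. */
static bool
toy_listen(unsigned dboid, const char *channel, int procno)
{
    ToyChannelEntry *e = toy_lookup(dboid, channel);

    if (e != NULL)
    {
        /* A second, different backend flips the entry to "multiple". */
        if (!e->has_multiple_listeners && e->listener != procno)
        {
            e->has_multiple_listeners = true;
            e->listener = NO_LISTENER;
        }
        return true;
    }

    for (int i = 0; i < MAX_CHANNELS; i++)
    {
        if (!toy_table[i].in_use)
        {
            toy_table[i].in_use = true;
            toy_table[i].dboid = dboid;
            strncpy(toy_table[i].channel, channel, NAME_LEN - 1);
            toy_table[i].channel[NAME_LEN - 1] = '\0';
            toy_table[i].listener = procno;
            toy_table[i].has_multiple_listeners = false;
            return true;
        }
    }
    return false;
}

/* Returns the single backend to signal, or NO_LISTENER when the caller
 * must fall back to signaling every listener in the database. */
static int
toy_notify_target(unsigned dboid, const char *channel)
{
    ToyChannelEntry *e = toy_lookup(dboid, channel);

    if (e == NULL || e->has_multiple_listeners)
        return NO_LISTENER;
    return e->listener;
}
```

Note that an untracked channel and a multi-listener channel both lead to the same broadcast fallback, so a missing entry never causes a lost signal.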
The performance impact for this pattern is significant. A benchmark [1]
measuring a NOTIFY "ping-pong" between two connections, while adding a
variable number of idle listeners, shows the following:
master (8893c3a):
0 extra listeners: 9126 TPS
10 extra listeners: 6233 TPS
100 extra listeners: 2020 TPS
1000 extra listeners: 238 TPS
0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch:
0 extra listeners: 9152 TPS
10 extra listeners: 9352 TPS
100 extra listeners: 9320 TPS
1000 extra listeners: 8937 TPS
As you can see, the patched version's performance is near O(1) with respect
to the number of idle listeners, while the current implementation shows the
expected O(N) degradation.
This patch is a first step. It uses a simple boolean has_multiple_listeners
flag in the hash entry. Once a channel gets a second listener, this flag is
set and, crucially, never cleared. The entry will then permanently indicate
"multiple listeners", even after all backends on that channel disconnect.
A more complete solution would likely use reference counting for each
channel's listeners. This would solve the "stuck entry" problem and could
also enable a further optimization: targeted signaling to all listeners of a
multi-user channel, avoiding the database-wide broadcast entirely.
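To illustrate how reference counting would avoid the stuck entry, here is a minimal sketch under simplifying assumptions: each call is assumed to come from a distinct backend, nothing is locked, and the table is a plain array. The names (RcChannel, rc_listen, rc_unlisten) are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <string.h>

#define RC_MAX_CHANNELS 32
#define RC_NAME_LEN     64

/* A slot is considered free whenever nlisteners is zero. */
typedef struct RcChannel
{
    char name[RC_NAME_LEN];
    int  nlisteners;
} RcChannel;

static RcChannel rc_table[RC_MAX_CHANNELS];

static RcChannel *
rc_find(const char *name)
{
    for (int i = 0; i < RC_MAX_CHANNELS; i++)
    {
        if (rc_table[i].nlisteners > 0 &&
            strcmp(rc_table[i].name, name) == 0)
            return &rc_table[i];
    }
    return NULL;
}

/* Returns the new listener count, or -1 if no slot is available. */
static int
rc_listen(const char *name)
{
    RcChannel *c = rc_find(name);

    if (c != NULL)
        return ++c->nlisteners;

    for (int i = 0; i < RC_MAX_CHANNELS; i++)
    {
        if (rc_table[i].nlisteners == 0)
        {
            strncpy(rc_table[i].name, name, RC_NAME_LEN - 1);
            rc_table[i].name[RC_NAME_LEN - 1] = '\0';
            rc_table[i].nlisteners = 1;
            return 1;
        }
    }
    return -1;
}

/* Returns the remaining listener count. At zero the slot becomes
 * reusable, so the entry cannot get stuck. */
static int
rc_unlisten(const char *name)
{
    RcChannel *c = rc_find(name);

    if (c == NULL)
        return 0;
    return --c->nlisteners;
}
```

With a count instead of a one-way flag, a channel that drops from two listeners back to one becomes eligible for targeted signaling again. Signaling all listeners of a multi-listener channel would additionally require storing which backends listen, which this sketch leaves out.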
The patch also includes a "wake only tail" optimization (contributed by
Marko Tikkaja) to help prevent backends from falling too far behind.
Instead of waking all lagging backends at once and creating a "thundering
herd", this logic signals only the single backend that is currently at the
queue tail. This ensures the global queue tail can always advance, relying
on a chain reaction to get backends caught up efficiently. This seems like
a sensible improvement in its own right.
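The selection step of "wake only tail" can be sketched as follows. This is a simplification: queue positions are modeled as plain integers rather than the (page, offset) pairs async.c uses, and pick_tail_backend is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the backend that is furthest behind the queue head,
 * or -1 if every backend has already caught up. Only this one backend
 * would be signaled; once it advances, the next-furthest backend becomes
 * the new tail and is picked on a later pass.
 */
static int
pick_tail_backend(const long *pos, size_t n, long head)
{
    int tail = -1;

    for (size_t i = 0; i < n; i++)
    {
        if (pos[i] >= head)
            continue;           /* already caught up, nothing to do */
        if (tail < 0 || pos[i] < pos[tail])
            tail = (int) i;
    }
    return tail;
}
```

For example, with positions {10, 3, 7, 10} and the head at 10, only the backend at index 1 is chosen, rather than both lagging backends.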
Thoughts?
/Joel
[1] Benchmark tool and full results: https://github.com/joelonsql/pg-bench-listen-notify
Attachments:
[application/octet-stream] 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch (24.2K, 2-0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch)
download | inline diff:
From aba0ffb2a9e1c5d77393a92c0ce43a968c23cbb5 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 15 Jun 2025 00:09:43 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY signaling for single-listener channels
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the implementation would signal every backend process that was
listening on any channel in the same database. This signaling is performed via
SendProcSignal(), which ultimately issues a kill(pid, SIGUSR1) syscall for each
listening backend.
This broadcast approach is well-suited for use cases like cache invalidation but
limits the scalability of application patterns where backends listen on distinct
channels. For example, a system of worker processes might use unique channel
names to direct work to a specific worker. In these scenarios, a NOTIFY intended
for a single listener unnecessarily triggers a syscall for every other listening
backend.
This commit improves scalability for such workloads by optimizing for
the single-listener case. By making this pattern more performant, we enable it
to be used more effectively in high-throughput systems, pushing PostgreSQL's
scalability limits for this class of applications. A new shared memory hash
table is introduced to track which backend process is listening on each channel.
When a NOTIFY is issued, if a channel has exactly one registered listener, we
can signal that specific backend directly.
The system gracefully falls back to broadcast behavior under two conditions:
1. When a channel has multiple backends listening to it.
2. If the shared hash table runs out of memory and cannot create a new entry.
To support this, the LISTEN and UNLISTEN commands, as well as the backend exit
cleanup logic in asyncQueueUnregister, are updated to manage entries in the new
channel hash table. The main signaling logic in SignalBackends has been reworked
to implement the targeted-vs-broadcast decision.
To ensure the global queue tail can always advance, this change also includes a
"wake only tail" optimization, contributed by Marko Tikkaja (johto). Instead
of waking all backends that are lagging far behind, this logic specifically
signals only the backend that is currently at the queue tail. This targeted
wake-up prevents a "thundering herd" of signals and relies on a chain
reaction—where each backend wakes the next—to process the queue efficiently.
This mechanism works in conjunction with both the new targeted signaling and
the broadcast fallback.
CAVEAT: This patch should be considered a first-step, proof-of-concept
optimization. It uses a simple boolean flag to distinguish single-listener
channels from multi-listener ones and does not track the full list of backends
for a multi-listener channel. As a result, it cannot remove a hash entry for
a channel once it has been marked as having multiple listeners, causing such
entries to persist even after all listeners have departed. A more complete
solution would likely involve reference counting to track all listening backends
for each channel. This would not only prevent stuck hash entries but could also
enable targeted signaling to all listeners of a specific multi-user channel,
further refining the optimization and avoiding the fallback to a full
database-wide broadcast.
---
src/backend/commands/async.c | 572 ++++++++++++++++++++++++++++++++---
1 file changed, 537 insertions(+), 35 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..a0b7daaef7d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,11 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks channels with single listeners.
+ * When a channel has exactly one listening backend, we can signal just that
+ * backend. For channels with multiple listeners, we signal all listening
+ * backends.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +74,16 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which has two modes of operation, depending on
+ * if any of our channels have multiple listening backends or not:
+ * a) If there are multiple listening backends, a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to every listening backend.
+ * b) Otherwise, such signals are only sent to each single listening backend
+ * per channel.
+ * Additionally, we use a "wake only tail" optimization: we always signal
+ * the backend furthest behind in the queue to help prevent backends from
+ * getting far behind and create a chain reaction of wake-ups.
+ * We can exclude backends that are already up to date, though.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -146,6 +152,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +169,58 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backend is
+ * listening on each channel. Channels are identified by database OID and
+ * channel name, making them database-specific.
+ *
+ * When exactly one backend listens on a channel, we signal that specific
+ * backend, avoiding unnecessary signals to all listening backends.
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) Multiple backends listen on the same channel, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and a
+ * single backend ProcNumber that is listening on that channel. If multiple
+ * backends try to listen on the same channel, we mark it as having multiple
+ * listeners and fall back to broadcast behavior.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ ProcNumber listener; /* single backend ID, or INVALID_PROC_NUMBER
+ * if multiple */
+ bool has_multiple_listeners;
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -293,6 +352,39 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+/* Channel hash table for single listening backend signalling */
+static HTAB *channelHash = NULL;
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ /* Set up to attach to the existing shared hash table */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -458,6 +550,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveBackendFromAll(ProcNumber procno);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -492,6 +592,9 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ sizeof(ChannelEntry)));
+
return size;
}
@@ -546,6 +649,23 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ }
}
@@ -1043,6 +1163,7 @@ Exec_ListenPreCommit(void)
QueuePosition head;
QueuePosition max;
ProcNumber prevListener;
+ ListCell *p;
/*
* Nothing to do if we are already listening to something, nor if we
@@ -1110,6 +1231,18 @@ Exec_ListenPreCommit(void)
QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
QUEUE_FIRST_LISTENER = MyProcNumber;
}
+
+ /*
+ * Add all our channels to the channel hash table while we still hold
+ * exclusive lock on NotifyQueueLock.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashAddListener(channel, MyProcNumber);
+ }
+
LWLockRelease(NotifyQueueLock);
/* Now we are listed in the global array, so remember we're listening */
@@ -1152,6 +1285,10 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ ChannelHashAddListener(channel, MyProcNumber);
+ LWLockRelease(NotifyQueueLock);
}
/*
@@ -1175,6 +1312,10 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ LWLockRelease(NotifyQueueLock);
break;
}
}
@@ -1239,6 +1380,9 @@ asyncQueueUnregister(void)
* Need exclusive lock here to manipulate list links.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ ChannelHashRemoveBackendFromAll(MyProcNumber);
+
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
@@ -1565,12 +1709,18 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Selective mode: When all pending notification channels have exactly one
+ * listener each, we signal only those specific backends that are listening
+ * on the channels with pending notifications.
+ * 2. Broadcast mode: When any channel has multiple listeners (or we ran out
+ * of shared memory for the channel hash table), we signal all listening
+ * backends in our database.
+ *
+ * In addition to the channel-specific signaling, we also implement a "wake
+ * only tail" optimization: we signal the backend that is furthest behind
+ * in the queue to help prevent backends from getting far behind and create
+ * a chain reaction of wake-ups. This avoids thundering herd problems.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1733,11 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool tail_woken = false;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1749,159 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * Check if any channel has multiple listeners, in which case we would
+ * need to signal all backends anyway.
+ */
+ foreach(p, channels)
+ {
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry = ChannelHashLookup(channel);
+
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table, so we need to
+ * broadcast in that case as well.
+ */
+ if (!entry || entry->has_multiple_listeners)
+ {
+ broadcast_mode = true;
+ break;
+ }
+ }
+
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Always signal listeners in our own database, unless they're
+ * already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+ else
+ {
+ /*
+ * Signal specific listening backends
+ */
+ foreach(p, channels)
+ {
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry = ChannelHashLookup(channel);
+
+ ProcNumber i = entry->listener;
+ int32 pid;
+ QueuePosition pos;
+
+ Assert(entry && !entry->has_multiple_listeners);
+
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Skip signaling listeners if they already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+
+ /*
+ * Also check for any backends that are far behind. This ensures the
+ * global tail can advance even if they're not actively receiving
+ * notifications on their channels.
+ */
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
- int32 pid = QUEUE_BACKEND_PID(i);
+ int32 pid;
QueuePosition pos;
- Assert(pid != InvalidPid);
+ /*
+ * Skip if we've already decided to signal this one.
+ */
+ if (signaled[i])
+ continue;
+
pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
- {
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
- }
+
+ /*
+ * Skip signaling listeners if they already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ /*
+ * Wake only tail optimization: Signal the backend that is furthest
+ * behind to help prevent backends from getting far behind in the
+ * first place. This creates a chain reaction where each backend
+ * eventually wakes up the next one as notifications are processed,
+ * avoiding thundering herd.
+ *
+ * Otherwise, only skip signaling listeners if they are not far
+ * behind.
+ */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ tail_woken = true;
else
- {
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
- }
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
/* OK, need to signal this one */
pids[count] = pid;
procnos[count] = i;
count++;
+
+
}
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1657,6 +1931,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -2395,3 +2670,230 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel
+ * in the shared channel hash table.
+ *
+ * Caller must hold exclusive NotifyQueueLock.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up or create the channel entry */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_ENTER_NULL,
+ &found);
+
+ /*
+ * If hash_search returned NULL, we've run out of shared memory to
+ * allocate new hash entries. We gracefully degrade by not tracking this
+ * channel in the hash. The channel will use the fallback broadcast
+ * signalling.
+ */
+ if (entry == NULL)
+ {
+ ereport(DEBUG1,
+ (errmsg("too many notification channels are already being tracked")));
+ return;
+ }
+
+ if (!found)
+ {
+ /* New channel, initialize the entry */
+ memcpy(&entry->key, &key, sizeof(ChannelHashKey));
+ entry->listener = procno;
+ entry->has_multiple_listeners = false;
+ }
+ else
+ {
+ /* Channel already exists */
+ if (!entry->has_multiple_listeners)
+ {
+ if (entry->listener == procno)
+ return; /* Already listening */
+
+ /*
+ * Another backend is already listening on this channel. Mark it
+ * as having multiple listeners and fall back to broadcast
+ * signalling.
+ */
+ entry->has_multiple_listeners = true;
+ entry->listener = INVALID_PROC_NUMBER;
+ }
+ /* If already marked as having multiple listeners, nothing to do */
+ }
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * If the channel entry currently tracks exactly one listener and that
+ * listener matches the supplied procno, remove the entry altogether.
+ *
+ * If the channel has already been flagged as having multiple listeners,
+ * we no longer track individual backends; therefore we cannot remove a
+ * single backend without additional bookkeeping. In that situation we
+ * simply leave the entry in place (still marked as having multiple
+ * listeners) and return.
+ *
+ * Caller must hold exclusive NotifyQueueLock.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel entry */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+
+ if (!entry)
+ return; /* Channel not found */
+
+ /*
+ * If this channel has multiple listeners, we can't track individual
+ * removals. Just leave it marked as having multiple listeners.
+ */
+ if (entry->has_multiple_listeners)
+ return;
+
+ /* If this backend is the single listener, remove the channel entry */
+ if (entry->listener == procno)
+ {
+ hash_search(GetChannelHash(),
+ &key,
+ HASH_REMOVE,
+ NULL);
+ }
+}
+
+/*
+ * ChannelHashRemoveBackendFromAll
+ * Sweep the channel hash and delete any channel entries for which
+ * this backend is the only tracked listener in the current database.
+ *
+ * Caller must hold exclusive NotifyQueueLock.
+ */
+static void
+ChannelHashRemoveBackendFromAll(ProcNumber procno)
+{
+ HASH_SEQ_STATUS status;
+ ChannelEntry *entry;
+
+ hash_seq_init(&status, GetChannelHash());
+
+ while ((entry = (ChannelEntry *) hash_seq_search(&status)) != NULL)
+ {
+ if (entry->key.dboid != MyDatabaseId)
+ continue;
+
+ if (entry->has_multiple_listeners)
+ continue;
+
+ if (entry->listener == procno)
+ {
+ hash_search(GetChannelHash(),
+ &entry->key,
+ HASH_REMOVE,
+ NULL);
+ }
+ }
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if the channel is not being tracked (no listeners, or channel
+ * fell back to broadcast mode because we ran out of shared memory when trying
+ * to add entries to the hash table).
+ *
+ * Caller must hold at least shared NotifyQueueLock.
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(NotifyQueueLock));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
--
2.47.1
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-12 23:18 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-07-12 23:18 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> The attached proof-of-concept patch proposes a straightforward
> optimization for the single-listener case. It introduces a shared-memory
> hash table mapping (dboid, channelname) to the ProcNumber of a single
> listener.
What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
> The patch also includes a "wake only tail" optimization (contributed by
> Marko Tikkaja) to help prevent backends from falling too far behind.
Coulda sworn we dealt with that case some years ago. In any case,
if it's independent of the other idea it should probably get its
own thread.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-15 07:20 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-07-15 07:20 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
Good point. The previous patch would effectively force all LISTEN/UNLISTEN
to be serialized, which would at least hurt parallelizability.
New benchmarks confirm this hypothesis.
New patch attached that combines two complementary approaches, which together
seem to scale well for both common-channel and unique-channel scenarios:
1. Partitioned Hash Locking
The Channel Hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS
lightweight locks. A given channel is mapped to a partition lock using
a custom hash function on (dboid, channelname).
This allows LISTEN/UNLISTEN operations on different channels to proceed
concurrently without fighting over a single global lock, addressing the
"many distinct channels" use-case.
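As a rough illustration of mapping a channel to a partition lock, the sketch below hashes (dboid, channelname) and reduces the result modulo the partition count. The hash function used here is FNV-1a, chosen only for the example; the patch's actual hash function and the value of NUM_NOTIFY_PARTITIONS are not shown in this message, so TOY_NUM_PARTITIONS = 16 and channel_partition are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NUM_PARTITIONS 16   /* assumed value, for illustration only */

/* 32-bit FNV-1a over a byte range, continuing from hash state h. */
static uint32_t
fnv1a(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *) data;

    for (size_t i = 0; i < len; i++)
    {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Map (dboid, channel) to the index of the partition lock guarding it. */
static unsigned
channel_partition(uint32_t dboid, const char *channel)
{
    uint32_t h = 2166136261u;   /* FNV offset basis */

    h = fnv1a(h, &dboid, sizeof(dboid));
    h = fnv1a(h, channel, strlen(channel));
    return h % TOY_NUM_PARTITIONS;
}
```

Two backends running LISTEN on channels that land in different partitions would take different locks and so would not block each other; channels that collide in one partition still serialize.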
2. Optimistic Read-Locking
For the "many backends on one channel" use-case, lock acquisition now follows
a read-then-upgrade pattern. We first acquire an LW_SHARED lock to check the
channel's state. If the channel is already marked as has_multiple_listeners,
we can return immediately without any need for a write.
Only if we are the first or second listener on a channel do we release
the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry.
After getting the exclusive lock, we re-verify the state to guard against
race conditions. This avoids serializing the third and all subsequent
listeners for a popular channel.
BENCHMARK
https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_ove...
https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_ove...
I didn't want to attach the images to this email because they are quite large,
given the amount of detail they contain.
However, since it's important this mailing list contains all relevant data discussed,
I've also included all data in the graphs formatted in ASCII/Markdown:
performance_overview.md
I've also included the raw parsed data from the pgbench output,
which has been used as input to create performance_overview.md
as well as the images:
pgbench_results_combined.csv
I've benchmarked five times per measurement, in random order.
All raw measurements have been included in the Markdown document
within { curly braces } sorted, next to the average values, to get an idea
of the variance. Stddev felt potentially misleading, since I'm not sure
benchmark data points are normally distributed.
I've run the benchmarks on my MacBook Pro Apple M3 Max,
using `caffeinate -dims pgbench ...`.
>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago. In any case,
> if it's independent of the other idea it should probably get its
> own thread.
Maybe it's been dealt with by some other part of the system, but I can't
find any such code anywhere; only async.c currently sends
PROCSIG_NOTIFY_INTERRUPT.
The wake only tail mechanism seems almost perfect, but I can think of at least
one edge-case where we could still get a problem situation:
With lots of idle backends, the rate of this one-by-one catch-up may not be fast
enough to outpace the queue's advancement, causing other idle backends
to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.
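To make this edge case more concrete, here is a toy model, not a measurement. It assumes, as a simplification, that every NOTIFY appends exactly one queue entry and wakes exactly one idle backend (the one furthest behind), which then catches up completely. The function demo_max_lag is hypothetical and ignores page boundaries, payload sizes and timing.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model: return the largest lag, counted in queue entries, that any
 * idle backend reaches when each NOTIFY wakes only the furthest-behind one.
 */
static long
demo_max_lag(int idle_backends, long notifies)
{
	long	   *pos;
	long		head = 0;
	long		max_lag = 0;

	assert(idle_backends > 0);
	pos = calloc((size_t) idle_backends, sizeof(long));
	assert(pos != NULL);

	for (long n = 0; n < notifies; n++)
	{
		int			tail = 0;

		head++;					/* one NOTIFY appends one entry */

		/* Find the backend that is furthest behind. */
		for (int i = 1; i < idle_backends; i++)
		{
			if (pos[i] < pos[tail])
				tail = i;
		}

		/* Record how far behind it was just before being woken. */
		if (head - pos[tail] > max_lag)
			max_lag = head - pos[tail];

		pos[tail] = head;		/* the woken backend catches up fully */
	}

	free(pos);
	return max_lag;
}
```

Under these assumptions the worst-case lag equals the number of idle backends once at least that many notifications have been sent, so it grows linearly with the idle-listener count. Whether that crosses the QUEUE_CLEANUP_DELAY threshold depends on how many entries fit on a queue page, which this model leaves out.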
To ensure all backends are eventually processed without re-introducing
the thundering herd problem, an additional mechanism seems necessary.
I see two main options:
1. Extend the chain reaction
Once woken, a backend could signal the next backend at the queue tail,
propagating the catch-up process. This would need to be managed carefully,
perhaps with some kind of global advisory lock, to prevent multiple
cascades from running at once.
2. Centralize the work
We already have the autovacuum daemon; maybe it could also be made responsible
for kicking lagging backends?
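One possible shape for option 1 is sketched below, purely as an assumption about how "only one cascade at a time" could be enforced. It uses a C11 atomic_flag as the guard instead of any PostgreSQL lock, models the queue as a small array, and all names (DemoQueue, demo_start_cascade, demo_continue_cascade) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_BACKENDS 4

/* Hypothetical miniature of the shared queue state. */
typedef struct DemoQueue
{
	long		head;					/* current queue head */
	long		pos[DEMO_BACKENDS];		/* per-backend read position */
	atomic_flag cascade_running;		/* set while a cascade is active */
} DemoQueue;

static void
demo_queue_init(DemoQueue *q, long head, const long *pos)
{
	q->head = head;
	for (int i = 0; i < DEMO_BACKENDS; i++)
		q->pos[i] = pos[i];
	atomic_flag_clear(&q->cascade_running);
}

/* Return the backend furthest behind the head, or -1 if none lags. */
static int
demo_find_laggard(const DemoQueue *q)
{
	int			laggard = -1;

	for (int i = 0; i < DEMO_BACKENDS; i++)
	{
		if (q->pos[i] >= q->head)
			continue;
		if (laggard < 0 || q->pos[i] < q->pos[laggard])
			laggard = i;
	}
	return laggard;
}

/*
 * Called by a notifier.  Returns the backend to signal, or -1 if a
 * cascade is already running or nobody is behind.
 */
static int
demo_start_cascade(DemoQueue *q)
{
	int			next;

	if (atomic_flag_test_and_set(&q->cascade_running))
		return -1;				/* another cascade owns the guard */

	next = demo_find_laggard(q);
	if (next < 0)
		atomic_flag_clear(&q->cascade_running);
	return next;
}

/*
 * Called by a woken backend once it has caught up.  Returns the next
 * backend to signal, or -1 when the cascade is finished.
 */
static int
demo_continue_cascade(DemoQueue *q, int self)
{
	int			next;

	q->pos[self] = q->head;
	next = demo_find_laggard(q);
	if (next < 0)
		atomic_flag_clear(&q->cascade_running);
	return next;
}
```

The guard is handed from one woken backend to the next and is only cleared when no laggard remains, so a second notifier arriving mid-cascade does not start a parallel one. A real implementation would also have to handle a backend exiting while it holds the guard, which this sketch does not attempt.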
Other ideas?
/Joel
Attached:
* pgbench-scripts.tar.gz
pgbench scripts to reproduce the results, report and images.
* performance_overview.md
Same results as in the images, but in ASCII/Markdown format.
* pgbench_results_combined.csv
Parsed output from pgbench runs, used to create performance_overview.md as well as the linked images.
* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
Old patch just renamed to -v2
* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
New patch with the approach explained above.
Attachments:
[text/csv] pgbench_results_combined.csv (122.0K, 2-pgbench_results_combined.csv)
[application/x-gzip] pgbench-scripts.tar.gz (8.6K, 3-pgbench-scripts.tar.gz)
[application/octet-stream] performance_overview.md (21.0K, 4-performance_overview.md)
[application/octet-stream] 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch (24.2K, 5-0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch)
download | inline diff:
From aba0ffb2a9e1c5d77393a92c0ce43a968c23cbb5 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 15 Jun 2025 00:09:43 +0200
Subject: [PATCH 1/2] Optimize LISTEN/NOTIFY signaling for single-listener
channels
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the implementation would signal every backend process that was
listening on any channel in the same database. This signaling is performed via
SendProcSignal(), which ultimately issues a kill(pid, SIGUSR1) syscall for each
listening backend.
This broadcast approach is well-suited for use cases like cache invalidation but
limits the scalability of application patterns where backends listen on distinct
channels. For example, a system of worker processes might use unique channel
names to direct work to a specific worker. In these scenarios, a NOTIFY intended
for a single listener unnecessarily triggers a syscall for every other listening
backend.
This commit improves scalability for such workloads by optimizing for
the single-listener case. By making this pattern more performant, we enable it
to be used more effectively in high-throughput systems, pushing PostgreSQL's
scalability limits for this class of applications. A new shared memory hash
table is introduced to track which backend process is listening on each channel.
When a NOTIFY is issued, if a channel has exactly one registered listener, we
can signal that specific backend directly.
The system gracefully falls back to broadcast behavior under two conditions:
1. When a channel has multiple backends listening to it.
2. If the shared hash table runs out of memory and cannot create a new entry.
To support this, the LISTEN and UNLISTEN commands, as well as the backend exit
cleanup logic in asyncQueueUnregister, are updated to manage entries in the new
channel hash table. The main signaling logic in SignalBackends has been reworked
to implement the targeted-vs-broadcast decision.
To ensure the global queue tail can always advance, this change also includes a
"wake only tail" optimization, contributed by Marko Tikkaja (johto). Instead
of waking all backends that are lagging far behind, this logic specifically
signals only the backend that is currently at the queue tail. This targeted
wake-up prevents a "thundering herd" of signals and relies on a chain
reaction—where each backend wakes the next—to process the queue efficiently.
This mechanism works in conjunction with both the new targeted signaling and
the broadcast fallback.
CAVEAT: This patch should be considered a first-step, proof-of-concept
optimization. It uses a simple boolean flag to distinguish single-listener
channels from multi-listener ones and does not track the full list of backends
for a multi-listener channel. As a result, it cannot remove a hash entry for
a channel once it has been marked as having multiple listeners, causing such
entries to persist even after all listeners have departed. A more complete
solution would likely involve reference counting to track all listening backends
for each channel. This would not only prevent stuck hash entries but could also
enable targeted signaling to all listeners of a specific multi-user channel,
further refining the optimization and avoiding the fallback to a full
database-wide broadcast.
---
src/backend/commands/async.c | 572 ++++++++++++++++++++++++++++++++---
1 file changed, 537 insertions(+), 35 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..a0b7daaef7d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,11 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks channels with single listeners.
+ * When a channel has exactly one listening backend, we can signal just that
+ * backend. For channels with multiple listeners, we signal all listening
+ * backends.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +74,16 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which has two modes of operation, depending on
+ * if any of our channels have multiple listening backends or not:
+ * a) If there are multiple listening backends, a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to every listening backend.
+ * b) Otherwise, such signals are only sent to each single listening backend
+ * per channel.
+ * Additionally, we use a "wake only tail" optimization: we always signal
+ * the backend furthest behind in the queue to help prevent backends from
+ * getting far behind and create a chain reaction of wake-ups.
+ * We can exclude backends that are already up to date, though.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -146,6 +152,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +169,58 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backend is
+ * listening on each channel. Channels are identified by database OID and
+ * channel name, making them database-specific.
+ *
+ * When exactly one backend listens on a channel, we signal that specific
+ * backend, avoiding unnecessary signals to all listening backends.
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) Multiple backends listen on the same channel, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and a
+ * single backend ProcNumber that is listening on that channel. If multiple
+ * backends try to listen on the same channel, we mark it as having multiple
+ * listeners and fall back to broadcast behavior.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ ProcNumber listener; /* single backend ID, or INVALID_PROC_NUMBER
+ * if multiple */
+ bool has_multiple_listeners;
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -293,6 +352,39 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+/* Channel hash table for single listening backend signalling */
+static HTAB *channelHash = NULL;
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ /* Set up to attach to the existing shared hash table */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -458,6 +550,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveBackendFromAll(ProcNumber procno);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -492,6 +592,9 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ sizeof(ChannelEntry)));
+
return size;
}
@@ -546,6 +649,23 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS);
+ }
}
@@ -1043,6 +1163,7 @@ Exec_ListenPreCommit(void)
QueuePosition head;
QueuePosition max;
ProcNumber prevListener;
+ ListCell *p;
/*
* Nothing to do if we are already listening to something, nor if we
@@ -1110,6 +1231,18 @@ Exec_ListenPreCommit(void)
QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
QUEUE_FIRST_LISTENER = MyProcNumber;
}
+
+ /*
+ * Add all our channels to the channel hash table while we still hold
+ * exclusive lock on NotifyQueueLock.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashAddListener(channel, MyProcNumber);
+ }
+
LWLockRelease(NotifyQueueLock);
/* Now we are listed in the global array, so remember we're listening */
@@ -1152,6 +1285,10 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ ChannelHashAddListener(channel, MyProcNumber);
+ LWLockRelease(NotifyQueueLock);
}
/*
@@ -1175,6 +1312,10 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ LWLockRelease(NotifyQueueLock);
break;
}
}
@@ -1239,6 +1380,9 @@ asyncQueueUnregister(void)
* Need exclusive lock here to manipulate list links.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ ChannelHashRemoveBackendFromAll(MyProcNumber);
+
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
@@ -1565,12 +1709,18 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Selective mode: When all pending notification channels have exactly one
+ * listener each, we signal only those specific backends that are listening
+ * on the channels with pending notifications.
+ * 2. Broadcast mode: When any channel has multiple listeners (or we ran out
+ * of shared memory for the channel hash table), we signal all listening
+ * backends in our database.
+ *
+ * In addition to the channel-specific signaling, we also implement a "wake
+ * only tail" optimization: we signal the backend that is furthest behind
+ * in the queue to help prevent backends from getting far behind and create
+ * a chain reaction of wake-ups. This avoids thundering herd problems.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1733,11 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool tail_woken = false;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1749,159 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * Check if any channel has multiple listeners, in which case we would
+ * need to signal all backends anyway.
+ */
+ foreach(p, channels)
+ {
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry = ChannelHashLookup(channel);
+
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table, so we need to
+ * broadcast in that case as well.
+ */
+ if (!entry || entry->has_multiple_listeners)
+ {
+ broadcast_mode = true;
+ break;
+ }
+ }
+
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Always signal listeners in our own database, unless they're
+ * already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+ else
+ {
+ /*
+ * Signal specific listening backends
+ */
+ foreach(p, channels)
+ {
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry = ChannelHashLookup(channel);
+
+ ProcNumber i = entry->listener;
+ int32 pid;
+ QueuePosition pos;
+
+ Assert(entry && !entry->has_multiple_listeners);
+
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Skip signaling listeners if they already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+
+ /*
+ * Also check for any backends that are far behind. This ensures the
+ * global tail can advance even if they're not actively receiving
+ * notifications on their channels.
+ */
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
- int32 pid = QUEUE_BACKEND_PID(i);
+ int32 pid;
QueuePosition pos;
- Assert(pid != InvalidPid);
+ /*
+ * Skip if we've already decided to signal this one.
+ */
+ if (signaled[i])
+ continue;
+
pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
- {
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
- }
+
+ /*
+ * Skip signaling listeners if they already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ /*
+ * Wake only tail optimization: Signal the backend that is furthest
+ * behind to help prevent backends from getting far behind in the
+ * first place. This creates a chain reaction where each backend
+ * eventually wakes up the next one as notifications are processed,
+ * avoiding thundering herd.
+ *
+ * Otherwise, only skip signaling listeners if they are not far
+ * behind.
+ */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ tail_woken = true;
else
- {
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
- }
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
/* OK, need to signal this one */
pids[count] = pid;
procnos[count] = i;
count++;
+
+
}
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1657,6 +1931,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -2395,3 +2670,230 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel
+ * in the shared channel hash table.
+ *
+ * Caller must hold exclusive NotifyQueueLock.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up or create the channel entry */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_ENTER_NULL,
+ &found);
+
+ /*
+ * If hash_search returned NULL, we've run out of shared memory to
+ * allocate new hash entries. We gracefully degrade by not tracking this
+ * channel in the hash. The channel will use the fallback broadcast
+ * signalling.
+ */
+ if (entry == NULL)
+ {
+ ereport(DEBUG1,
+ (errmsg("too many notification channels are already being tracked")));
+ return;
+ }
+
+ if (!found)
+ {
+ /* New channel, initialize the entry */
+ memcpy(&entry->key, &key, sizeof(ChannelHashKey));
+ entry->listener = procno;
+ entry->has_multiple_listeners = false;
+ }
+ else
+ {
+ /* Channel already exists */
+ if (!entry->has_multiple_listeners)
+ {
+ if (entry->listener == procno)
+ return; /* Already listening */
+
+ /*
+ * Another backend is already listening on this channel. Mark it
+ * as having multiple listeners and fall back to broadcast
+ * signalling.
+ */
+ entry->has_multiple_listeners = true;
+ entry->listener = INVALID_PROC_NUMBER;
+ }
+ /* If already marked as having multiple listeners, nothing to do */
+ }
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * If the channel entry currently tracks exactly one listener and that
+ * listener matches the supplied procno, remove the entry altogether.
+ *
+ * If the channel has already been flagged as having multiple listeners,
+ * we no longer track individual backends; therefore we cannot remove a
+ * single backend without additional bookkeeping. In that situation we
+ * simply leave the entry in place (still marked as having multiple
+ * listeners) and return.
+ *
+ * Caller must hold exclusive NotifyQueueLock.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel entry */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+
+ if (!entry)
+ return; /* Channel not found */
+
+ /*
+ * If this channel has multiple listeners, we can't track individual
+ * removals. Just leave it marked as having multiple listeners.
+ */
+ if (entry->has_multiple_listeners)
+ return;
+
+ /* If this backend is the single listener, remove the channel entry */
+ if (entry->listener == procno)
+ {
+ hash_search(GetChannelHash(),
+ &key,
+ HASH_REMOVE,
+ NULL);
+ }
+}
+
+/*
+ * ChannelHashRemoveBackendFromAll
+ * Sweep the channel hash and delete any channel entries for which
+ * this backend is the only tracked listener in the current database.
+ *
+ * Caller must hold exclusive NotifyQueueLock.
+ */
+static void
+ChannelHashRemoveBackendFromAll(ProcNumber procno)
+{
+ HASH_SEQ_STATUS status;
+ ChannelEntry *entry;
+
+ hash_seq_init(&status, GetChannelHash());
+
+ while ((entry = (ChannelEntry *) hash_seq_search(&status)) != NULL)
+ {
+ if (entry->key.dboid != MyDatabaseId)
+ continue;
+
+ if (entry->has_multiple_listeners)
+ continue;
+
+ if (entry->listener == procno)
+ {
+ hash_search(GetChannelHash(),
+ &entry->key,
+ HASH_REMOVE,
+ NULL);
+ }
+ }
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if the channel is not being tracked (no listeners, or channel
+ * fell back to broadcast mode because we ran out of shared memory when trying
+ * to add entries to the hash table).
+ *
+ * Caller must hold at least shared NotifyQueueLock.
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(NotifyQueueLock));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
--
2.47.1
[application/octet-stream] 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch (22.9K, 6-0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch)
download | inline diff:
From 61ab3b3a834192b0468d10ca5fe3824b1fec6065 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 13 Jul 2025 14:39:12 +0200
Subject: [PATCH 2/2] Partition channel hash to improve LISTEN/UNLISTEN
The previous commit introduced a shared hash table to optimize NOTIFY for
single-listener channels. However, all modifications to this hash table
were serialized by the global NotifyQueueLock, creating a new contention
point for concurrent LISTEN and UNLISTEN operations. This commit
removes that bottleneck by partitioning the hash table's locking.
The single NotifyQueueLock is replaced by an array of
NUM_NOTIFY_PARTITIONS lightweight locks. A custom hash function, which
mixes the dboid and channel name, is used to map a channel to a
specific partition lock. This allows operations on different channels to
proceed in parallel, as they will contend on different locks.
Furthermore, to handle high-concurrency workloads where many backends
LISTEN on the same channel, the lock acquisition logic is optimized
using a read-then-upgrade pattern:
1. A LW_SHARED lock is taken first to check the channel's state. If no
write is needed (e.g., the channel is already marked as multi-listener),
the function can return immediately. This is the fast path for the
third and all subsequent listeners on a popular channel.
2. Only if a mutation is required is the shared lock released and a
LW_EXCLUSIVE lock acquired. After acquiring the exclusive lock, the
state is re-verified to guard against race conditions before the write
is performed.
This optimistic pattern is applied to both adding and removing listeners,
ensuring that both the "many distinct channels" and "many backends on
one channel" use-cases are highly scalable.
The SignalBackends logic is also updated to follow a strict lock
ordering hierarchy (global NotifyQueueLock before any partition lock) to
prevent deadlocks when checking the hash table.
Finally, the backend exit logic in Exec_UnlistenAllCommit is refined to
iterate over the backend's local listenChannels list, performing
targeted, per-partition removals instead of a more expensive full table scan.
With these changes, the LISTEN/UNLISTEN path is no longer serialized
by a single global lock, directly addressing the scalability concerns of
the previous implementation.
---
src/backend/commands/async.c | 398 +++++++++++++++++++++--------------
1 file changed, 241 insertions(+), 157 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index a0b7daaef7d..f81a30b53e2 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -134,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -169,6 +170,12 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Number of partitions for the channel hash table's locks.
+ * This must be a power of two.
+ */
+#define NUM_NOTIFY_PARTITIONS 128
+
/*
* Channel hash table definitions
*
@@ -176,6 +183,10 @@
* listening on each channel. Channels are identified by database OID and
* channel name, making them database-specific.
*
+ * To improve scalability of concurrent LISTEN/UNLISTEN operations, the hash
+ * table is partitioned, with each partition protected by its own LWLock. This
+ * avoids serializing all operations on a single global lock.
+ *
* When exactly one backend listens on a channel, we signal that specific
* backend, avoiding unnecessary signals to all listening backends.
*
@@ -328,6 +339,11 @@ typedef struct QueueBackendStatus
* In order to avoid deadlocks, whenever we need multiple locks, we first get
* NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
+ * The channel hash table is protected by a separate set of partitioned
+ * locks. To prevent deadlocks between these and NotifyQueueLock, the global
+ * lock-ordering rule is: always acquire NotifyQueueLock *before* acquiring
+ * any channel hash partition lock.
+ *
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
*
@@ -352,9 +368,16 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+/* Locks for partitioned channel hash table */
+static LWLock *channelHashLocks;
+static int channelHashTrancheId = 0;
+
/* Channel hash table for single listening backend signalling */
static HTAB *channelHash = NULL;
+/* Forward declaration needed by GetChannelHash */
+static uint32 channel_hash_func(const void *key, Size keysize);
+
/*
* GetChannelHash
* Get the channel hash table, initializing our backend's pointer if needed.
@@ -370,16 +393,21 @@ GetChannelHash(void)
{
HASHCTL hash_ctl;
- /* Set up to attach to the existing shared hash table */
+ /*
+ * Set up to attach to the existing shared hash table. The hash
+ * control parameters must match those used in AsyncShmemInit.
+ */
MemSet(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(ChannelHashKey);
hash_ctl.entrysize = sizeof(ChannelEntry);
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
channelHash = ShmemInitHash("Channel Hash",
CHANNEL_HASH_INIT_SIZE,
CHANNEL_HASH_MAX_SIZE,
&hash_ctl,
- HASH_ELEM | HASH_BLOBS);
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
}
return channelHash;
@@ -551,10 +579,10 @@ static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
/* Channel hash table management functions */
+static LWLock *GetChannelHashLock(const char *channel);
static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
static void ChannelHashAddListener(const char *channel, ProcNumber procno);
static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
-static void ChannelHashRemoveBackendFromAll(ProcNumber procno);
static ChannelEntry * ChannelHashLookup(const char *channel);
static List *GetPendingNotifyChannels(void);
@@ -595,6 +623,8 @@ AsyncShmemSize(void)
size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
sizeof(ChannelEntry)));
+ size = add_size(size, mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+
return size;
}
@@ -659,12 +689,26 @@ AsyncShmemInit(void)
MemSet(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(ChannelHashKey);
hash_ctl.entrysize = sizeof(ChannelEntry);
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
channelHash = ShmemInitHash("Channel Hash",
CHANNEL_HASH_INIT_SIZE,
CHANNEL_HASH_MAX_SIZE,
&hash_ctl,
- HASH_ELEM | HASH_BLOBS);
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ /* Initialize locks for the partitioned hash table */
+ channelHashLocks = (LWLock *) ShmemAlloc(mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+ if (!found)
+ {
+ channelHashTrancheId = LWLockNewTrancheId();
+ LWLockRegisterTranche(channelHashTrancheId, "ChannelHashPartition");
+ }
+ for (int i = 0; i < NUM_NOTIFY_PARTITIONS; i++)
+ {
+ LWLockInitialize(&channelHashLocks[i], channelHashTrancheId);
}
}
@@ -1163,7 +1207,6 @@ Exec_ListenPreCommit(void)
QueuePosition head;
QueuePosition max;
ProcNumber prevListener;
- ListCell *p;
/*
* Nothing to do if we are already listening to something, nor if we
@@ -1232,17 +1275,6 @@ Exec_ListenPreCommit(void)
QUEUE_FIRST_LISTENER = MyProcNumber;
}
- /*
- * Add all our channels to the channel hash table while we still hold
- * exclusive lock on NotifyQueueLock.
- */
- foreach(p, listenChannels)
- {
- char *channel = (char *) lfirst(p);
-
- ChannelHashAddListener(channel, MyProcNumber);
- }
-
LWLockRelease(NotifyQueueLock);
/* Now we are listed in the global array, so remember we're listening */
@@ -1286,9 +1318,7 @@ Exec_ListenCommit(const char *channel)
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
- LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
ChannelHashAddListener(channel, MyProcNumber);
- LWLockRelease(NotifyQueueLock);
}
/*
@@ -1312,10 +1342,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
-
- LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
ChannelHashRemoveListener(channel, MyProcNumber);
- LWLockRelease(NotifyQueueLock);
break;
}
}
@@ -1334,9 +1361,22 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /*
+ * Before freeing the local list, iterate through it and perform a
+ * targeted removal for each of our channels from the shared hash table.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1381,8 +1421,6 @@ asyncQueueUnregister(void)
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- ChannelHashRemoveBackendFromAll(MyProcNumber);
-
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
@@ -1755,16 +1793,26 @@ SignalBackends(void)
/* Get list of channels that have pending notifications */
channels = GetPendingNotifyChannels();
+ /*
+ * To prevent deadlocks, we must always acquire locks in the same order:
+ * global NotifyQueueLock first, then individual partition locks.
+ */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
/*
- * Check if any channel has multiple listeners, in which case we would
- * need to signal all backends anyway.
+ * Determine if we can use targeted signaling or must broadcast. This
+ * check must be done while holding NotifyQueueLock to prevent deadlocks
+ * against other backends that might be modifying the listener list and
+ * hash table simultaneously (e.g., asyncQueueUnregister).
*/
foreach(p, channels)
{
char *channel = (char *) lfirst(p);
- ChannelEntry *entry = ChannelHashLookup(channel);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
/*
* If there is no entry, it could mean we ran out of shared memory
@@ -1774,8 +1822,10 @@ SignalBackends(void)
if (!entry || entry->has_multiple_listeners)
{
broadcast_mode = true;
+ LWLockRelease(lock);
break;
}
+ LWLockRelease(lock);
}
if (broadcast_mode)
@@ -1814,41 +1864,53 @@ SignalBackends(void)
else
{
/*
- * Signal specific listening backends
+ * In targeted mode, signal specific listening backends. We must
+ * re-check the hash entries here inside the lock to avoid races.
*/
foreach(p, channels)
{
char *channel = (char *) lfirst(p);
- ChannelEntry *entry = ChannelHashLookup(channel);
-
- ProcNumber i = entry->listener;
- int32 pid;
- QueuePosition pos;
-
- Assert(entry && !entry->has_multiple_listeners);
-
- if (signaled[i])
- continue;
-
- pos = QUEUE_BACKEND_POS(i);
-
- /*
- * Skip signaling listeners if they already caught up.
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
-
- if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
- continue;
-
- pid = QUEUE_BACKEND_PID(i);
- Assert(pid != InvalidPid);
-
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- signaled[i] = true;
- count++;
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ if (entry && !entry->has_multiple_listeners)
+ {
+ ProcNumber i = entry->listener;
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i])
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ LWLockRelease(lock);
}
}
@@ -1879,12 +1941,10 @@ SignalBackends(void)
/*
* Wake only tail optimization: Signal the backend that is furthest
* behind to help prevent backends from getting far behind in the
- * first place. This creates a chain reaction where each backend
- * eventually wakes up the next one as notifications are processed,
- * avoiding thundering herd.
- *
- * Otherwise, only skip signaling listeners if they are not far
- * behind.
+ * first place. This finds the backend(s) on the same page as the
+ * global tail, which are the ones holding up truncation. This creates
+ * a chain reaction where each backend eventually wakes up the next one
+ * as notifications are processed, avoiding thundering herd.
*/
if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
QUEUE_POS_PAGE(pos)) == 0)
@@ -1898,8 +1958,6 @@ SignalBackends(void)
pids[count] = pid;
procnos[count] = i;
count++;
-
-
}
LWLockRelease(NotifyQueueLock);
@@ -1921,9 +1979,9 @@ SignalBackends(void)
/*
* Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * only occur if the target backend exited since we released the lock;
+ * which is unlikely but certainly possible. So we just log a
+ * low-level debug message if it happens.
*/
if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
@@ -2675,6 +2733,47 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
* Channel hash table management functions
*/
+/*
+ * channel_hash_func
+ * Custom hash function for the channel hash table. This function ensures
+ * that the low-order bits of the hash are well-distributed, which is
+ * critical for partitioned hash tables.
+ */
+static uint32
+channel_hash_func(const void *key, Size keysize)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ uint32 h;
+
+ /*
+ * Mix the dboid and the channel name to produce a good hash. hash_any()
+ * is a high-quality portable hash function. This prevents channels with
+ * the same name in different databases from always mapping to the same
+ * partition.
+ */
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * GetChannelHashLock
+ * Return the LWLock that protects the partition for the given channel name.
+ */
+static LWLock *
+GetChannelHashLock(const char *channel)
+{
+ ChannelHashKey key;
+ uint32 hash;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ hash = get_hash_value(GetChannelHash(), &key);
+
+ return &channelHashLocks[hash % NUM_NOTIFY_PARTITIONS];
+}
+
/*
* ChannelHashPrepareKey
* Prepare a channel key (database OID + channel name) for use as a hash key.
@@ -2689,10 +2788,22 @@ ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
/*
* ChannelHashAddListener
- * Register the given backend as a listener for the specified channel
- * in the shared channel hash table.
+ * Register the given backend as a listener for the specified channel.
*
- * Caller must hold exclusive NotifyQueueLock.
+ * This function uses an optimistic read-locking strategy to maximize
+ * concurrency when many backends listen on the same channel.
+ *
+ * 1. It first takes a shared lock and checks the channel's state. If the
+ * channel is already marked as having multiple listeners, no write is
+ * needed, and we can return immediately. This is the fast path for the
+ * 3rd, 4th, etc., listener on a given channel.
+ *
+ * 2. If a write is needed (either to create the entry or to mark it as
+ * multi-listener), it releases the shared lock and acquires an exclusive
+ * lock.
+ *
+ * 3. CRUCIALLY, after acquiring the exclusive lock, it must re-check the
+ * state, as another backend may have modified the entry in the interim.
*/
static void
ChannelHashAddListener(const char *channel, ProcNumber procno)
@@ -2700,135 +2811,108 @@ ChannelHashAddListener(const char *channel, ProcNumber procno)
ChannelEntry *entry;
bool found;
ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
ChannelHashPrepareKey(&key, MyDatabaseId, channel);
- /* Look up or create the channel entry */
- entry = (ChannelEntry *) hash_search(GetChannelHash(),
- &key,
- HASH_ENTER_NULL,
- &found);
+ /*
+ * FAST PATH: Optimistically take a shared lock. If the channel already
+ * has multiple listeners, we don't need to do anything.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && entry->has_multiple_listeners)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /*
+ * SLOW PATH: We need to write. Acquire exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
/*
- * If hash_search returned NULL, we've run out of shared memory to
- * allocate new hash entries. We gracefully degrade by not tracking this
- * channel in the hash. The channel will use the fallback broadcast
- * signalling.
+ * Re-check state after acquiring exclusive lock, as it may have changed.
*/
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_ENTER_NULL, &found);
+
if (entry == NULL)
{
- ereport(DEBUG1,
- (errmsg("too many notification channels are already being tracked")));
+ /* Out of memory in the hash partition. */
+ ereport(DEBUG1, (errmsg("too many notification channels are already being tracked")));
+ LWLockRelease(lock);
return;
}
if (!found)
{
- /* New channel, initialize the entry */
- memcpy(&entry->key, &key, sizeof(ChannelHashKey));
+ /* We are the first listener. */
entry->listener = procno;
entry->has_multiple_listeners = false;
}
- else
+ else if (!entry->has_multiple_listeners)
{
- /* Channel already exists */
- if (!entry->has_multiple_listeners)
+ /* We are the second listener. */
+ if (entry->listener != procno)
{
- if (entry->listener == procno)
- return; /* Already listening */
-
- /*
- * Another backend is already listening on this channel. Mark it
- * as having multiple listeners and fall back to broadcast
- * signalling.
- */
entry->has_multiple_listeners = true;
entry->listener = INVALID_PROC_NUMBER;
}
- /* If already marked as having multiple listeners, nothing to do */
}
+ /* If entry->has_multiple_listeners is now true, do nothing. */
+ LWLockRelease(lock);
}
/*
* ChannelHashRemoveListener
* Update the channel hash when a backend stops listening on a channel.
*
- * If the channel entry currently tracks exactly one listener and that
- * listener matches the supplied procno, remove the entry altogether.
- *
- * If the channel has already been flagged as having multiple listeners,
- * we no longer track individual backends; therefore we cannot remove a
- * single backend without additional bookkeeping. In that situation we
- * simply leave the entry in place (still marked as having multiple
- * listeners) and return.
- *
- * Caller must hold exclusive NotifyQueueLock.
+ * This function uses an optimistic read-lock strategy to maximize concurrency.
+ * An exclusive lock is only taken if we are the sole listener on a channel
+ * and need to remove the entry from the hash table.
*/
static void
ChannelHashRemoveListener(const char *channel, ProcNumber procno)
{
ChannelEntry *entry;
ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
ChannelHashPrepareKey(&key, MyDatabaseId, channel);
- /* Look up the channel entry */
- entry = (ChannelEntry *) hash_search(GetChannelHash(),
- &key,
- HASH_FIND,
- NULL);
-
- if (!entry)
- return; /* Channel not found */
-
/*
- * If this channel has multiple listeners, we can't track individual
- * removals. Just leave it marked as having multiple listeners.
+ * Take a shared lock first to see if a removal is even necessary. If the
+ * entry doesn't exist, or it's a multi-listener entry, we have nothing to
+ * do. This is the fast path.
*/
- if (entry->has_multiple_listeners)
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (!entry || entry->has_multiple_listeners || entry->listener != procno)
+ {
+ LWLockRelease(lock);
return;
-
- /* If this backend is the single listener, remove the channel entry */
- if (entry->listener == procno)
- {
- hash_search(GetChannelHash(),
- &key,
- HASH_REMOVE,
- NULL);
}
-}
-
-/*
- * ChannelHashRemoveBackendFromAll
- * Sweep the channel hash and delete any channel entries for which
- * this backend is the only tracked listener in the current database.
- *
- * Caller must hold exclusive NotifyQueueLock.
- */
-static void
-ChannelHashRemoveBackendFromAll(ProcNumber procno)
-{
- HASH_SEQ_STATUS status;
- ChannelEntry *entry;
+ LWLockRelease(lock);
- hash_seq_init(&status, GetChannelHash());
+ /*
+ * A removal is likely needed. Acquire an exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
- while ((entry = (ChannelEntry *) hash_seq_search(&status)) != NULL)
+ /*
+ * Re-check the state, as another backend might have changed it. The only
+ * state change we care about is if it became a multi-listener channel, in
+ * which case we should no longer remove it.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && !entry->has_multiple_listeners && entry->listener == procno)
{
- if (entry->key.dboid != MyDatabaseId)
- continue;
-
- if (entry->has_multiple_listeners)
- continue;
-
- if (entry->listener == procno)
- {
- hash_search(GetChannelHash(),
- &entry->key,
- HASH_REMOVE,
- NULL);
- }
+ /* Still a single-listener entry for us, so remove it. */
+ (void) hash_search(GetChannelHash(), &key, HASH_REMOVE, NULL);
}
+ LWLockRelease(lock);
}
/*
@@ -2840,14 +2924,14 @@ ChannelHashRemoveBackendFromAll(ProcNumber procno)
* fell back to broadcast mode because we ran out of shared memory when trying
* to add entries to the hash table).
*
- * Caller must hold at least shared NotifyQueueLock.
+ * Caller must hold the appropriate partition lock (shared is sufficient).
*/
static ChannelEntry *
ChannelHashLookup(const char *channel)
{
ChannelHashKey key;
- Assert(LWLockHeldByMe(NotifyQueueLock));
+ Assert(LWLockHeldByMe(GetChannelHashLock(channel)));
ChannelHashPrepareKey(&key, MyDatabaseId, channel);
--
2.47.1
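
The listener bookkeeping done by ChannelHashAddListener and
ChannelHashRemoveListener in the patch above comes down to a small state
machine per channel entry. The following is a minimal sketch of that state
machine outside PostgreSQL. Locking and the hash table are left out on
purpose, and ToyEntry, toy_add_listener, toy_remove_listener and
toy_can_target are hypothetical names that do not appear in the patch.

```c
#include <stdbool.h>

#define TOY_INVALID_PROC (-1)	/* stand-in for INVALID_PROC_NUMBER */

/* Hypothetical stand-in for ChannelEntry; 'present' models hash membership. */
typedef struct ToyEntry
{
	bool		present;
	int			listener;		/* single listener, or TOY_INVALID_PROC */
	bool		has_multiple_listeners;
} ToyEntry;

/* Mirrors the decisions taken in the exclusive-lock path of the patch. */
static void
toy_add_listener(ToyEntry *e, int procno)
{
	if (!e->present)
	{
		/* First listener: create a single-listener entry. */
		e->present = true;
		e->listener = procno;
		e->has_multiple_listeners = false;
	}
	else if (!e->has_multiple_listeners && e->listener != procno)
	{
		/* Second distinct listener: stop tracking individual backends. */
		e->has_multiple_listeners = true;
		e->listener = TOY_INVALID_PROC;
	}
	/* Otherwise: already listening, or already flagged as multi-listener. */
}

static void
toy_remove_listener(ToyEntry *e, int procno)
{
	/* Only the sole tracked listener may remove the entry. */
	if (e->present && !e->has_multiple_listeners && e->listener == procno)
		e->present = false;
}

/* True if a NOTIFY on this channel could signal exactly one backend. */
static bool
toy_can_target(const ToyEntry *e)
{
	return e->present && !e->has_multiple_listeners;
}
```

One consequence visible in this sketch, and as far as can be told from the
patch text shown here: once an entry has been flagged as multi-listener it
stays that way, even after every listener has unlistened, so NOTIFY on that
channel keeps using the broadcast path.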
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-15 20:56 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Joel Jacobson @ 2025-07-15 20:56 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>> "Joel Jacobson" <[email protected]> writes:
>>> The attached proof-of-concept patch proposes a straightforward
>>> optimization for the single-listener case. It introduces a shared-memory
>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>> listener.
>>
>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>
> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
> to be serialized, which would at least hurt parallelizability.
>
> New benchmarks confirm this hypothesis.
>
> New patch attached that combines two complementary approaches that together
> seem to scale well for both common-channel and unique-channel scenarios:
Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
New squashed patch attached.
/Joel
Attachments:
[application/octet-stream] 0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch (28.2K, 2-0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch)
From 18004e66974fc9d4a93e00b0183959ac306c7218 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 15 Jun 2025 00:09:43 +0200
Subject: [PATCH] Subject: Optimize LISTEN/NOTIFY signaling for scalability
Previously, the implementation would signal every backend listening on any
channel in the database for every NOTIFY. While robust, this broadcast
approach limits the scalability of application patterns that rely on
targeted notifications to distinct channels. This commit improves
scalability for such workloads by introducing an optimization for
single-listener channels.
A new shared hash table is introduced to track channels that have exactly
one listener. When a NOTIFY is issued, this table is consulted; if a
single listener is found for the target channel, only that backend is
signaled. The system gracefully falls back to the original broadcast
behavior for channels with multiple listeners or if the hash table runs
out of memory.
To avoid introducing a new contention point on a global lock, the hash
table's locking is partitioned. An array of lightweight locks protects
the hash table, with a custom hash function mapping channels to lock
partitions. This allows concurrent LISTEN/UNLISTEN operations on
different channels to proceed in parallel. For high-concurrency workloads
where many backends listen on the *same* channel, an optimistic
read-then-upgrade locking pattern is used to minimize serialization. A
strict lock ordering hierarchy (global NotifyQueueLock before any
partition lock) is observed to prevent deadlocks.
This also incorporates the "wake only tail" optimization to ensure the
global queue tail can always advance without causing a thundering herd
of signals.
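
As an illustration of the channel-to-partition mapping described above, here
is a minimal sketch outside PostgreSQL. It assumes FNV-1a as a stand-in for
hash_uint32() and hash_any(), so the hash values differ from what the patch
computes. All toy_* names and TOY_NUM_PARTITIONS are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NUM_PARTITIONS 128	/* power of two, like NUM_NOTIFY_PARTITIONS */

/* FNV-1a: an assumed stand-in for hash_any(), NOT the same function. */
static uint32_t
toy_fnv1a(const unsigned char *data, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Mix database OID and channel name, following channel_hash_func's shape. */
static uint32_t
toy_channel_hash(uint32_t dboid, const char *channel)
{
	uint32_t	h = toy_fnv1a((const unsigned char *) &dboid, sizeof(dboid));

	h ^= toy_fnv1a((const unsigned char *) channel, strlen(channel));
	return h;
}

/* Index of the partition lock that would protect this channel's entry. */
static unsigned
toy_partition_index(uint32_t dboid, const char *channel)
{
	return toy_channel_hash(dboid, channel) % TOY_NUM_PARTITIONS;
}
```

Because the database OID is mixed into the hash, a channel name shared by two
databases does not have to land in the same partition, and LISTEN/UNLISTEN on
channels that map to different partitions do not contend for the same lock.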
---
src/backend/commands/async.c | 674 +++++++++++++++++++++++++++++++++--
1 file changed, 641 insertions(+), 33 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..a5b614e1a24 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,11 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks channels with single listeners.
+ * When a channel has exactly one listening backend, we can signal just that
+ * backend. For channels with multiple listeners, we signal all listening
+ * backends.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +74,16 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which has two modes of operation, depending on
+ * whether any of our channels has multiple listening backends:
+ * a) If there are multiple listening backends, a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to every listening backend.
+ * b) Otherwise, such signals are only sent to each single listening backend
+ * per channel.
+ * Additionally, we use a "wake only tail" optimization: we always signal
+ * the backend furthest behind in the queue to help prevent backends from
+ * getting far behind, which creates a chain reaction of wake-ups.
+ * We can exclude backends that are already up to date, though.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -146,6 +153,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +170,68 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Number of partitions for the channel hash table's locks.
+ * This must be a power of two.
+ */
+#define NUM_NOTIFY_PARTITIONS 128
+
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backend is
+ * listening on each channel. Channels are identified by database OID and
+ * channel name, making them database-specific.
+ *
+ * To improve scalability of concurrent LISTEN/UNLISTEN operations, the hash
+ * table is partitioned, with each partition protected by its own LWLock. This
+ * avoids serializing all operations on a single global lock.
+ *
+ * When exactly one backend listens on a channel, we signal that specific
+ * backend, avoiding unnecessary signals to all listening backends.
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) Multiple backends listen on the same channel, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and a
+ * single backend ProcNumber that is listening on that channel. If multiple
+ * backends try to listen on the same channel, we mark it as having multiple
+ * listeners and fall back to broadcast behavior.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ ProcNumber listener; /* single backend ID, or INVALID_PROC_NUMBER
+ * if multiple */
+ bool has_multiple_listeners;
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -269,6 +339,11 @@ typedef struct QueueBackendStatus
* In order to avoid deadlocks, whenever we need multiple locks, we first get
* NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
+ * The channel hash table is protected by a separate set of partitioned
+ * locks. To prevent deadlocks between these and NotifyQueueLock, the global
+ * lock-ordering rule is: always acquire NotifyQueueLock *before* acquiring
+ * any channel hash partition lock.
+ *
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
*
@@ -293,6 +368,60 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+/* Locks for partitioned channel hash table */
+static LWLock *channelHashLocks;
+static int channelHashTrancheId = 0;
+
+/* Structure to hold channel hash locks and tranche ID in shared memory */
+typedef struct ChannelHashLockData
+{
+ int trancheId;
+ LWLock locks[FLEXIBLE_ARRAY_MEMBER];
+} ChannelHashLockData;
+
+static ChannelHashLockData * channelHashLockData;
+
+/* Channel hash table for single listening backend signalling */
+static HTAB *channelHash = NULL;
+
+/* Forward declaration needed by GetChannelHash */
+static uint32 channel_hash_func(const void *key, Size keysize);
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * Set up to attach to the existing shared hash table. The hash
+ * control parameters must match those used in AsyncShmemInit.
+ */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -458,6 +587,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static LWLock *GetChannelHashLock(const char *channel);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -492,6 +629,12 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ sizeof(ChannelEntry)));
+
+ size = add_size(size, offsetof(ChannelHashLockData, locks) +
+ mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+
return size;
}
@@ -546,6 +689,49 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ /* Initialize locks for the partitioned hash table */
+ size = offsetof(ChannelHashLockData, locks) +
+ mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock));
+ channelHashLockData = (ChannelHashLockData *)
+ ShmemInitStruct("Channel Hash Lock Data", size, &found);
+ if (!found)
+ {
+ /* First time through: initialize the locks and tranche ID */
+ channelHashLockData->trancheId = LWLockNewTrancheId();
+ for (int i = 0; i < NUM_NOTIFY_PARTITIONS; i++)
+ {
+ LWLockInitialize(&channelHashLockData->locks[i],
+ channelHashLockData->trancheId);
+ }
+ }
+
+ /*
+ * Set up local pointers for convenience. We must also register the
+ * tranche ID in every backend that will use these locks.
+ */
+ channelHashLocks = channelHashLockData->locks;
+ channelHashTrancheId = channelHashLockData->trancheId;
+ LWLockRegisterTranche(channelHashTrancheId, "ChannelHashPartition");
}
@@ -1110,6 +1296,7 @@ Exec_ListenPreCommit(void)
QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
QUEUE_FIRST_LISTENER = MyProcNumber;
}
+
LWLockRelease(NotifyQueueLock);
/* Now we are listed in the global array, so remember we're listening */
@@ -1152,6 +1339,8 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ ChannelHashAddListener(channel, MyProcNumber);
}
/*
@@ -1175,6 +1364,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel, MyProcNumber);
break;
}
}
@@ -1193,9 +1383,22 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /*
+ * Before freeing the local list, iterate through it and perform a
+ * targeted removal for each of our channels from the shared hash table.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1239,6 +1442,7 @@ asyncQueueUnregister(void)
* Need exclusive lock here to manipulate list links.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
@@ -1565,12 +1769,18 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Selective mode: When all pending notification channels have exactly one
+ * listener each, we signal only those specific backends that are listening
+ * on the channels with pending notifications.
+ * 2. Broadcast mode: When any channel has multiple listeners (or we ran out
+ * of shared memory for the channel hash table), we signal all listening
+ * backends in our database.
+ *
+ * In addition to the channel-specific signaling, we also implement a "wake
+ * only tail" optimization: we signal the backend that is furthest behind
+ * in the queue, which helps keep backends from falling far behind and creates
+ * a chain reaction of wake-ups. This avoids thundering herd problems.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1793,11 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool tail_woken = false;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1809,179 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
+ /*
+ * To prevent deadlocks, we must always acquire locks in the same order:
+ * global NotifyQueueLock first, then individual partition locks.
+ */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+ /*
+ * Determine if we can use targeted signaling or must broadcast. This
+ * check must be done while holding NotifyQueueLock to prevent deadlocks
+ * against other backends that might be modifying the listener list and
+ * hash table simultaneously (e.g., asyncQueueUnregister).
+ */
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table, so we need to
+ * broadcast in that case as well.
+ */
+ if (!entry || entry->has_multiple_listeners)
+ {
+ broadcast_mode = true;
+ LWLockRelease(lock);
+ break;
+ }
+ LWLockRelease(lock);
+ }
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
/*
* Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * already caught up.
*/
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+ }
+ else
+ {
+ /*
+ * In targeted mode, signal specific listening backends. We must
+ * re-check the hash entries here inside the lock to avoid races.
+ */
+ foreach(p, channels)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ if (entry && !entry->has_multiple_listeners)
+ {
+ ProcNumber i = entry->listener;
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i])
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ LWLockRelease(lock);
}
+ }
+
+ /*
+ * Also check for any backends that are far behind. This ensures the
+ * global tail can advance even if they're not actively receiving
+ * notifications on their channels.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ int32 pid;
+ QueuePosition pos;
+
+ /*
+ * Skip if we've already decided to signal this one.
+ */
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Skip signaling listeners if they have already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ /*
+ * Wake only tail optimization: Signal the backend that is furthest
+ * behind to help prevent backends from getting far behind in the
+ * first place. This finds the backend(s) on the same page as the
+ * global tail, which are the ones holding up truncation. This creates
+ * a chain reaction where each backend eventually wakes up the next
+ * one as notifications are processed, avoiding thundering herd.
+ */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ tail_woken = true;
+ else
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
/* OK, need to signal this one */
pids[count] = pid;
procnos[count] = i;
count++;
}
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1647,9 +2001,9 @@ SignalBackends(void)
/*
* Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * only occur if the target backend exited since we released the lock;
+ * which is unlikely but certainly possible. So we just log a
+ * low-level debug message if it happens.
*/
if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
@@ -1657,6 +2011,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -2395,3 +2750,256 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * channel_hash_func
+ * Custom hash function for the channel hash table. This function ensures
+ * that the low-order bits of the hash are well-distributed, which is
+ * critical for partitioned hash tables.
+ */
+static uint32
+channel_hash_func(const void *key, Size keysize)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ uint32 h;
+
+ /*
+ * Mix the dboid and the channel name to produce a good hash. hash_any()
+ * is a high-quality portable hash function. This prevents channels with
+ * the same name in different databases from always mapping to the same
+ * partition.
+ */
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * GetChannelHashLock
+ * Return the LWLock that protects the partition for the given channel name.
+ */
+static LWLock *
+GetChannelHashLock(const char *channel)
+{
+ ChannelHashKey key;
+ uint32 hash;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ hash = get_hash_value(GetChannelHash(), &key);
+
+ return &channelHashLocks[hash % NUM_NOTIFY_PARTITIONS];
+}
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel.
+ *
+ * This function uses an optimistic read-locking strategy to maximize
+ * concurrency when many backends listen on the same channel.
+ *
+ * 1. It first takes a shared lock and checks the channel's state. If the
+ * channel is already marked as having multiple listeners, no write is
+ * needed, and we can return immediately. This is the fast path for the
+ * 3rd, 4th, etc., listener on a given channel.
+ *
+ * 2. If a write is needed (either to create the entry or to mark it as
+ * multi-listener), it releases the shared lock and acquires an exclusive
+ * lock.
+ *
+ * 3. CRUCIALLY, after acquiring the exclusive lock, it must re-check the
+ * state, as another backend may have modified the entry in the interim.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * FAST PATH: Optimistically take a shared lock. If the channel already
+ * has multiple listeners, we don't need to do anything.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && entry->has_multiple_listeners)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /*
+ * SLOW PATH: We need to write. Acquire exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check state after acquiring exclusive lock, as it may have changed.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_ENTER_NULL, &found);
+
+ if (entry == NULL)
+ {
+ /* Out of memory in the hash partition. */
+ ereport(DEBUG1, (errmsg("too many notification channels are already being tracked")));
+ LWLockRelease(lock);
+ return;
+ }
+
+ if (!found)
+ {
+ /* We are the first listener. */
+ entry->listener = procno;
+ entry->has_multiple_listeners = false;
+ }
+ else if (!entry->has_multiple_listeners)
+ {
+ /* We are the second listener. */
+ if (entry->listener != procno)
+ {
+ entry->has_multiple_listeners = true;
+ entry->listener = INVALID_PROC_NUMBER;
+ }
+ }
+ /* If has_multiple_listeners was already true, there is nothing to do. */
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * This function uses an optimistic read-lock strategy to maximize concurrency.
+ * An exclusive lock is only taken if we are the sole listener on a channel
+ * and need to remove the entry from the hash table.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Take a shared lock first to see if a removal is even necessary. If the
+ * entry doesn't exist, or it's a multi-listener entry, we have nothing to
+ * do. This is the fast path.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (!entry || entry->has_multiple_listeners || entry->listener != procno)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /*
+ * A removal is likely needed. Acquire an exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check the state, as another backend might have changed it. The only
+ * state change we care about is if it became a multi-listener channel, in
+ * which case we should no longer remove it.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && !entry->has_multiple_listeners && entry->listener == procno)
+ {
+ /* Still a single-listener entry for us, so remove it. */
+ (void) hash_search(GetChannelHash(), &key, HASH_REMOVE, NULL);
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if the channel is not being tracked (no listeners, or channel
+ * fell back to broadcast mode because we ran out of shared memory when trying
+ * to add entries to the hash table).
+ *
+ * Caller must hold the appropriate partition lock (shared is sufficient).
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(GetChannelHashLock(channel)));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
--
2.47.1
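The entry life cycle implemented by ChannelHashAddListener and ChannelHashRemoveListener in the patch above can be modeled without shared memory or locks. The following is a minimal sketch, not the patch's code: the `Toy*` names, the fixed-size table, and the omission of the database OID are all assumptions made for illustration. It shows the three observable states of a channel (untracked, single listener, multiple listeners) and which of them allow a targeted signal.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_CHANNELS 8      /* hypothetical capacity of this toy table */
#define TOY_NO_LISTENER  (-1)   /* stands in for INVALID_PROC_NUMBER */

/* Simplified stand-in for the patch's ChannelEntry (no dboid, no locking). */
typedef struct
{
    bool in_use;
    char channel[64];           /* toy limit; longer names are truncated */
    int  listener;              /* sole listener, or TOY_NO_LISTENER */
    bool has_multiple_listeners;
} ToyEntry;

typedef struct
{
    ToyEntry slots[TOY_MAX_CHANNELS];
} ToyTable;

static ToyEntry *
toy_find(ToyTable *t, const char *channel)
{
    for (int i = 0; i < TOY_MAX_CHANNELS; i++)
        if (t->slots[i].in_use && strcmp(t->slots[i].channel, channel) == 0)
            return &t->slots[i];
    return NULL;
}

/* LISTEN: returns false when the table is full and the channel stays untracked. */
static bool
toy_add_listener(ToyTable *t, const char *channel, int procno)
{
    ToyEntry *e = toy_find(t, channel);

    if (e == NULL)
    {
        /* First listener: claim a free slot and record the backend. */
        for (int i = 0; i < TOY_MAX_CHANNELS && e == NULL; i++)
            if (!t->slots[i].in_use)
                e = &t->slots[i];
        if (e == NULL)
            return false;
        e->in_use = true;
        snprintf(e->channel, sizeof(e->channel), "%s", channel);
        e->listener = procno;
        e->has_multiple_listeners = false;
    }
    else if (!e->has_multiple_listeners && e->listener != procno)
    {
        /* Second, different listener: switch the channel to broadcast. */
        e->has_multiple_listeners = true;
        e->listener = TOY_NO_LISTENER;
    }
    return true;
}

/* UNLISTEN: only a sole listener removes its entry. */
static void
toy_remove_listener(ToyTable *t, const char *channel, int procno)
{
    ToyEntry *e = toy_find(t, channel);

    if (e != NULL && !e->has_multiple_listeners && e->listener == procno)
        e->in_use = false;
}

/* NOTIFY: the backend to signal, or TOY_NO_LISTENER meaning "broadcast". */
static int
toy_signal_target(ToyTable *t, const char *channel)
{
    ToyEntry *e = toy_find(t, channel);

    if (e == NULL || e->has_multiple_listeners)
        return TOY_NO_LISTENER;
    return e->listener;
}
```

One property worth noticing in the model: because removal returns early for multi-listener entries, a channel that was once marked as having multiple listeners stays in broadcast mode even after all but one of its listeners have gone. As far as the diff shows, the patch behaves the same way, since it keeps no listener count.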
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-15 21:50 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-07-15 21:50 UTC (permalink / raw)
To: pgsql-hackers
On Tue, Jul 15, 2025, at 22:56, Joel Jacobson wrote:
> On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
>> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>>> "Joel Jacobson" <[email protected]> writes:
>>>> The attached proof-of-concept patch proposes a straightforward
>>>> optimization for the single-listener case. It introduces a shared-memory
>>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>>> listener.
>>>
>>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>>
>> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
>> to be serialized, which would at least hurt parallelizability.
>>
>> New benchmarks confirm this hypothesis.
>>
>> New patch attached that combines two complementary approaches, which together
>> seem to scale well for both common-channel and unique-channel scenarios:
>
> Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
> New squashed patch attached.
>
> /Joel
> Attachments:
> * 0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch
(cfbot is not picking up my patch; I wonder if some filename length limit is
exceeded, so I'm trying a shorter filename. Apologies for the spam.)
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v3.patch (28.2K, 2-0001-optimize_listen_notify-v3.patch)
From 18004e66974fc9d4a93e00b0183959ac306c7218 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 15 Jun 2025 00:09:43 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY signaling for scalability
Previously, the implementation would signal every backend listening on any
channel in the database for every NOTIFY. While robust, this broadcast
approach limits the scalability of application patterns that rely on
targeted notifications to distinct channels. This commit improves
scalability for such workloads by introducing an optimization for
single-listener channels.
A new shared hash table is introduced to track channels that have exactly
one listener. When a NOTIFY is issued, this table is consulted; if a
single listener is found for the target channel, only that backend is
signaled. The system gracefully falls back to the original broadcast
behavior for channels with multiple listeners or if the hash table runs
out of memory.
To avoid introducing a new contention point on a global lock, the hash
table's locking is partitioned. An array of lightweight locks protects
the hash table, with a custom hash function mapping channels to lock
partitions. This allows concurrent LISTEN/UNLISTEN operations on
different channels to proceed in parallel. For high-concurrency workloads
where many backends listen on the *same* channel, an optimistic
read-then-upgrade locking pattern is used to minimize serialization. A
strict lock ordering hierarchy (global NotifyQueueLock before any
partition lock) is observed to prevent deadlocks.
This also incorporates the "wake only tail" optimization to ensure the
global queue tail can always advance without causing a thundering herd
of signals.
---
src/backend/commands/async.c | 674 +++++++++++++++++++++++++++++++++--
1 file changed, 641 insertions(+), 33 deletions(-)
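The mapping from a (database OID, channel name) key to one of the partition locks, as described in the commit message, can be illustrated with a small standalone sketch. The hash functions here are assumptions chosen so that the example compiles on its own: FNV-1a stands in for hash_any() and the MurmurHash3 32-bit finalizer stands in for hash_uint32(); all `sketch_*` names are hypothetical. Only the structure (mix the OID, XOR with the name hash, reduce to a partition index) follows the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_PARTITIONS 128   /* mirrors NUM_NOTIFY_PARTITIONS; power of two */

/* FNV-1a over a NUL-terminated string; a stand-in for hash_any(). */
static uint32_t
sketch_hash_string(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s != '\0'; s++)
    {
        h ^= (unsigned char) *s;
        h *= 16777619u;
    }
    return h;
}

/* MurmurHash3 32-bit finalizer; a stand-in for hash_uint32(). */
static uint32_t
sketch_mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Combine the database OID and the channel name into one 32-bit hash. */
static uint32_t
sketch_channel_hash(uint32_t dboid, const char *channel)
{
    return sketch_mix32(dboid) ^ sketch_hash_string(channel);
}

/*
 * Index of the lock partition guarding this channel. Because the partition
 * count is a power of two, masking is equivalent to the modulo in the patch.
 */
static size_t
sketch_partition(uint32_t dboid, const char *channel)
{
    return sketch_channel_hash(dboid, channel) & (SKETCH_NUM_PARTITIONS - 1);
}
```

Mixing the OID before the XOR means that a channel with the same name in two databases gets two different hash values, so such channels are not forced onto the same partition lock.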
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..a5b614e1a24 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,11 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks channels with single listeners.
+ * When a channel has exactly one listening backend, we can signal just that
+ * backend. For channels with multiple listeners, we signal all listening
+ * backends.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +74,16 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which has two modes of operation, depending on
+ * whether any of our channels has multiple listening backends:
+ * a) If there are multiple listening backends, a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to every listening backend.
+ * b) Otherwise, such signals are only sent to each single listening backend
+ * per channel.
+ * Additionally, we use a "wake only tail" optimization: we always signal
+ * the backend furthest behind in the queue, which helps keep backends from
+ * falling far behind and creates a chain reaction of wake-ups.
+ * We can exclude backends that are already up to date, though.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -146,6 +153,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +170,68 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Number of partitions for the channel hash table's locks.
+ * This must be a power of two.
+ */
+#define NUM_NOTIFY_PARTITIONS 128
+
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backend is
+ * listening on each channel. Channels are identified by database OID and
+ * channel name, making them database-specific.
+ *
+ * To improve scalability of concurrent LISTEN/UNLISTEN operations, the hash
+ * table is partitioned, with each partition protected by its own LWLock. This
+ * avoids serializing all operations on a single global lock.
+ *
+ * When exactly one backend listens on a channel, we signal that specific
+ * backend, avoiding unnecessary signals to all listening backends.
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) Multiple backends listen on the same channel, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and a
+ * single backend ProcNumber that is listening on that channel. If multiple
+ * backends try to listen on the same channel, we mark it as having multiple
+ * listeners and fall back to broadcast behavior.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ ProcNumber listener; /* single backend ID, or INVALID_PROC_NUMBER
+ * if multiple */
+ bool has_multiple_listeners;
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -269,6 +339,11 @@ typedef struct QueueBackendStatus
* In order to avoid deadlocks, whenever we need multiple locks, we first get
* NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
+ * The channel hash table is protected by a separate set of partitioned
+ * locks. To prevent deadlocks between these and NotifyQueueLock, the global
+ * lock-ordering rule is: always acquire NotifyQueueLock *before* acquiring
+ * any channel hash partition lock.
+ *
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
*
@@ -293,6 +368,60 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+/* Locks for partitioned channel hash table */
+static LWLock *channelHashLocks;
+static int channelHashTrancheId = 0;
+
+/* Structure to hold channel hash locks and tranche ID in shared memory */
+typedef struct ChannelHashLockData
+{
+ int trancheId;
+ LWLock locks[FLEXIBLE_ARRAY_MEMBER];
+} ChannelHashLockData;
+
+static ChannelHashLockData * channelHashLockData;
+
+/* Channel hash table for single listening backend signalling */
+static HTAB *channelHash = NULL;
+
+/* Forward declaration needed by GetChannelHash */
+static uint32 channel_hash_func(const void *key, Size keysize);
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * Set up to attach to the existing shared hash table. The hash
+ * control parameters must match those used in AsyncShmemInit.
+ */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -458,6 +587,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static LWLock *GetChannelHashLock(const char *channel);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -492,6 +629,12 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ sizeof(ChannelEntry)));
+
+ size = add_size(size, offsetof(ChannelHashLockData, locks) +
+ mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+
return size;
}
@@ -546,6 +689,49 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+ hash_ctl.entrysize = sizeof(ChannelEntry);
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ /* Initialize locks for the partitioned hash table */
+ size = offsetof(ChannelHashLockData, locks) +
+ mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock));
+ channelHashLockData = (ChannelHashLockData *)
+ ShmemInitStruct("Channel Hash Lock Data", size, &found);
+ if (!found)
+ {
+ /* First time through: initialize the locks and tranche ID */
+ channelHashLockData->trancheId = LWLockNewTrancheId();
+ for (int i = 0; i < NUM_NOTIFY_PARTITIONS; i++)
+ {
+ LWLockInitialize(&channelHashLockData->locks[i],
+ channelHashLockData->trancheId);
+ }
+ }
+
+ /*
+ * Set up local pointers for convenience. We must also register the
+ * tranche ID in every backend that will use these locks.
+ */
+ channelHashLocks = channelHashLockData->locks;
+ channelHashTrancheId = channelHashLockData->trancheId;
+ LWLockRegisterTranche(channelHashTrancheId, "ChannelHashPartition");
}
@@ -1110,6 +1296,7 @@ Exec_ListenPreCommit(void)
QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
QUEUE_FIRST_LISTENER = MyProcNumber;
}
+
LWLockRelease(NotifyQueueLock);
/* Now we are listed in the global array, so remember we're listening */
@@ -1152,6 +1339,8 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ ChannelHashAddListener(channel, MyProcNumber);
}
/*
@@ -1175,6 +1364,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel, MyProcNumber);
break;
}
}
@@ -1193,9 +1383,22 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /*
+ * Before freeing the local list, iterate through it and perform a
+ * targeted removal for each of our channels from the shared hash table.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1239,6 +1442,7 @@ asyncQueueUnregister(void)
* Need exclusive lock here to manipulate list links.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
@@ -1565,12 +1769,18 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Targeted mode: When all pending notification channels have exactly one
+ * listener each, we signal only those specific backends that are listening
+ * on the channels with pending notifications.
+ * 2. Broadcast mode: When any channel has multiple listeners (or we ran out
+ * of shared memory for the channel hash table), we signal all listening
+ * backends in our database.
+ *
+ * In addition to the channel-specific signaling, we also implement a "wake
+ * only tail" optimization: we signal the backend that is furthest behind
+ * in the queue, which helps keep backends from falling far behind and creates
+ * a chain reaction of wake-ups. This avoids thundering herd problems.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1793,11 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool tail_woken = false;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1809,179 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
+ /*
+ * To prevent deadlocks, we must always acquire locks in the same order:
+ * global NotifyQueueLock first, then individual partition locks.
+ */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+ /*
+ * Determine if we can use targeted signaling or must broadcast. This
+ * check must be done while holding NotifyQueueLock to prevent deadlocks
+ * against other backends that might be modifying the listener list and
+ * hash table simultaneously (e.g., asyncQueueUnregister).
+ */
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table, so we need to
+ * broadcast in that case as well.
+ */
+ if (!entry || entry->has_multiple_listeners)
+ {
+ broadcast_mode = true;
+ LWLockRelease(lock);
+ break;
+ }
+ LWLockRelease(lock);
+ }
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
/*
* Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * already caught up.
*/
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+ }
+ else
+ {
+ /*
+ * In targeted mode, signal specific listening backends. We must
+ * re-check the hash entries here inside the lock to avoid races.
+ */
+ foreach(p, channels)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ if (entry && !entry->has_multiple_listeners)
+ {
+ ProcNumber i = entry->listener;
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i])
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ {
+ LWLockRelease(lock);
+ continue;
+ }
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ LWLockRelease(lock);
}
+ }
+
+ /*
+ * Also check for any backends that are far behind. This ensures the
+ * global tail can advance even if they're not actively receiving
+ * notifications on their channels.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ int32 pid;
+ QueuePosition pos;
+
+ /*
+ * Skip if we've already decided to signal this one.
+ */
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Skip signaling listeners if they already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ /*
+ * Wake only tail optimization: Signal the backend that is furthest
+ * behind to help prevent backends from getting far behind in the
+ * first place. This finds the backend(s) on the same page as the
+ * global tail, which are the ones holding up truncation. This creates
+ * a chain reaction where each backend eventually wakes up the next
+ * one as notifications are processed, avoiding thundering herd.
+ */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ tail_woken = true;
+ else
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
/* OK, need to signal this one */
pids[count] = pid;
procnos[count] = i;
count++;
}
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1647,9 +2001,9 @@ SignalBackends(void)
/*
* Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * only occur if the target backend exited since we released the lock;
+ * which is unlikely but certainly possible. So we just log a
+ * low-level debug message if it happens.
*/
if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
@@ -1657,6 +2011,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -2395,3 +2750,256 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * channel_hash_func
+ * Custom hash function for the channel hash table. This function ensures
+ * that the low-order bits of the hash are well-distributed, which is
+ * critical for partitioned hash tables.
+ */
+static uint32
+channel_hash_func(const void *key, Size keysize)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ uint32 h;
+
+ /*
+ * Mix the dboid and the channel name to produce a good hash. hash_any()
+ * is a high-quality portable hash function. This prevents channels with
+ * the same name in different databases from always mapping to the same
+ * partition.
+ */
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * GetChannelHashLock
+ * Return the LWLock that protects the partition for the given channel name.
+ */
+static LWLock *
+GetChannelHashLock(const char *channel)
+{
+ ChannelHashKey key;
+ uint32 hash;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ hash = get_hash_value(GetChannelHash(), &key);
+
+ return &channelHashLocks[hash % NUM_NOTIFY_PARTITIONS];
+}
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel.
+ *
+ * This function uses an optimistic read-locking strategy to maximize
+ * concurrency when many backends listen on the same channel.
+ *
+ * 1. It first takes a shared lock and checks the channel's state. If the
+ * channel is already marked as having multiple listeners, no write is
+ * needed, and we can return immediately. This is the fast path for the
+ * 3rd, 4th, etc., listener on a given channel.
+ *
+ * 2. If a write is needed (either to create the entry or to mark it as
+ * multi-listener), it releases the shared lock and acquires an exclusive
+ * lock.
+ *
+ * 3. CRUCIALLY, after acquiring the exclusive lock, it must re-check the
+ * state, as another backend may have modified the entry in the interim.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * FAST PATH: Optimistically take a shared lock. If the channel already
+ * has multiple listeners, we don't need to do anything.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && entry->has_multiple_listeners)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /*
+ * SLOW PATH: We need to write. Acquire exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check state after acquiring exclusive lock, as it may have changed.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_ENTER_NULL, &found);
+
+ if (entry == NULL)
+ {
+ /* Out of memory in the hash partition. */
+ ereport(DEBUG1, (errmsg("too many notification channels are already being tracked")));
+ LWLockRelease(lock);
+ return;
+ }
+
+ if (!found)
+ {
+ /* We are the first listener. */
+ entry->listener = procno;
+ entry->has_multiple_listeners = false;
+ }
+ else if (!entry->has_multiple_listeners)
+ {
+ /* We are the second listener. */
+ if (entry->listener != procno)
+ {
+ entry->has_multiple_listeners = true;
+ entry->listener = INVALID_PROC_NUMBER;
+ }
+ }
+ /* If entry->has_multiple_listeners is now true, do nothing. */
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * This function uses an optimistic read-lock strategy to maximize concurrency.
+ * An exclusive lock is only taken if we are the sole listener on a channel
+ * and need to remove the entry from the hash table.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Take a shared lock first to see if a removal is even necessary. If the
+ * entry doesn't exist, or it's a multi-listener entry, we have nothing to
+ * do. This is the fast path.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (!entry || entry->has_multiple_listeners || entry->listener != procno)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /*
+ * A removal is likely needed. Acquire an exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check the state, as another backend might have changed it. The only
+ * state change we care about is if it became a multi-listener channel, in
+ * which case we should no longer remove it.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && !entry->has_multiple_listeners && entry->listener == procno)
+ {
+ /* Still a single-listener entry for us, so remove it. */
+ (void) hash_search(GetChannelHash(), &key, HASH_REMOVE, NULL);
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if the channel is not being tracked (no listeners, or channel
+ * fell back to broadcast mode because we ran out of shared memory when trying
+ * to add entries to the hash table).
+ *
+ * Caller must hold the appropriate partition lock (shared is sufficient).
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(GetChannelHashLock(channel)));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
--
2.47.1
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-16 00:20 Rishu Bagga <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Rishu Bagga @ 2025-07-16 00:20 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
Hi Joel,
Thanks for sharing the patch.
I have a few questions based on a cursory first look.
> If a single listener is found, we signal only that backend.
> Otherwise, we fall back to the existing broadcast behavior.
The idea of not wanting to wake up all backends makes sense to me,
but I don’t understand why we want this optimization only for the case
where there is a single backend listening on a channel.
Is there a pattern of usage in LISTEN/NOTIFY where users typically
have either just one or several backends listening on a channel?
If we are doing this optimization, why not maintain a list of backends
for each channel, and only wake up those channels?
Thanks,
Rishu
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-16 07:00 Joel Jacobson <[email protected]>
parent: Rishu Bagga <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-07-16 07:00 UTC (permalink / raw)
To: Rishu Bagga <[email protected]>; +Cc: pgsql-hackers
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> Hi Joel,
>
> Thanks for sharing the patch.
> I have a few questions based on a cursory first look.
>
>> If a single listener is found, we signal only that backend.
>> Otherwise, we fall back to the existing broadcast behavior.
>
> The idea of not wanting to wake up all backends makes sense to me,
> but I don’t understand why we want this optimization only for the case
> where there is a single backend listening on a channel.
>
> Is there a pattern of usage in LISTEN/NOTIFY where users typically
> have either just one or several backends listening on a channel?
>
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those channels?
Thanks for the thoughtful question. You've hit on the central design trade-off
in this optimization: how to provide targeted signaling for some workloads
without degrading performance for others.
While we don't have telemetry on real-world usage patterns of LISTEN/NOTIFY,
it seems likely that most applications fall into one of three categories,
which I've been thinking of in networking terms:
1. Broadcast-style ("hub mode")
Many backends listening on the *same* channel (e.g., for cache invalidation).
The current implementation is already well-optimized for this, behaving like
an Ethernet hub that broadcasts to all ports. Waking all listeners is efficient
because they all need the message.
2. Targeted notifications ("switch mode")
Each backend listens on its own private channel (e.g., for session events or
worker queues). This is where the current implementation scales poorly, as every
NOTIFY wakes up all listeners regardless of relevance. My patch is designed
to make this behave like an efficient Ethernet switch.
3. Selective multicast-style ("group mode")
A subset of backends shares a channel, but not all. This is the tricky middle
ground. Your question, "why not maintain a list of backends for each channel,
and only wake up those channels?" is exactly the right one to ask.
A full listener list seems like the obvious path to optimizing for *all* cases.
However, the devil is in the details of concurrency and performance. Managing
such a list would require heavier locking, which would create a new bottleneck
and degrade the scalability of LISTEN/UNLISTEN operations—especially for
the "hub mode" case where many backends rapidly subscribe to the same popular
channel.
This patch makes a deliberate architectural choice:
Prioritize a massive, low-risk win for "switch mode" while rigorously protecting
the performance of "hub mode".
It introduces a targeted fast path for single-listener channels and cleanly
falls back to the existing, well-performing broadcast model for everything else.
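As a rough illustration of that fast path and fallback, here is a minimal
standalone sketch of the per-channel state transitions. It is not code from
the patch: DemoChannelEntry, DemoProcNumber, DEMO_INVALID_PROC and the
demo_* functions are hypothetical stand-ins for ChannelEntry, ProcNumber and
INVALID_PROC_NUMBER, and all locking is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the patch uses ProcNumber and INVALID_PROC_NUMBER. */
typedef int DemoProcNumber;
#define DEMO_INVALID_PROC (-1)

typedef struct DemoChannelEntry
{
    bool           in_use;                 /* false: nobody listens yet */
    DemoProcNumber listener;               /* valid only with one listener */
    bool           has_multiple_listeners; /* true: NOTIFY must broadcast */
} DemoChannelEntry;

/* Record a LISTEN on the channel (locking omitted in this sketch). */
void
demo_add_listener(DemoChannelEntry *e, DemoProcNumber procno)
{
    assert(procno != DEMO_INVALID_PROC);

    if (!e->in_use)
    {
        /* First listener: remember exactly who it is. */
        e->in_use = true;
        e->listener = procno;
        e->has_multiple_listeners = false;
    }
    else if (!e->has_multiple_listeners && e->listener != procno)
    {
        /* Second, different listener: give up tracking individuals. */
        e->has_multiple_listeners = true;
        e->listener = DEMO_INVALID_PROC;
    }
    /* Already multi-listener, or the same backend again: nothing to write. */
}

/* Returns the single backend to signal, or DEMO_INVALID_PROC to broadcast. */
DemoProcNumber
demo_notify_target(const DemoChannelEntry *e)
{
    if (!e->in_use || e->has_multiple_listeners)
        return DEMO_INVALID_PROC;
    return e->listener;
}
```

Once has_multiple_listeners is set, later listeners find nothing to write,
which is why they only need a shared lock in the real code.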
This brings us back to "group mode", which remains an open optimization problem.
A possible approach could be to track listeners up to a small threshold *K*
(e.g., store up to 4 ProcNumbers in the hash entry). If the count exceeds *K*,
we would flip a "broadcast" flag and revert to hub-mode behavior.
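A minimal sketch of what such an entry could look like, assuming K = 4. The
names DemoGroupEntry, DEMO_K and demo_group_add are hypothetical and plain
ints stand in for ProcNumber; none of this is taken from the patch. The
return value reports whether the entry had to be modified, that is, whether
the caller would have needed an exclusive lock.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_K 4                    /* hypothetical threshold, "K" above */

typedef struct DemoGroupEntry
{
    int  nlisteners;                /* number of valid slots in listeners[] */
    int  listeners[DEMO_K];         /* stand-in for ProcNumber values */
    bool broadcast;                 /* set once more than DEMO_K listen */
} DemoGroupEntry;

/*
 * Add a listener.  Returns true if the entry was modified, i.e. the caller
 * would have needed an exclusive lock on it.
 */
bool
demo_group_add(DemoGroupEntry *e, int procno)
{
    if (e->broadcast)
        return false;               /* read-only fast path */

    for (int i = 0; i < e->nlisteners; i++)
    {
        if (e->listeners[i] == procno)
            return false;           /* already tracked */
    }

    if (e->nlisteners < DEMO_K)
        e->listeners[e->nlisteners++] = procno;
    else
    {
        /* Listener K+1: stop tracking and revert to hub-mode behavior. */
        e->broadcast = true;
        e->nlisteners = 0;
    }
    return true;
}
```

In this sketch every new listener up to number K+1 modifies the entry; only
after the broadcast flag is set does a LISTEN become read-only.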
However, this path has two drawbacks:
1. Performance Penalty for Hub Mode
With the current patch, after the second listener joins a channel,
the has_multiple_listeners flag is set. Every subsequent listener can acquire
a shared lock, see the flag is true, and immediately continue. This is
a highly concurrent, read-only operation that does not require mutating shared
state.
In contrast, the K-listener approach would force every new listener (from the
third up to the K-th) to acquire an exclusive lock to mutate the shared
listener array. This would serialize LISTEN operations on popular channels,
creating the very contention point this patch successfully avoids and directly
harming the hub-mode use case that currently works well.
2. Uncertainty
Compounding this, without clear data on typical "group" sizes, choosing a value
for *K* is a shot in the dark. A small *K* might not help much, while
a large *K* would increase the shared memory footprint and worsen the
serialization penalty.
For these reasons, attempting to build a switch that also optimizes for
multicast risks undermining the architectural clarity and performance of
both the switch and hub models.
This patch, therefore, draws a clean line. It provides a precise,
low-cost path for switch-mode workloads and preserves the existing,
well-performing path for hub-mode workloads. While this leaves "group mode"
unoptimized for now, it ensures we make two common use cases better without
making any use case worse. The new infrastructure is flexible, leaving
the door open should a better approach for "group mode" emerge in
the future—one that doesn't compromise the other two.
Updated benchmarks showing master vs 0001-optimize_listen_notify-v3.patch:
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equa...
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connection...
I've not included the benchmark CSV data in this mail, since it's quite heavy,
160kB, and I couldn't see any significant performance changes since v2.
/Joel
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-17 07:43 Joel Jacobson <[email protected]>
parent: Rishu Bagga <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-07-17 07:43 UTC (permalink / raw)
To: Rishu Bagga <[email protected]>; +Cc: pgsql-hackers
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those channels?
Thanks for contributing a great idea; it turned out to work really well
in practice!
The attached new v4 of the patch implements your multicast idea:
---
Improve NOTIFY scalability with multicast signaling
Previously, NOTIFY would signal all listening backends in a database for
any channel with more than one listener. This broadcast approach scales
poorly for workloads that rely on targeted notifications to small groups
of backends, as every NOTIFY could wake up many unrelated processes.
This commit introduces a multicast signaling optimization to improve
scalability for such use-cases. A new GUC, `notify_multicast_threshold`,
is added to control the maximum number of listeners to track per
channel. When a NOTIFY is issued, if the number of listeners is at or
below this threshold, only those specific backends are signaled. If the
limit is exceeded, the system falls back to the original broadcast
behavior.
The default for this threshold is set to 16. Benchmarks show this
provides a good balance, with significant performance gains for small to
medium-sized listener groups and diminishing returns for higher values.
Setting the threshold to 0 disables multicast signaling, forcing a
fallback to the broadcast path for all notifications.
To implement this, a new partitioned hash table is introduced in shared
memory to track listeners. Locking is managed with an optimistic
read-then-upgrade pattern. This allows concurrent LISTEN/UNLISTEN
operations on *different* channels to proceed in parallel, as they will
only acquire locks on their respective partitions.
For correctness and to prevent deadlocks, a strict lock ordering
hierarchy (NotifyQueueLock before any partition lock) is observed. The
signaling path in NOTIFY must acquire the global NotifyQueueLock first
before consulting the partitioned hash table, which serializes
concurrent NOTIFYs. The primary concurrency win is for LISTEN/UNLISTEN
operations, which are now much more scalable.
The "wake only tail" optimization, which signals backends that are far
behind in the queue, is also included to ensure the global queue tail
can always advance.
Thanks to Rishu Bagga for the multicast idea.
---
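To make the mode selection described in the commit message concrete, here is
a minimal standalone sketch of the decision. It is not the patch code:
DemoChannelState and demo_must_broadcast are hypothetical names, and treating
an untracked channel as "broadcast" is an assumption carried over from the
"no entry" comment in the earlier single-listener version of
SignalBackends().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-channel summary of what the shared hash table knows. */
typedef struct DemoChannelState
{
    bool tracked;       /* false: no hash entry, e.g. the table was full */
    int  nlisteners;    /* number of listeners recorded for the channel */
} DemoChannelState;

/*
 * Decide how to signal for one NOTIFY-ing transaction.  Returns true if we
 * must broadcast to all listeners in the database, false if signaling only
 * the tracked listeners of each channel is enough.
 */
bool
demo_must_broadcast(const DemoChannelState *channels, size_t nchannels,
                    int threshold)
{
    /* A threshold of 0 disables multicast signaling altogether. */
    if (threshold <= 0)
        return true;

    for (size_t i = 0; i < nchannels; i++)
    {
        /* One untracked or over-threshold channel forces a broadcast. */
        if (!channels[i].tracked || channels[i].nlisteners > threshold)
            return true;
    }
    return false;
}
```

For example, with channels holding 2 and 16 listeners, a threshold of 16
allows multicast, while a threshold of 8 falls back to broadcast.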
BENCHMARK
To find a good default for notify_multicast_threshold, I created a new
benchmark tool. It spawns one "ping" worker that sends notifications to a
channel and multiple "pong" workers that listen on channels and immediately
reply to the "ping" worker. Once all replies have been received, the cycle
repeats.
By measuring how many complete round-trips can be performed per second,
it evaluates the impact of different multicast threshold settings.
The results below compare broadcast with multicast for different multicast
group sizes. For N backends per channel, notify_multicast_threshold is set
either below N (K=1 here), which forces broadcast, or exactly to N, which
allows multicast. K=1 corresponds to the single-listener targeted mode that
the earlier patch versions optimized for.
K = notify_multicast_threshold
With 2 backends per channel (32 channels total):
patch-v4 (K=1): 8,477 TPS
patch-v4 (K=2): 27,748 TPS (3.3x improvement)
With 4 backends per channel (16 channels total):
patch-v4 (K=1): 7,367 TPS
patch-v4 (K=4): 18,777 TPS (2.6x improvement)
With 8 backends per channel (8 channels total):
patch-v4 (K=1): 5,892 TPS
patch-v4 (K=8): 8,620 TPS (1.5x improvement)
With 16 backends per channel (4 channels total):
patch-v4 (K=1): 4,202 TPS
patch-v4 (K=16): 4,750 TPS (1.1x improvement)
I also reran the old ping-pong benchmark as well as the pgbench benchmarks
with notify_multicast_threshold set to 1, 8, and 16 (shown as t=1, t=8, and
t=16 below), and I couldn't detect any negative impact.
Ping-pong benchmark:
Extra Connections: 0
----------------------------------------------------------------------
Version            Max TPS   vs Master   All Values (sorted)
----------------------------------------------------------------------
master                9119   baseline    {9088, 9095, 9119}
patch-v4 (t=1)        9116   -0.0%       {9082, 9090, 9116}
patch-v4 (t=8)        9106   -0.2%       {9086, 9102, 9106}
patch-v4 (t=16)       9134   +0.2%       {9082, 9116, 9134}
Extra Connections: 10
----------------------------------------------------------------------
Version            Max TPS   vs Master   All Values (sorted)
----------------------------------------------------------------------
master                6237   baseline    {6224, 6227, 6237}
patch-v4 (t=1)        9358   +50.0%      {9302, 9345, 9358}
patch-v4 (t=8)        9348   +49.9%      {9266, 9312, 9348}
patch-v4 (t=16)       9408   +50.8%      {9339, 9407, 9408}
Extra Connections: 100
----------------------------------------------------------------------
Version            Max TPS   vs Master   All Values (sorted)
----------------------------------------------------------------------
master                2028   baseline    {2026, 2027, 2028}
patch-v4 (t=1)        9278   +357.3%     {9222, 9235, 9278}
patch-v4 (t=8)        9227   +354.8%     {9184, 9207, 9227}
patch-v4 (t=16)       9250   +355.9%     {9180, 9243, 9250}
Extra Connections: 1000
----------------------------------------------------------------------
Version            Max TPS   vs Master   All Values (sorted)
----------------------------------------------------------------------
master                 239   baseline    {239, 239, 239}
patch-v4 (t=1)        8841   +3594.1%    {8819, 8840, 8841}
patch-v4 (t=8)        8835   +3591.7%    {8802, 8826, 8835}
patch-v4 (t=16)       8855   +3599.8%    {8787, 8843, 8855}
Among my pgbench benchmarks, results seem unaffected in the following:
listen_unique.sql
listen_common.sql
listen_unlisten_unique.sql
listen_unlisten_common.sql
The listen_notify_unique.sql benchmark shows similar improvements for all
notify_multicast_threshold values tested. This is expected: the benchmark
uses unique channels, so a higher notify_multicast_threshold should not
affect the results.
# TEST `listen_notify_unique.sql`
```sql
LISTEN channel_:client_id;
NOTIFY channel_:client_id;
```
## 1 Connection, 1 Job
- **master**: 63696 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 63377 TPS (-0.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 62890 TPS (-1.3%)
- **optimize_listen_notify_v4 (t=16.0)**: 63114 TPS (-0.9%)
## 2 Connections, 2 Jobs
- **master**: 90967 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 109423 TPS (+20.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 109107 TPS (+19.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 109608 TPS (+20.5%)
## 4 Connections, 4 Jobs
- **master**: 114333 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 140986 TPS (+23.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 141263 TPS (+23.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 141327 TPS (+23.6%)
## 8 Connections, 8 Jobs
- **master**: 64429 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 93787 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=8.0)**: 93828 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 93875 TPS (+45.7%)
## 16 Connections, 16 Jobs
- **master**: 41704 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 84791 TPS (+103.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 88330 TPS (+111.8%)
- **optimize_listen_notify_v4 (t=16.0)**: 84827 TPS (+103.4%)
## 32 Connections, 32 Jobs
- **master**: 25988 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 83197 TPS (+220.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 83453 TPS (+221.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 83576 TPS (+221.6%)
## 1000 Connections, 1 Job
- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3097 TPS (+2852.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 3079 TPS (+2835.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 3080 TPS (+2835.9%)
## 1000 Connections, 2 Jobs
- **master**: 108 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2981 TPS (+2671.7%)
- **optimize_listen_notify_v4 (t=8.0)**: 3091 TPS (+2774.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 3097 TPS (+2779.6%)
## 1000 Connections, 4 Jobs
- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2947 TPS (+2705.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 2994 TPS (+2751.0%)
- **optimize_listen_notify_v4 (t=16.0)**: 2992 TPS (+2748.7%)
## 1000 Connections, 8 Jobs
- **master**: 107 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3064 TPS (+2777.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 2981 TPS (+2698.5%)
- **optimize_listen_notify_v4 (t=16.0)**: 2979 TPS (+2696.8%)
## 1000 Connections, 16 Jobs
- **master**: 101 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3068 TPS (+2923.2%)
- **optimize_listen_notify_v4 (t=8.0)**: 2950 TPS (+2806.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 2940 TPS (+2796.8%)
## 1000 Connections, 32 Jobs
- **master**: 102 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2980 TPS (+2815.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 3034 TPS (+2867.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 2962 TPS (+2798.0%)
Here are some plots that include the above results:
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equa...
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connection...
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v4.patch (35.9K, 2-0001-optimize_listen_notify-v4.patch)
download | inline diff:
From 32f2b6818169381f2795e7c3264bb3710e9f6eae Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 15 Jun 2025 00:09:43 +0200
Subject: [PATCH] Improve NOTIFY scalability with multicast signaling
Previously, NOTIFY would signal all listening backends in a database for
any channel with more than one listener. This broadcast approach scales
poorly for workloads that rely on targeted notifications to small groups
of backends, as every NOTIFY could wake up many unrelated processes.
This commit introduces a multicast signaling optimization to improve
scalability for such use-cases. A new GUC, `notify_multicast_threshold`,
is added to control the maximum number of listeners to track per
channel. When a NOTIFY is issued, if the number of listeners is at or
below this threshold, only those specific backends are signaled. If the
limit is exceeded, the system falls back to the original broadcast
behavior.
The default for this threshold is set to 16. Benchmarks show this
provides a good balance, with significant performance gains for small to
medium-sized listener groups and diminishing returns for higher values.
Setting the threshold to 0 disables multicast signaling, forcing a
fallback to the broadcast path for all notifications.
To implement this, a new partitioned hash table is introduced in shared
memory to track listeners. Locking is managed with an optimistic
read-then-upgrade pattern. This allows concurrent LISTEN/UNLISTEN
operations on *different* channels to proceed in parallel, as they will
only acquire locks on their respective partitions.
For correctness and to prevent deadlocks, a strict lock ordering
hierarchy (NotifyQueueLock before any partition lock) is observed. The
signaling path in NOTIFY must acquire the global NotifyQueueLock first
before consulting the partitioned hash table, which serializes
concurrent NOTIFYs. The primary concurrency win is for LISTEN/UNLISTEN
operations, which are now much more scalable.
The "wake only tail" optimization, which signals backends that are far
behind in the queue, is also included to ensure the global queue tail
can always advance.
Thanks to Rishu Bagga for the multicast idea.
---
src/backend/commands/async.c | 825 ++++++++++++++++++++++++++--
src/backend/utils/init/globals.c | 1 +
src/backend/utils/misc/guc_tables.c | 13 +
src/include/miscadmin.h | 1 +
src/include/utils/guc_hooks.h | 1 +
5 files changed, 808 insertions(+), 33 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..56a74b707fc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,13 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks listeners for each channel, up
+ * to a configurable threshold ('notify_multicast_threshold'). When the
+ * number of listeners is within this threshold, we can perform a targeted
+ * "multicast" by signaling only those specific backends. If the number of
+ * listeners exceeds the threshold, we fall back to the original broadcast
+ * behavior of signaling all listening backends in the database.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +76,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which has two modes of operation:
+ * a) Multicast mode: For channels with a number of listeners not exceeding
+ * 'notify_multicast_threshold', signals are sent only to those specific
+ * backends.
+ * b) Broadcast mode: If any channel being notified has more listeners than
+ * the threshold, we revert to the original behavior and send a
+ * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend in the database.
+ * Additionally, we use a "wake only tail" optimization: we always signal
+ * the backend furthest behind in the queue, which helps prevent backends
+ * from falling far behind and creates a chain reaction of wake-ups.
+ * We can exclude backends that are already up to date, though.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +137,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -146,6 +156,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +173,71 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Number of partitions for the channel hash table's locks.
+ * This must be a power of two.
+ */
+#define NUM_NOTIFY_PARTITIONS 128
+
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backends are
+ * listening on each channel, up to a certain threshold. Channels are
+ * identified by database OID and channel name, making them
+ * database-specific.
+ *
+ * To improve scalability of concurrent LISTEN/UNLISTEN operations, the hash
+ * table is partitioned, with each partition protected by its own LWLock.
+ * This avoids serializing all operations on a single global lock.
+ *
+ * When the number of backends listening on a channel is at or below
+ * 'notify_multicast_threshold', we store their ProcNumbers and signal them
+ * directly (multicast).
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) More backends listen on the same channel than the threshold allows, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and an array
+ * of listening backend ProcNumbers, up to notify_multicast_threshold. If the
+ * number of listeners exceeds the threshold, we mark the channel for
+ * broadcast and stop tracking individual listeners.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ bool is_broadcast; /* True if num_listeners > threshold */
+ uint8 num_listeners; /* Number of listeners currently stored */
+ /* Listeners array follows, of size notify_multicast_threshold */
+ ProcNumber listeners[FLEXIBLE_ARRAY_MEMBER];
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -269,6 +345,11 @@ typedef struct QueueBackendStatus
* In order to avoid deadlocks, whenever we need multiple locks, we first get
* NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
+ * The channel hash table is protected by a separate set of partitioned
+ * locks. To prevent deadlocks between these and NotifyQueueLock, the global
+ * lock-ordering rule is: always acquire NotifyQueueLock *before* acquiring
+ * any channel hash partition lock.
+ *
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
*
@@ -293,6 +374,69 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+/* Locks for partitioned channel hash table */
+static LWLock *channelHashLocks;
+static int channelHashTrancheId = 0;
+
+/* Structure to hold channel hash locks and tranche ID in shared memory */
+typedef struct ChannelHashLockData
+{
+ int trancheId;
+ LWLock locks[FLEXIBLE_ARRAY_MEMBER];
+} ChannelHashLockData;
+
+static ChannelHashLockData * channelHashLockData;
+
+/* Channel hash table for multicast signalling */
+static HTAB *channelHash = NULL;
+
+/* Forward declaration needed by GetChannelHash */
+static uint32 channel_hash_func(const void *key, Size keysize);
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+ Size entrysize;
+
+ /*
+ * Set up to attach to the existing shared hash table. The hash
+ * control parameters must match those used in AsyncShmemInit.
+ */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+
+ /*
+ * The size of a channel entry is flexible. We must have enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(notify_multicast_threshold, sizeof(ProcNumber)));
+ hash_ctl.entrysize = entrysize;
+
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -458,6 +602,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static LWLock *GetChannelHashLock(const char *channel);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -485,6 +637,7 @@ Size
AsyncShmemSize(void)
{
Size size;
+ Size entrysize;
/* This had better match AsyncShmemInit */
size = mul_size(MaxBackends, sizeof(QueueBackendStatus));
@@ -492,6 +645,18 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ /*
+ * The size of a channel entry is flexible. We must allocate enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(notify_multicast_threshold, sizeof(ProcNumber)));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ entrysize));
+
+ size = add_size(size, offsetof(ChannelHashLockData, locks) +
+ mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+
return size;
}
@@ -546,6 +711,58 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+ Size entrysize;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+
+ /*
+ * The size of a channel entry is flexible. We must have enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(notify_multicast_threshold, sizeof(ProcNumber)));
+ hash_ctl.entrysize = entrysize;
+
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ /* Initialize locks for the partitioned hash table */
+ size = offsetof(ChannelHashLockData, locks) +
+ mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock));
+ channelHashLockData = (ChannelHashLockData *)
+ ShmemInitStruct("Channel Hash Lock Data", size, &found);
+ if (!found)
+ {
+ /* First time through: initialize the locks and tranche ID */
+ channelHashLockData->trancheId = LWLockNewTrancheId();
+ for (int i = 0; i < NUM_NOTIFY_PARTITIONS; i++)
+ {
+ LWLockInitialize(&channelHashLockData->locks[i],
+ channelHashLockData->trancheId);
+ }
+ }
+
+ /*
+ * Set up local pointers for convenience. We must also register the
+ * tranche ID in every backend that will use these locks.
+ */
+ channelHashLocks = channelHashLockData->locks;
+ channelHashTrancheId = channelHashLockData->trancheId;
+ LWLockRegisterTranche(channelHashTrancheId, "ChannelHashPartition");
}
@@ -1110,6 +1327,7 @@ Exec_ListenPreCommit(void)
QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
QUEUE_FIRST_LISTENER = MyProcNumber;
}
+
LWLockRelease(NotifyQueueLock);
/* Now we are listed in the global array, so remember we're listening */
@@ -1152,6 +1370,8 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ ChannelHashAddListener(channel, MyProcNumber);
}
/*
@@ -1175,6 +1395,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel, MyProcNumber);
break;
}
}
@@ -1193,9 +1414,22 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /*
+ * Before freeing the local list, iterate through it and perform a
+ * targeted removal for each of our channels from the shared hash table.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1239,6 +1473,7 @@ asyncQueueUnregister(void)
* Need exclusive lock here to manipulate list links.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
@@ -1565,12 +1800,18 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Multicast mode: If all pending notification channels have a number of
+ * listeners at or below the 'notify_multicast_threshold', we signal only
+ * those specific backends.
+ * 2. Broadcast mode: If any channel has more listeners than the threshold (or
+ * we ran out of shared memory for the channel hash table), we signal all
+ * listening backends in our database.
+ *
+ * In addition to the channel-specific signaling, we also implement a "wake
+ * only tail" optimization: we signal the backend that is furthest behind
+ * in the queue, which helps prevent backends from falling far behind and
+ * creates a chain reaction of wake-ups, avoiding thundering herd problems.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1824,11 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool tail_woken = false;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1840,173 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
+ /*
+ * To prevent deadlocks, we must always acquire locks in the same order:
+ * global NotifyQueueLock first, then individual partition locks.
+ */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+ /*
+ * Determine if we can use targeted signaling or must broadcast. This
+ * check must be done while holding NotifyQueueLock to prevent deadlocks
+ * against other backends that might be modifying the listener list and
+ * hash table simultaneously (e.g., asyncQueueUnregister).
+ */
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table. If the entry is
+ * marked for broadcast, we must use broadcast mode.
+ */
+ if (!entry || entry->is_broadcast)
+ {
+ broadcast_mode = true;
+ LWLockRelease(lock);
+ break;
+ }
+ LWLockRelease(lock);
+ }
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
/*
* Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * already caught up.
*/
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+ }
+ else
+ {
+ /*
+ * In multicast mode, signal specific listening backends. We must
+ * re-check the hash entries here inside the lock to avoid races.
+ */
+ foreach(p, channels)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ if (entry && !entry->is_broadcast)
+ {
+ for (int j = 0; j < entry->num_listeners; j++)
+ {
+ ProcNumber i = entry->listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+ LWLockRelease(lock);
}
+ }
+
+ /*
+ * Also check for any backends that are far behind. This ensures the
+ * global tail can advance even if they're not actively receiving
+ * notifications on their channels.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ int32 pid;
+ QueuePosition pos;
+
+ /*
+ * Skip if we've already decided to signal this one.
+ */
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /*
+ * Skip signaling listeners if they already caught up.
+ */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ /*
+ * Wake only tail optimization: Signal the backend that is furthest
+ * behind to help prevent backends from getting far behind in the
+ * first place. This finds the backend(s) on the same page as the
+ * global tail, which are the ones holding up truncation. This creates
+ * a chain reaction where each backend eventually wakes up the next
+ * one as notifications are processed, avoiding thundering herd.
+ */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ tail_woken = true;
+ else
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
/* OK, need to signal this one */
pids[count] = pid;
procnos[count] = i;
count++;
}
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1647,9 +2026,9 @@ SignalBackends(void)
/*
* Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * only occur if the target backend exited since we released the lock;
+ * which is unlikely but certainly possible. So we just log a
+ * low-level debug message if it happens.
*/
if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
@@ -1657,6 +2036,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -2395,3 +2775,382 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+/*
+ * GUC check_hook for notify_multicast_threshold
+ */
+bool
+check_notify_multicast_threshold(int *newval, void **extra, GucSource source)
+{
+ /*
+ * We don't allow values less than 0. A value of 0 is special and means
+ * the multicast optimization is disabled entirely.
+ */
+ if (*newval < 0)
+ {
+ GUC_check_errdetail("notify_multicast_threshold must be non-negative.");
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * channel_hash_func
+ * Custom hash function for the channel hash table. This function ensures
+ * that the low-order bits of the hash are well-distributed, which is
+ * critical for partitioned hash tables.
+ */
+static uint32
+channel_hash_func(const void *key, Size keysize)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ uint32 h;
+
+ /*
+ * Mix the dboid and the channel name to produce a good hash. hash_any()
+ * is a high-quality portable hash function. This prevents channels with
+ * the same name in different databases from always mapping to the same
+ * partition.
+ */
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * GetChannelHashLock
+ * Return the LWLock that protects the partition for the given channel name.
+ */
+static LWLock *
+GetChannelHashLock(const char *channel)
+{
+ ChannelHashKey key;
+ uint32 hash;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ hash = get_hash_value(GetChannelHash(), &key);
+
+ return &channelHashLocks[hash % NUM_NOTIFY_PARTITIONS];
+}
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel.
+ *
+ * This function uses an optimistic read-locking strategy to maximize
+ * concurrency. An exclusive lock is only taken when mutating the listener
+ * list.
+ *
+ * 1. It first takes a shared lock. If the channel is already in broadcast
+ * mode, or if the current backend is already in the listener list, no write
+ * is needed and we can return immediately.
+ *
+ * 2. If a write is needed, it releases the shared lock and acquires an
+ * exclusive lock.
+ *
+ * 3. CRUCIALLY, after acquiring the exclusive lock, it must re-check the
+ * state, as another backend may have modified the entry in the interim.
+ *
+ * 4. If the number of listeners is below 'notify_multicast_threshold', the
+ * new listener is added. If the threshold is reached, the channel is
+ * converted to broadcast mode.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ /*
+ * If the threshold is zero, this optimization is disabled. All channels
+ * immediately use broadcast, so we don't need to track them.
+ */
+ if (notify_multicast_threshold <= 0)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * FAST PATH: Optimistically take a shared lock. If the channel is already
+ * in broadcast mode, or if we are already listed, we are done.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (entry->is_broadcast)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ /* Check if we are already in the list */
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ }
+ }
+ LWLockRelease(lock);
+
+ /*
+ * SLOW PATH: We need to write. Acquire exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check state after acquiring exclusive lock, as it may have changed.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_ENTER_NULL, &found);
+
+ if (entry == NULL)
+ {
+ /* Out of memory in the hash partition. */
+ ereport(DEBUG1, (errmsg("too many notification channels are already being tracked")));
+ LWLockRelease(lock);
+ return;
+ }
+
+ if (!found)
+ {
+ /* First listener for this channel. */
+ entry->is_broadcast = false;
+ entry->num_listeners = 1;
+ entry->listeners[0] = procno;
+ }
+ else
+ {
+ /* Entry already exists, re-check everything. */
+ bool already_present = false;
+
+ if (entry->is_broadcast)
+ {
+ /* Another backend set it to broadcast mode. We're done. */
+ LWLockRelease(lock);
+ return;
+ }
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ already_present = true;
+ break;
+ }
+ }
+
+ if (!already_present)
+ {
+ if (entry->num_listeners < notify_multicast_threshold)
+ {
+ /* Add ourselves to the list of listeners. */
+ entry->listeners[entry->num_listeners] = procno;
+ entry->num_listeners++;
+ }
+ else
+ {
+ /* We are the listener that exceeds the threshold. */
+ entry->is_broadcast = true;
+ entry->num_listeners = 0; /* Clear the list */
+ }
+ }
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * This function uses an optimistic read-lock strategy. An exclusive lock is
+ * only taken if we are in the listener list for a channel and need to remove
+ * ourselves. If a channel is in broadcast mode, we cannot safely modify it,
+ * as we can't know which backends are listening.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+ bool present = false;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Take a shared lock first to see if a removal is even possible. If the
+ * entry doesn't exist, is in broadcast mode, or we're not in its list, we
+ * have nothing to do. This is the fast path.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (!entry || entry->is_broadcast)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+
+ /* Check if we are in the list */
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ present = true;
+ break;
+ }
+ }
+ if (!present)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /* A removal is likely needed. Acquire an exclusive lock. */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check the state. Another backend might have changed it (e.g., to
+ * broadcast mode).
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && !entry->is_broadcast)
+ {
+ int i;
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ /*
+ * Found our procno. Remove it from the listener array.
+ *
+ * If this is the last listener, we remove the entire hash
+ * entry for the channel.
+ */
+ if (entry->num_listeners == 1)
+ {
+ (void) hash_search(GetChannelHash(), &key, HASH_REMOVE, NULL);
+ }
+ else
+ {
+ /*
+ * To remove an element from the array while keeping it
+ * contiguous, we first decrement the listener count.
+ * Then, we shift all subsequent elements one position to
+ * the left, overwriting the element we want to remove.
+ *
+ * The `if (i < entry->num_listeners)` condition
+ * explicitly handles the case where the last element in
+ * the array is being removed. In that scenario, `i`
+ * equals the new `num_listeners`, so no memory movement
+ * is necessary, and the `memmove` is correctly skipped.
+ */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ Size size_to_move;
+
+ size_to_move = mul_size(entry->num_listeners - i,
+ sizeof(ProcNumber));
+ memmove(&entry->listeners[i],
+ &entry->listeners[i + 1],
+ size_to_move);
+ }
+ }
+ break; /* Found and removed, exit loop. */
+ }
+ }
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if the channel is not being tracked (no listeners, or channel
+ * fell back to broadcast mode because we ran out of shared memory when trying
+ * to add entries to the hash table).
+ *
+ * Caller must hold the appropriate partition lock (shared is sufficient).
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(GetChannelHashLock(channel)));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..25196e3246b 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -162,6 +162,7 @@ int commit_timestamp_buffers = 0;
int multixact_member_buffers = 32;
int multixact_offset_buffers = 16;
int notify_buffers = 16;
+int notify_multicast_threshold = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..1e642f9f69e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2464,6 +2464,19 @@ struct config_int ConfigureNamesInt[] =
check_notify_buffers, NULL, NULL
},
+ {
+ {"notify_multicast_threshold", PGC_POSTMASTER, RESOURCES_MEM,
+ gettext_noop("Sets the maximum number of listeners to track per channel for multicast signaling."),
+ gettext_noop("When the number of listeners on a channel exceeds this threshold, "
+ "NOTIFY will signal all listening backends rather than just those "
+ "listening on the specific channel. Setting to 0 disables multicast "
+ "signaling entirely."),
+ },
+ &notify_multicast_threshold,
+ 16, 0, MAX_BACKENDS,
+ check_notify_multicast_threshold, NULL, NULL
+ },
+
{
{"serializable_buffers", PGC_POSTMASTER, RESOURCES_MEM,
gettext_noop("Sets the size of the dedicated buffer pool used for the serializable transaction cache."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b23492653f3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -182,6 +182,7 @@ extern PGDLLIMPORT int commit_timestamp_buffers;
extern PGDLLIMPORT int multixact_member_buffers;
extern PGDLLIMPORT int multixact_offset_buffers;
extern PGDLLIMPORT int notify_buffers;
+extern PGDLLIMPORT int notify_multicast_threshold;
extern PGDLLIMPORT int serializable_buffers;
extern PGDLLIMPORT int subtransaction_buffers;
extern PGDLLIMPORT int transaction_buffers;
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 82ac8646a8d..ed3a00bb7e4 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -92,6 +92,7 @@ extern bool check_multixact_member_buffers(int *newval, void **extra,
extern bool check_multixact_offset_buffers(int *newval, void **extra,
GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
+extern bool check_notify_multicast_threshold(int *newval, void **extra, GucSource source);
extern bool check_primary_slot_name(char **newval, void **extra,
GucSource source);
extern bool check_random_seed(double *newval, void **extra, GucSource source);
--
2.47.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-23 01:39 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-07-23 01:39 UTC (permalink / raw)
To: pgsql-hackers; +Cc: Thomas Munro <[email protected]>; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Thu, Jul 17, 2025, at 09:43, Joel Jacobson wrote:
> On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
>> If we are doing this optimization, why not maintain a list of backends
>> for each channel, and only wake up those channels?
>
> Thanks for contributing a great idea, it actually turned out to work
> really well in practice!
>
> The attached new v4 of the patch implements your multicast idea:
Hi hackers,
While my previous attempts at $subject have only focused on optimizing
the multi-channel scenario, I thought it would be really nice if
LISTEN/NOTIFY could be optimized in the general case, benefiting all
users, including those who just listen on a single channel.
To my surprise, this was not only possible, but actually quite simple.
The main idea in this patch is to introduce an atomic state machine
with three states, IDLE, SIGNALLED, and PROCESSING, so that we don't
interrupt backends that are already in the process of catching up.
Thanks to Thomas Munro for making me aware of his, Heikki Linnakangas's,
and others' work in the "Interrupts vs signals" [1] thread.
Maybe my patch is redundant due to their patch set, I'm not really sure?
Their patch seems to refactor the underlying wakeup mechanism. It
replaces the old, complex chain of events (SIGUSR1 signal -> handler ->
flag -> latch) with a single, direct function call: SendInterrupt(). For
async.c, this seems to be a low-level plumbing change that simplifies
how a notification wakeup is delivered.
My patch optimizes the high-level notification protocol. It introduces a
state machine (IDLE, SIGNALLED, PROCESSING) to only signal backends when
needed.
In their patch, in async.c's SignalBackends(), they do
SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
seem to check if the backend is already signalled or not, but maybe
SendInterrupt() has signal coalescing built-in so it would be a noop
with almost no cost?
I'm happy to rebase my LISTEN/NOTIFY work on top of [1], but I could
also see benefits of doing the opposite.
I'm also happy to help with benchmarking of your work in [1].
Note that this patch doesn't contain the hash table to keep track of
listeners per backend, as proposed in earlier patches. I will propose
such a patch again later, but first we need to figure out if I should
rebase onto [1] or master (HEAD).
--- PATCH ---
Optimize NOTIFY signaling to avoid redundant backend signals
Previously, a NOTIFY would send SIGUSR1 to all listening backends, which
could lead to a "thundering herd" of redundant signals under high
traffic. To address this inefficiency, this patch replaces the simple
volatile notifyInterruptPending flag with a per-backend atomic state
machine, stored in asyncQueueControl->backend[i].state. This state
variable can be in one of three states: IDLE (awaiting signal),
SIGNALLED (signal received, work pending), or PROCESSING (actively
reading the queue).
From the notifier's perspective, SignalBackends now uses an atomic
compare-and-swap (CAS) to transition a listener from IDLE to SIGNALLED.
Only on a successful transition is a signal sent. If the listener is
already SIGNALLED or another notifier wins the race, no redundant signal
is sent. If the listener is in the PROCESSING state, the notifier will
also transition it to SIGNALLED to ensure the listener re-scans the
queue after its current work is done.
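The notifier-side rule described above can be sketched outside PostgreSQL with C11 atomics. This is only an illustration under stated assumptions: the enum values and the helper name notifier_should_signal() are hypothetical, and <stdatomic.h> stands in for the pg_atomic_* API. The sketch retries when the state changes between the read and the CAS; the SignalBackends() hunk in the attached patch makes a single CAS attempt per observed state, so the retry loop is an addition of this sketch, not something the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state values mirroring the three states named above. */
enum { LISTENER_IDLE = 0, LISTENER_SIGNALLED = 1, LISTENER_PROCESSING = 2 };

/*
 * Hypothetical helper: try to move a listener to SIGNALLED.
 * Returns true only for the caller that performed the transition,
 * i.e. the one notifier that should actually send a wakeup.
 */
static bool
notifier_should_signal(_Atomic uint32_t *state)
{
    uint32_t cur = atomic_load(state);

    /* If a wakeup is already pending there is nothing to do. */
    while (cur != LISTENER_SIGNALLED)
    {
        /*
         * IDLE -> SIGNALLED or PROCESSING -> SIGNALLED.  On failure the
         * CAS stores the value it actually found into cur and we retry.
         */
        if (atomic_compare_exchange_weak(state, &cur, LISTENER_SIGNALLED))
            return true;
    }
    return false;
}
```

With many concurrent notifiers, only the one whose CAS succeeds gets true back; all others see SIGNALLED and skip the wakeup.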
On the listener side, ProcessIncomingNotify first transitions its state
from SIGNALLED to PROCESSING. After reading notifications, it attempts
to transition from PROCESSING back to IDLE. If this CAS fails, it means
a new notification arrived during processing and a notifier has already
set the state back to SIGNALLED. The listener then simply re-latches
itself to process the new notifications, avoiding a tight loop.
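The listener-side transitions can be sketched the same way. Again this is a minimal illustration, not the patch itself: listener_begin() and listener_finish() are hypothetical names, the enum values are assumptions, and C11 <stdatomic.h> replaces the pg_atomic_* calls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state values mirroring the three states named above. */
enum { LISTENER_IDLE = 0, LISTENER_SIGNALLED = 1, LISTENER_PROCESSING = 2 };

/*
 * SIGNALLED -> PROCESSING.  Returns false if no wakeup was pending,
 * in which case the state is left untouched.
 */
static bool
listener_begin(_Atomic uint32_t *state)
{
    uint32_t expected = LISTENER_SIGNALLED;

    return atomic_compare_exchange_strong(state, &expected,
                                          LISTENER_PROCESSING);
}

/*
 * PROCESSING -> IDLE.  Returns true if the listener must run again
 * because a notifier moved it back to SIGNALLED while it was reading;
 * in that case the state stays SIGNALLED.
 */
static bool
listener_finish(_Atomic uint32_t *state)
{
    uint32_t expected = LISTENER_PROCESSING;

    if (atomic_compare_exchange_strong(state, &expected, LISTENER_IDLE))
        return false;           /* back to IDLE, nothing more to read */

    /* CAS failed: expected now holds the state that was actually found. */
    return expected == LISTENER_SIGNALLED;
}
```

A caller would read the queue between the two calls and, when listener_finish() returns true, arrange to be woken again (the patch does this by setting its own latch) rather than looping in place.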
The primary benefit is a significant reduction in syscall overhead and
unnecessary kernel wakeups in high-traffic scenarios. This dramatically
improves performance for workloads with many concurrent notifiers.
Benchmarks show a substantial increase in NOTIFY-only transaction
throughput, with gains exceeding 200% at higher
concurrency levels.
src/backend/commands/async.c | 209 ++++++++++++++++++++++++++++++-----
src/backend/tcop/postgres.c | 4 ++--
src/include/commands/async.h | 4 +++-
3 files changed, 185 insertions(+), 32 deletions(-)
--- BENCHMARK ---
The attached benchmark script does LISTEN on one connection,
and then uses pgbench to send NOTIFY on a varying number of
connections and jobs, to cause a high procsignal load.
I've run the benchmark on my MacBook Pro M3 Max,
10 seconds per run, 3 runs.
(I reused the same benchmark script as in the other thread, "Optimize ProcSignal to avoid redundant SIGUSR1 signals")
Connections=Jobs | TPS (master) | TPS (patch) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+--------------+-------------+-------------------+-----------------+----------------
1 | 118833 | 151510 | 27.50% | 484 | 923
2 | 156005 | 239051 | 53.23% | 3145 | 1596
4 | 177351 | 250910 | 41.48% | 4305 | 4891
8 | 116597 | 171944 | 47.47% | 1549 | 2752
16 | 40835 | 165482 | 305.25% | 2695 | 2825
32 | 37940 | 145150 | 282.58% | 2533 | 1566
64 | 35495 | 131836 | 271.42% | 1837 | 573
128 | 40193 | 121333 | 201.88% | 2254 | 874
(8 rows)
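For reference, the "Relative Diff (%)" column appears to be the patched TPS relative to master, (patch - master) / master * 100. A small sketch of that arithmetic follows; the function name relative_diff_pct() is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative TPS difference of the patched build
 * versus master, in percent.
 */
static double
relative_diff_pct(double master_tps, double patch_tps)
{
    return (patch_tps - master_tps) / master_tps * 100.0;
}
```

For the first row, (151510 - 118833) / 118833 * 100 is roughly 27.50, which matches the table.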
/Joel
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%4...
Attachments:
[application/octet-stream] 0001-Optimize-NOTIFY-signaling-to-avoid-redundant-backend.patch (14.4K, 2-0001-Optimize-NOTIFY-signaling-to-avoid-redundant-backend.patch)
download | inline diff:
From d4f01cda8bcd4042f0d751d73e13b561d8b1eaab Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 22 Jul 2025 10:32:34 +0200
Subject: [PATCH] Optimize NOTIFY signaling to avoid redundant backend signals
Previously, a NOTIFY would send SIGUSR1 to all listening backends, which
could lead to a "thundering herd" of redundant signals under high
traffic. To address this inefficiency, this patch replaces the simple
volatile notifyInterruptPending flag with a per-backend atomic state
machine, stored in asyncQueueControl->backend[i].state. This state
variable can be in one of three states: IDLE (awaiting signal),
SIGNALLED (signal received, work pending), or PROCESSING (actively
reading the queue).
From the notifier's perspective, SignalBackends now uses an atomic
compare-and-swap (CAS) to transition a listener from IDLE to SIGNALLED.
Only on a successful transition is a signal sent. If the listener is
already SIGNALLED or another notifier wins the race, no redundant signal
is sent. If the listener is in the PROCESSING state, the notifier will
also transition it to SIGNALLED to ensure the listener re-scans the
queue after its current work is done.
On the listener side, ProcessIncomingNotify first transitions its state
from SIGNALLED to PROCESSING. After reading notifications, it attempts
to transition from PROCESSING back to IDLE. If this CAS fails, it means
a new notification arrived during processing and a notifier has already
set the state back to SIGNALLED. The listener then simply re-latches
itself to process the new notifications, avoiding a tight loop.
The primary benefit is a significant reduction in syscall overhead and
unnecessary kernel wakeups in high-traffic scenarios. This dramatically
improves performance for workloads with many concurrent notifiers.
Benchmarks show a substantial increase in NOTIFY-only transaction
throughput, with gains exceeding 200% at higher
concurrency levels.
---
src/backend/commands/async.c | 209 ++++++++++++++++++++++++++++++-----
src/backend/tcop/postgres.c | 4 +-
src/include/commands/async.h | 4 +-
3 files changed, 185 insertions(+), 32 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..ae20017af9b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -150,8 +150,19 @@
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "port/atomics.h"
+/*
+ * Async notification state machine states
+ */
+typedef enum AsyncListenerState
+{
+ ASYNC_STATE_IDLE = 0, /* Backend is idle, waiting for signal */
+ ASYNC_STATE_SIGNALLED = 1, /* Backend has been signaled, will process soon */
+ ASYNC_STATE_PROCESSING = 2 /* Backend is actively processing notifications */
+} AsyncListenerState;
+
/*
* Maximum size of a NOTIFY payload, including terminating NULL. This
* must be kept small enough so that a notification message fits on one
@@ -246,6 +257,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ pg_atomic_uint32 state; /* async state machine state */
} QueueBackendStatus;
/*
@@ -301,6 +313,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_STATE(i) (asyncQueueControl->backend[i].state)
/*
* The SLRU buffer area through which we access the notification queue
@@ -405,12 +418,10 @@ static NotificationList *pendingNotifies = NULL;
/*
* Inbound notifications are initially processed by HandleNotifyInterrupt(),
- * called from inside a signal handler. That just sets the
- * notifyInterruptPending flag and sets the process
+ * called from inside a signal handler. That just sets the process
* latch. ProcessNotifyInterrupt() will then be called whenever it's safe to
* actually deal with the interrupt.
*/
-volatile sig_atomic_t notifyInterruptPending = false;
/* True if we've registered an on_shmem_exit cleanup */
static bool unlistenExitRegistered = false;
@@ -527,6 +538,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ pg_atomic_init_u32(&QUEUE_BACKEND_STATE(i), ASYNC_STATE_IDLE);
}
}
@@ -1099,6 +1111,8 @@ Exec_ListenPreCommit(void)
QUEUE_BACKEND_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
+ /* Initialize the atomic state to IDLE */
+ pg_atomic_write_u32(&QUEUE_BACKEND_STATE(MyProcNumber), ASYNC_STATE_IDLE);
/* Insert backend into list of listeners at correct position */
if (prevListener != INVALID_PROC_NUMBER)
{
@@ -1242,6 +1256,8 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ /* Reset state to IDLE to prevent zombie listeners */
+ pg_atomic_write_u32(&QUEUE_BACKEND_STATE(MyProcNumber), ASYNC_STATE_IDLE);
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1634,25 +1650,84 @@ SignalBackends(void)
for (int i = 0; i < count; i++)
{
int32 pid = pids[i];
+ ProcNumber procno = procnos[i];
+ uint32 expected;
+ bool signal_needed = false;
/*
- * If we are signaling our own process, no need to involve the kernel;
- * just set the flag directly.
+ * Implement state machine transitions for the notifier.
+ * We use a loop to handle race conditions where the state
+ * changes between our read and the CAS operation.
*/
- if (pid == MyProcPid)
+ uint32 current_state = pg_atomic_read_membarrier_u32(&QUEUE_BACKEND_STATE(procno));
+
+ switch (current_state)
{
- notifyInterruptPending = true;
- continue;
+ case ASYNC_STATE_IDLE:
+ /* Try to transition from IDLE to SIGNALLED */
+ expected = ASYNC_STATE_IDLE;
+ if (pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(procno),
+ &expected,
+ ASYNC_STATE_SIGNALLED))
+ {
+ /* Success - need to send signal */
+ signal_needed = true;
+ if (Trace_notify)
+ elog(DEBUG1, "SignalBackends: transitioned backend %d from IDLE to SIGNALLED", pid);
+ }
+ /* Another notifier already signaled - we're done */
+ break;
+
+ case ASYNC_STATE_SIGNALLED:
+ /* Backend is already signaled - nothing to do */
+ if (Trace_notify)
+ elog(DEBUG1, "SignalBackends: backend %d already in SIGNALLED state, skipping", pid);
+ break;
+
+ case ASYNC_STATE_PROCESSING:
+ /* Try to transition from PROCESSING to SIGNALLED */
+ expected = ASYNC_STATE_PROCESSING;
+ if (pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(procno),
+ &expected,
+ ASYNC_STATE_SIGNALLED))
+ {
+ /* Success - need to send signal for re-scan */
+ signal_needed = true;
+ if (Trace_notify)
+ elog(DEBUG1, "SignalBackends: transitioned backend %d from PROCESSING to SIGNALLED for re-scan", pid);
+ break;
+ }
+ /* Another notifier already signaled - we're done */
+ break;
+
+ default:
+ /* Should never happen */
+ elog(ERROR, "unexpected async state %u for backend %d",
+ current_state, pid);
}
- /*
- * Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
- */
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
- elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+ /* Send signal if needed */
+ if (signal_needed)
+ {
+ /*
+ * For our own process, no need to involve the kernel
+ */
+ if (pid == MyProcPid)
+ {
+ SetLatch(MyLatch);
+ }
+ else
+ {
+ /*
+ * Note: assuming things aren't broken, a signal failure here could
+ * only occur if the target backend exited since we released
+ * NotifyQueueLock; which is unlikely but certainly possible. So we
+ * just log a low-level debug message if it happens.
+ */
+ if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procno) < 0)
+ elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+ }
+ }
}
pfree(pids);
@@ -1805,20 +1880,43 @@ HandleNotifyInterrupt(void)
{
/*
* Note: this is called by a SIGNAL HANDLER. You must be very wary what
- * you do here.
+ * you do here. The actual state transition has already been done by
+ * the notifier before sending the signal, so we only need to set the
+ * latch to ensure the backend wakes up and processes the notification.
*/
- /* signal that work needs to be done */
- notifyInterruptPending = true;
-
/* make sure the event is processed in due course */
SetLatch(MyLatch);
}
+/*
+ * IsNotifyInterruptPending
+ *
+ * Check if there's a pending notify interrupt for this backend
+ */
+bool
+IsNotifyInterruptPending(void)
+{
+ uint32 state;
+
+ /* If not registered as a listener, no notifications are pending */
+ if (!amRegisteredListener)
+ return false;
+
+ /*
+ * Read the current state with a memory barrier to ensure we see
+ * the most recent value written by notifiers.
+ */
+ state = pg_atomic_read_membarrier_u32(&QUEUE_BACKEND_STATE(MyProcNumber));
+
+ /* Notification is pending if state is SIGNALLED */
+ return (state == ASYNC_STATE_SIGNALLED);
+}
+
/*
* ProcessNotifyInterrupt
*
- * This is called if we see notifyInterruptPending set, just before
+ * This is called if we see a notification interrupt is pending, just before
* transmitting ReadyForQuery at the end of a frontend command, and
* also if a notify signal occurs while reading from the frontend.
* HandleNotifyInterrupt() will cause the read to be interrupted
@@ -1837,7 +1935,7 @@ ProcessNotifyInterrupt(bool flush)
return; /* not really idle */
/* Loop in case another signal arrives while sending messages */
- while (notifyInterruptPending)
+ while (IsNotifyInterruptPending())
ProcessIncomingNotify(flush);
}
@@ -2182,28 +2280,81 @@ asyncQueueAdvanceTail(void)
static void
ProcessIncomingNotify(bool flush)
{
- /* We *must* reset the flag */
- notifyInterruptPending = false;
+ uint32 expected;
- /* Do nothing else if we aren't actively listening */
+ /* Do nothing if we aren't actively listening */
if (listenChannels == NIL)
return;
+ /*
+ * Perform state transition from SIGNALLED to PROCESSING.
+ * This is the "acquire lock" operation for the listener.
+ */
+ expected = ASYNC_STATE_SIGNALLED;
+ if (!pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(MyProcNumber),
+ &expected,
+ ASYNC_STATE_PROCESSING))
+ {
+ /*
+ * CAS failed - the state was not SIGNALLED. This should not happen
+ * as ProcessNotifyInterrupt only calls us when state is SIGNALLED.
+ */
+ elog(ERROR, "unexpected async state %u in ProcessIncomingNotify, expected SIGNALLED",
+ expected);
+ }
+
if (Trace_notify)
- elog(DEBUG1, "ProcessIncomingNotify");
+ elog(DEBUG1, "ProcessIncomingNotify: transitioned to PROCESSING");
set_ps_display("notify interrupt");
/*
- * We must run asyncQueueReadAllNotifications inside a transaction, else
- * bad things happen if it gets an error.
- */
+ * We must run asyncQueueReadAllNotifications inside a transaction, else
+ * bad things happen if it gets an error.
+ */
StartTransactionCommand();
asyncQueueReadAllNotifications();
CommitTransactionCommand();
+ /*
+ * Try to transition from PROCESSING back to IDLE.
+ * This is the "release lock" operation for the listener.
+ */
+ expected = ASYNC_STATE_PROCESSING;
+ if (pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(MyProcNumber),
+ &expected,
+ ASYNC_STATE_IDLE))
+ {
+ /* Success - we're done, transitioned to IDLE */
+ if (Trace_notify)
+ elog(DEBUG1, "ProcessIncomingNotify: transitioned to IDLE");
+ }
+ else
+ {
+ /* CAS failed - check what the new state is */
+ if (expected == ASYNC_STATE_SIGNALLED)
+ {
+ /*
+ * A notifier set our state to SIGNALLED while we were processing.
+ * We are done with this batch of work, but we know there is more
+ * to do. Rather than loop here and risk starving other backend
+ * activity, we set our own latch to ensure we are woken up again
+ * to re-process, and then exit. The state is left as SIGNALLED.
+ */
+ if (Trace_notify)
+ elog(DEBUG1, "ProcessIncomingNotify: signalled while processing");
+ SetLatch(MyLatch);
+ }
+ else
+ {
+ /* Any other state is an error */
+ elog(ERROR, "unexpected async state %u when trying to return to IDLE",
+ expected);
+ }
+ }
+
/*
* If this isn't an end-of-command case, we must flush the notify messages
* to ensure frontend gets them promptly.
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f918..3216247a58b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -512,7 +512,7 @@ ProcessClientReadInterrupt(bool blocked)
ProcessCatchupInterrupt();
/* Process notify interrupts, if any */
- if (notifyInterruptPending)
+ if (IsNotifyInterruptPending())
ProcessNotifyInterrupt(true);
}
else if (ProcDiePending)
@@ -4603,7 +4603,7 @@ PostgresMain(const char *dbname, const char *username)
* were received during the just-finished transaction, they'll
* be seen by the client before ReadyForQuery is.
*/
- if (notifyInterruptPending)
+ if (IsNotifyInterruptPending())
ProcessNotifyInterrupt(false);
/*
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..7f2e0ac0b9f 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -17,7 +17,6 @@
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT int max_notify_queue_pages;
-extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
extern Size AsyncShmemSize(void);
extern void AsyncShmemInit(void);
@@ -46,4 +45,7 @@ extern void HandleNotifyInterrupt(void);
/* process interrupts */
extern void ProcessNotifyInterrupt(bool flush);
+/* check if notification interrupt is pending */
+extern bool IsNotifyInterruptPending(void);
+
#endif /* ASYNC_H */
--
2.47.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-23 02:44 Thomas Munro <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Thomas Munro @ 2025-07-23 02:44 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <[email protected]> wrote:
> In their patch, in async.c's SignalBackends(), they do
> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
> seem to check if the backend is already signalled or not, but maybe
> SendInterrupt() has signal coalescing built-in so it would be a noop
> with almost no cost?
Yeah:
+ old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
+
+ /*
+ * If the process is currently blocked waiting for an interrupt to arrive,
+ * and the interrupt wasn't already pending, wake it up.
+ */
+ if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) ==
SLEEPING_ON_INTERRUPTS)
+ WakeupOtherProc(proc);
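The coalescing in the quoted hunk can be illustrated standalone with C11 atomics. This is a sketch under assumptions, not the code from the "Interrupts vs signals" patch set: the bit positions, the macro names SLEEPING_FLAG and NOTIFY_BIT, and the helper raise_interrupt() are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout: one "blocked in a wait" flag plus interrupt bits. */
#define SLEEPING_FLAG   (UINT32_C(1) << 31)
#define NOTIFY_BIT      (UINT32_C(1) << 0)

/*
 * Hypothetical helper: set an interrupt bit and report whether the caller
 * should wake the target.  A wakeup is needed only if the target was
 * sleeping and the bit was not already pending.
 */
static bool
raise_interrupt(_Atomic uint32_t *pending, uint32_t mask)
{
    /* fetch_or returns the value from before the bit was set. */
    uint32_t old = atomic_fetch_or(pending, mask);

    return (old & (mask | SLEEPING_FLAG)) == SLEEPING_FLAG;
}
```

A second sender of the same interrupt sees the bit already set in the old value and skips the wakeup, which is the "noop with almost no cost" case asked about in the quoted text.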
* Re: Optimize LISTEN/NOTIFY
@ 2025-07-24 21:03 Joel Jacobson <[email protected]>
parent: Thomas Munro <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-07-24 21:03 UTC (permalink / raw)
To: Thomas Munro <[email protected]>; +Cc: pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Wed, Jul 23, 2025, at 04:44, Thomas Munro wrote:
> On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <[email protected]> wrote:
>> In their patch, in async.c's SignalBackends(), they do
>> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
>> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
>> seem to check if the backend is already signalled or not, but maybe
>> SendInterrupt() has signal coalescing built-in so it would be a noop
>> with almost no cost?
>
> Yeah:
>
> + old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
> +
> + /*
> + * If the process is currently blocked waiting for an interrupt to arrive,
> + * and the interrupt wasn't already pending, wake it up.
> + */
> + if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) ==
> SLEEPING_ON_INTERRUPTS)
> + WakeupOtherProc(proc);
Thanks for confirming the coalescing logic in SendInterrupt. That's a
great low-level optimization. It's clear we're both targeting the same
problem of redundant wake-ups under contention, but approaching it from
different architectural levels.
The core difference, as I see it, is *where* the state management
resides. The "Interrupts vs signals" patch set creates a unified
machinery where the 'pending' state for all subsystems is combined into
a single atomic bitmask. This is a valid approach.
However, I've been exploring an alternative pattern that decouples the
state management from the signaling machinery, allowing each subsystem
to manage its own state independently. I believe this leads to a
simpler, more modular migration path. I've developed a two-patch series
for `async.c` to demonstrate this concept.
1. The first patch introduces a lock-free, atomic finite state machine
(FSM) entirely within async.c. By using a subsystem-specific atomic
integer and CAS operations, async.c can now robustly manage its own
listener states (IDLE, SIGNALLED, PROCESSING). This solves the
redundant signal problem at the source, as notifiers can now observe
a listener's state and refrain from sending a wakeup if one is
already pending.
2. The second patch demonstrates that once state is managed locally, the
wakeup mechanism becomes trivial. The expensive `SendProcSignal`
call is replaced with a direct `SetLatch`. This leverages the
existing, highly-optimized `WaitEventSet` infrastructure as a simple,
efficient "poke."
This suggests a powerful, incremental migration pattern: first, fix a
subsystem's state management internally; second, replace its wakeup
mechanism. This vertical, module-by-module approach seems complementary
to the horizontal, layer-by-layer refactoring in the "Interrupts vs
signals" thread.
I'll post a more detailed follow-up in that thread to discuss the
broader architectural implications. Attached are the two patches,
reframed to better illustrate this two-step pattern.
/Joel
#!/bin/bash
# Configuration for the PostgreSQL instances using absolute paths.
# This script does NOT modify the shell's PATH variable.
# --- Master Config ---
MASTER_NAME="master"
MASTER_PORT=5432
MASTER_BIN_PATH="$HOME/pg-master/bin"
MASTER_DATA="$HOME/pg-master-data"
MASTER_LOG="/tmp/pg-master.log"
# --- Patch v1 Config ---
PATCH_NAME="patch-v1"
PATCH_PORT=5432
PATCH_BIN_PATH="$HOME/pg-patch-v1/bin"
PATCH_DATA="$HOME/pg-patch-v1-data"
PATCH_LOG="/tmp/pg-patch-v1.log"
# --- Patch v2 Config ---
PATCH_V2_NAME="patch-v2"
PATCH_V2_PORT=5432
PATCH_V2_BIN_PATH="$HOME/pg-patch-v2/bin"
PATCH_V2_DATA="$HOME/pg-patch-v2-data"
PATCH_V2_LOG="/tmp/pg-patch-v2.log"
# Benchmark settings
CHANNEL_NAME="mychannel"
CONNECTIONS=(64 128)
DURATION=10 # Benchmark duration in seconds for each run
MEASUREMENTS=3 # Number of measurements per configuration
# CSV output file
CSV_OUTPUT="benchmark_results.csv"
# Temporary files
PGBENCH_SCRIPT=$(mktemp)
# --- Cleanup Function ---
# Ensures that servers are stopped and temp files are removed on script exit.
cleanup() {
echo ""
echo "Cleaning up..."
# Ensure all servers are stopped, silencing errors if they are not running.
# Use absolute paths and explicit data directories.
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -m fast stop &> /dev/null
"$PATCH_BIN_PATH/pg_ctl" -D "$PATCH_DATA" -m fast stop &> /dev/null
"$PATCH_V2_BIN_PATH/pg_ctl" -D "$PATCH_V2_DATA" -m fast stop &> /dev/null
rm -f "$PGBENCH_SCRIPT"
echo "Cleanup complete."
}
# Trap the script's exit (normal or interrupted) to run the cleanup function
trap cleanup EXIT
# Initialize CSV file with headers
# echo "version,connections,jobs,tps,run" > "$CSV_OUTPUT"
# --- Benchmark Function ---
# A generic function to run the benchmark for a given configuration.
# It starts, benchmarks, and then stops the specified server instance.
run_benchmark() {
local name=$1
local port=$2
local bin_path=$3
local data_path=$4
local log_file=$5
echo "--- Starting benchmark for: $name ---"
# Set PGPORT for client tools (pgbench, psql) for this run
export PGPORT=$port
# 1. Start the server using absolute path and explicit data directory
echo "Starting $name server on port $port..."
"$bin_path/pg_ctl" -D "$data_path" -l "$log_file" -o "-p $port" start
sleep 2 # Give server a moment to become available
# Create the pgbench script content
cat > "$PGBENCH_SCRIPT" << EOF
NOTIFY ${CHANNEL_NAME};
EOF
# 2. Start the listener in the background for this server
(echo "LISTEN ${CHANNEL_NAME};"; sleep 100000) | "$bin_path/psql" -d postgres &> /dev/null &
local listener_pid=$!
# 3. Run the benchmark loop
echo "Running pgbench for connection counts: ${CONNECTIONS[*]}"
for c in "${CONNECTIONS[@]}"; do
echo " Testing with $c connections ($MEASUREMENTS measurements per run)..."
# Run multiple measurements for each connection count
for m in $(seq 1 $MEASUREMENTS); do
# Run pgbench and extract TPS value
tps=$("$bin_path/pgbench" -d postgres -f "$PGBENCH_SCRIPT" -c "$c" -j "$c" -T "$DURATION" -n \
| grep -E '^tps' \
| awk '{printf "%.0f", $3}')
# Write to CSV: version,connections,jobs,tps,run
echo "$name,$c,$c,$tps,$m" >> "$CSV_OUTPUT"
done
done
# 4. Stop the listener and the server
kill "$listener_pid" &> /dev/null
echo "Stopping $name server..."
"$bin_path/pg_ctl" -D "$data_path" -m fast stop &> /dev/null
echo "--- Benchmark for $name complete ---"
echo ""
}
# --- Main Execution ---
# 1. Run benchmark for master
# run_benchmark "$MASTER_NAME" "$MASTER_PORT" "$MASTER_BIN_PATH" "$MASTER_DATA" "$MASTER_LOG"
# 2. Run benchmark for patch-v1
# run_benchmark "$PATCH_NAME" "$PATCH_PORT" "$PATCH_BIN_PATH" "$PATCH_DATA" "$PATCH_LOG"
# 3. Run benchmark for patch-v2
run_benchmark "$PATCH_V2_NAME" "$PATCH_V2_PORT" "$PATCH_V2_BIN_PATH" "$PATCH_V2_DATA" "$PATCH_V2_LOG"
# 4. Generate report using PostgreSQL
echo ""
echo "# BENCHMARK"
echo ""
echo "## TPS"
# Start the master server to run the analysis
export PGPORT=$MASTER_PORT
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -l "$MASTER_LOG" -o "-p $MASTER_PORT" start &> /dev/null
sleep 2
# Create analysis database and load data
"$MASTER_BIN_PATH/psql" -d postgres -q << EOF
-- Create a temporary database for analysis
DROP DATABASE IF EXISTS bench_analysis;
CREATE DATABASE bench_analysis;
\c bench_analysis
-- Create table for benchmark results
CREATE TABLE benchmark_results (
version TEXT,
connections INT,
jobs INT,
tps NUMERIC,
run INT
);
-- Load CSV data
\COPY benchmark_results FROM '$CSV_OUTPUT' CSV HEADER
-- Generate comparison report
WITH avg_results AS (
SELECT
version,
connections,
AVG(tps) AS avg_tps,
STDDEV(tps) AS stddev_tps,
COUNT(*) AS runs
FROM benchmark_results
GROUP BY version, connections
),
comparison AS (
SELECT
m.connections,
m.avg_tps AS master_tps,
p1.avg_tps AS patch_v1_tps,
p2.avg_tps AS patch_v2_tps,
CASE
WHEN m.avg_tps > 0 THEN ((p1.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v1_pct,
CASE
WHEN m.avg_tps > 0 THEN ((p2.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v2_pct
FROM avg_results m
JOIN avg_results p1 ON m.connections = p1.connections
JOIN avg_results p2 ON m.connections = p2.connections
WHERE m.version = 'master' AND p1.version = 'patch-v1' AND p2.version = 'patch-v2'
ORDER BY m.connections
)
SELECT
connections AS "N backends",
ROUND(master_tps) AS "master",
ROUND(patch_v1_tps) AS "patch-v1",
ROUND(patch_v2_tps) AS "patch-v2"
FROM comparison
ORDER BY connections;
EOF
echo ""
echo "## TPS speed-up vs master"
"$MASTER_BIN_PATH/psql" -d bench_analysis -q << EOF
SELECT
connections AS "N backends",
CASE WHEN relative_diff_patch_v1_pct >= 0 THEN '+' ELSE '' END ||
ROUND(relative_diff_patch_v1_pct) || '%' AS "patch-v1",
CASE WHEN relative_diff_patch_v2_pct >= 0 THEN '+' ELSE '' END ||
ROUND(relative_diff_patch_v2_pct) || '%' AS "patch-v2"
FROM (
WITH avg_results AS (
SELECT
version,
connections,
AVG(tps) AS avg_tps
FROM benchmark_results
GROUP BY version, connections
)
SELECT
m.connections,
CASE
WHEN m.avg_tps > 0 THEN ((p1.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v1_pct,
CASE
WHEN m.avg_tps > 0 THEN ((p2.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v2_pct
FROM avg_results m
JOIN avg_results p1 ON m.connections = p1.connections
JOIN avg_results p2 ON m.connections = p2.connections
WHERE m.version = 'master' AND p1.version = 'patch-v1' AND p2.version = 'patch-v2'
) AS comparison
ORDER BY connections;
EOF
# Stop the server
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -m fast stop &> /dev/null
echo ""
echo "CSV results saved to: $CSV_OUTPUT"
# BENCHMARK
A single backend does `LISTEN mychannel;` and stays idle,
then pgbench is run 3 times for each <N backends>.
script.sql: NOTIFY mychannel;
% pgbench -f script.sql -c <N backends> -j <N backends> -T 10 -n
## TPS
N backends | master | patch-v1 | patch-v2
------------+--------+----------+----------
1 | 117343 | 151422 | 150735
2 | 158427 | 236705 | 239004
4 | 177454 | 250783 | 250782
8 | 116521 | 155466 | 180418
16 | 45627 | 144740 | 163491
32 | 37281 | 135602 | 146659
64 | 36608 | 123870 | 131202
128 | 34798 | 120302 | 119041
(8 rows)
## TPS speed-up vs master
N backends | patch-v1 | patch-v2
------------+----------+----------
1 | +29% | +28%
2 | +49% | +51%
4 | +41% | +41%
8 | +33% | +55%
16 | +217% | +258%
32 | +264% | +293%
64 | +238% | +258%
128 | +246% | +242%
(8 rows)
Attachments:
[text/plain] pgbench-script.txt (7.0K, 2-pgbench-script.txt)
download | inline:
#!/bin/bash
# Configuration for the PostgreSQL instances using absolute paths.
# This script does NOT modify the shell's PATH variable.
# --- Master Config ---
MASTER_NAME="master"
MASTER_PORT=5432
MASTER_BIN_PATH="$HOME/pg-master/bin"
MASTER_DATA="$HOME/pg-master-data"
MASTER_LOG="/tmp/pg-master.log"
# --- Patch v1 Config ---
PATCH_NAME="patch-v1"
PATCH_PORT=5432
PATCH_BIN_PATH="$HOME/pg-patch-v1/bin"
PATCH_DATA="$HOME/pg-patch-v1-data"
PATCH_LOG="/tmp/pg-patch-v1.log"
# --- Patch v2 Config ---
PATCH_V2_NAME="patch-v2"
PATCH_V2_PORT=5432
PATCH_V2_BIN_PATH="$HOME/pg-patch-v2/bin"
PATCH_V2_DATA="$HOME/pg-patch-v2-data"
PATCH_V2_LOG="/tmp/pg-patch-v2.log"
# Benchmark settings
CHANNEL_NAME="mychannel"
CONNECTIONS=(64 128)
DURATION=10 # Benchmark duration in seconds for each run
MEASUREMENTS=3 # Number of measurements per configuration
# CSV output file
CSV_OUTPUT="benchmark_results.csv"
# Temporary files
PGBENCH_SCRIPT=$(mktemp)
# --- Cleanup Function ---
# Ensures that servers are stopped and temp files are removed on script exit.
cleanup() {
echo ""
echo "Cleaning up..."
# Ensure all servers are stopped, silencing errors if they are not running.
# Use absolute paths and explicit data directories.
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -m fast stop &> /dev/null
"$PATCH_BIN_PATH/pg_ctl" -D "$PATCH_DATA" -m fast stop &> /dev/null
"$PATCH_V2_BIN_PATH/pg_ctl" -D "$PATCH_V2_DATA" -m fast stop &> /dev/null
rm -f "$PGBENCH_SCRIPT"
echo "Cleanup complete."
}
# Trap the script's exit (normal or interrupted) to run the cleanup function
trap cleanup EXIT
# Initialize CSV file with headers
# echo "version,connections,jobs,tps,run" > "$CSV_OUTPUT"
# --- Benchmark Function ---
# A generic function to run the benchmark for a given configuration.
# It starts, benchmarks, and then stops the specified server instance.
run_benchmark() {
local name=$1
local port=$2
local bin_path=$3
local data_path=$4
local log_file=$5
echo "--- Starting benchmark for: $name ---"
# Set PGPORT for client tools (pgbench, psql) for this run
export PGPORT=$port
# 1. Start the server using absolute path and explicit data directory
echo "Starting $name server on port $port..."
"$bin_path/pg_ctl" -D "$data_path" -l "$log_file" -o "-p $port" start
sleep 2 # Give server a moment to become available
# Create the pgbench script content
cat > "$PGBENCH_SCRIPT" << EOF
NOTIFY ${CHANNEL_NAME};
EOF
# 2. Start the listener in the background for this server
(echo "LISTEN ${CHANNEL_NAME};"; sleep 100000) | "$bin_path/psql" -d postgres &> /dev/null &
local listener_pid=$!
# 3. Run the benchmark loop
echo "Running pgbench for connection counts: ${CONNECTIONS[*]}"
for c in "${CONNECTIONS[@]}"; do
echo " Testing with $c connections ($MEASUREMENTS measurements per run)..."
# Run multiple measurements for each connection count
for m in $(seq 1 $MEASUREMENTS); do
# Run pgbench and extract TPS value
tps=$("$bin_path/pgbench" -d postgres -f "$PGBENCH_SCRIPT" -c "$c" -j "$c" -T "$DURATION" -n \
| grep -E '^tps' \
| awk '{printf "%.0f", $3}')
# Write to CSV: version,connections,jobs,tps,run
echo "$name,$c,$c,$tps,$m" >> "$CSV_OUTPUT"
done
done
# 4. Stop the listener and the server
kill "$listener_pid" &> /dev/null
echo "Stopping $name server..."
"$bin_path/pg_ctl" -D "$data_path" -m fast stop &> /dev/null
echo "--- Benchmark for $name complete ---"
echo ""
}
# --- Main Execution ---
# 1. Run benchmark for master
# run_benchmark "$MASTER_NAME" "$MASTER_PORT" "$MASTER_BIN_PATH" "$MASTER_DATA" "$MASTER_LOG"
# 2. Run benchmark for patch-v1
# run_benchmark "$PATCH_NAME" "$PATCH_PORT" "$PATCH_BIN_PATH" "$PATCH_DATA" "$PATCH_LOG"
# 3. Run benchmark for patch-v2
run_benchmark "$PATCH_V2_NAME" "$PATCH_V2_PORT" "$PATCH_V2_BIN_PATH" "$PATCH_V2_DATA" "$PATCH_V2_LOG"
# 4. Generate report using PostgreSQL
echo ""
echo "# BENCHMARK"
echo ""
echo "## TPS"
# Start the master server to run the analysis
export PGPORT=$MASTER_PORT
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -l "$MASTER_LOG" -o "-p $MASTER_PORT" start &> /dev/null
sleep 2
# Create analysis database and load data
"$MASTER_BIN_PATH/psql" -d postgres -q << EOF
-- Create a temporary database for analysis
DROP DATABASE IF EXISTS bench_analysis;
CREATE DATABASE bench_analysis;
\c bench_analysis
-- Create table for benchmark results
CREATE TABLE benchmark_results (
version TEXT,
connections INT,
jobs INT,
tps NUMERIC,
run INT
);
-- Load CSV data
\COPY benchmark_results FROM '$CSV_OUTPUT' CSV HEADER
-- Generate comparison report
WITH avg_results AS (
SELECT
version,
connections,
AVG(tps) AS avg_tps,
STDDEV(tps) AS stddev_tps,
COUNT(*) AS runs
FROM benchmark_results
GROUP BY version, connections
),
comparison AS (
SELECT
m.connections,
m.avg_tps AS master_tps,
p1.avg_tps AS patch_v1_tps,
p2.avg_tps AS patch_v2_tps,
CASE
WHEN m.avg_tps > 0 THEN ((p1.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v1_pct,
CASE
WHEN m.avg_tps > 0 THEN ((p2.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v2_pct
FROM avg_results m
JOIN avg_results p1 ON m.connections = p1.connections
JOIN avg_results p2 ON m.connections = p2.connections
WHERE m.version = 'master' AND p1.version = 'patch-v1' AND p2.version = 'patch-v2'
ORDER BY m.connections
)
SELECT
connections AS "N backends",
ROUND(master_tps) AS "master",
ROUND(patch_v1_tps) AS "patch-v1",
ROUND(patch_v2_tps) AS "patch-v2"
FROM comparison
ORDER BY connections;
EOF
echo ""
echo "## TPS speed-up vs master"
"$MASTER_BIN_PATH/psql" -d bench_analysis -q << EOF
SELECT
connections AS "N backends",
CASE WHEN relative_diff_patch_v1_pct >= 0 THEN '+' ELSE '' END ||
ROUND(relative_diff_patch_v1_pct) || '%' AS "patch-v1",
CASE WHEN relative_diff_patch_v2_pct >= 0 THEN '+' ELSE '' END ||
ROUND(relative_diff_patch_v2_pct) || '%' AS "patch-v2"
FROM (
WITH avg_results AS (
SELECT
version,
connections,
AVG(tps) AS avg_tps
FROM benchmark_results
GROUP BY version, connections
)
SELECT
m.connections,
CASE
WHEN m.avg_tps > 0 THEN ((p1.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v1_pct,
CASE
WHEN m.avg_tps > 0 THEN ((p2.avg_tps - m.avg_tps) / m.avg_tps * 100)
ELSE 0
END AS relative_diff_patch_v2_pct
FROM avg_results m
JOIN avg_results p1 ON m.connections = p1.connections
JOIN avg_results p2 ON m.connections = p2.connections
WHERE m.version = 'master' AND p1.version = 'patch-v1' AND p2.version = 'patch-v2'
) AS comparison
ORDER BY connections;
EOF
# Stop the server
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -m fast stop &> /dev/null
echo ""
echo "CSV results saved to: $CSV_OUTPUT"
[text/plain] pgbench-results.txt (1020B, 3-pgbench-results.txt)
# BENCHMARK
A single backend does `LISTEN mychannel;` and stays idle,
then pgbench is run 3 times for each <N backends>.
script.sql: NOTIFY mychannel;
% pgbench -f script.sql -c <N backends> -j <N backends> -T 10 -n
## TPS
N backends | master | patch-v1 | patch-v2
------------+--------+----------+----------
1 | 117343 | 151422 | 150735
2 | 158427 | 236705 | 239004
4 | 177454 | 250783 | 250782
8 | 116521 | 155466 | 180418
16 | 45627 | 144740 | 163491
32 | 37281 | 135602 | 146659
64 | 36608 | 123870 | 131202
128 | 34798 | 120302 | 119041
(8 rows)
## TPS speed-up vs master
N backends | patch-v1 | patch-v2
------------+----------+----------
1 | +29% | +28%
2 | +49% | +51%
4 | +41% | +41%
8 | +33% | +55%
16 | +217% | +258%
32 | +264% | +293%
64 | +238% | +258%
128 | +246% | +242%
(8 rows)
[application/octet-stream] 0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch (14.1K, 4-0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch)
From 17777283bda5fa41b430e4f71a7246d3f04a94bf Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 22 Jul 2025 10:32:34 +0200
Subject: [PATCH 1/2] Optimize LISTEN/NOTIFY signaling with a lock-free atomic
state machine
This commit introduces a powerful pattern for modernizing inter-process
communication by refactoring the LISTEN/NOTIFY subsystem to use a
lock-free, atomic finite state machine (FSM). This directly addresses
the historical lack of safe, efficient state synchronization primitives.
Previously, if multiple transactions sent notifications concurrently,
each would unconditionally attempt to signal all listening backends.
This resulted in a storm of superfluous signals to listeners that were
already pending a wakeup, causing unnecessary system call overhead.
By introducing an atomic per-backend state (IDLE, SIGNALLED, PROCESSING)
in shared memory and manipulated via compare-and-swap (CAS), this
inefficiency is eliminated. A notifier can now atomically transition a
listener's state from IDLE to SIGNALLED, ensuring that only the first
notifier for a given idle listener dispatches a wakeup. The FSM also
robustly handles race conditions where new notifications arrive while a
listener is PROCESSING, guaranteeing no work is ever missed.
This FSM pattern is a generalizable solution for managing concurrency in
PostgreSQL. By modeling inter-process interactions as explicit state
transitions, we can build more robust and performant subsystems. This
commit demonstrates the pattern's effectiveness within async.c, and by
cleanly solving the state management problem first, it enables a
subsequent, trivial optimization of the wakeup mechanism itself.
---
src/backend/commands/async.c | 209 ++++++++++++++++++++++++++++++-----
src/backend/tcop/postgres.c | 4 +-
src/include/commands/async.h | 4 +-
3 files changed, 185 insertions(+), 32 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..ae20017af9b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -150,8 +150,19 @@
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "port/atomics.h"
+/*
+ * Async notification state machine states
+ */
+typedef enum AsyncListenerState
+{
+ ASYNC_STATE_IDLE = 0, /* Backend is idle, waiting for signal */
+ ASYNC_STATE_SIGNALLED = 1, /* Backend has been signaled, will process soon */
+ ASYNC_STATE_PROCESSING = 2 /* Backend is actively processing notifications */
+} AsyncListenerState;
+
/*
* Maximum size of a NOTIFY payload, including terminating NULL. This
* must be kept small enough so that a notification message fits on one
@@ -246,6 +257,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ pg_atomic_uint32 state; /* async state machine state */
} QueueBackendStatus;
/*
@@ -301,6 +313,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_STATE(i) (asyncQueueControl->backend[i].state)
/*
* The SLRU buffer area through which we access the notification queue
@@ -405,12 +418,10 @@ static NotificationList *pendingNotifies = NULL;
/*
* Inbound notifications are initially processed by HandleNotifyInterrupt(),
- * called from inside a signal handler. That just sets the
- * notifyInterruptPending flag and sets the process
+ * called from inside a signal handler. That just sets the process
* latch. ProcessNotifyInterrupt() will then be called whenever it's safe to
* actually deal with the interrupt.
*/
-volatile sig_atomic_t notifyInterruptPending = false;
/* True if we've registered an on_shmem_exit cleanup */
static bool unlistenExitRegistered = false;
@@ -527,6 +538,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ pg_atomic_init_u32(&QUEUE_BACKEND_STATE(i), ASYNC_STATE_IDLE);
}
}
@@ -1099,6 +1111,8 @@ Exec_ListenPreCommit(void)
QUEUE_BACKEND_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
+ /* Initialize the atomic state to IDLE */
+ pg_atomic_write_u32(&QUEUE_BACKEND_STATE(MyProcNumber), ASYNC_STATE_IDLE);
/* Insert backend into list of listeners at correct position */
if (prevListener != INVALID_PROC_NUMBER)
{
@@ -1242,6 +1256,8 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ /* Reset state to IDLE to prevent zombie listeners */
+ pg_atomic_write_u32(&QUEUE_BACKEND_STATE(MyProcNumber), ASYNC_STATE_IDLE);
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1634,25 +1650,84 @@ SignalBackends(void)
for (int i = 0; i < count; i++)
{
int32 pid = pids[i];
+ ProcNumber procno = procnos[i];
+ uint32 expected;
+ bool signal_needed = false;
/*
- * If we are signaling our own process, no need to involve the kernel;
- * just set the flag directly.
+ * Implement state machine transitions for the notifier.
+ * We use a loop to handle race conditions where the state
+ * changes between our read and the CAS operation.
*/
- if (pid == MyProcPid)
+ uint32 current_state = pg_atomic_read_membarrier_u32(&QUEUE_BACKEND_STATE(procno));
+
+ switch (current_state)
{
- notifyInterruptPending = true;
- continue;
+ case ASYNC_STATE_IDLE:
+ /* Try to transition from IDLE to SIGNALLED */
+ expected = ASYNC_STATE_IDLE;
+ if (pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(procno),
+ &expected,
+ ASYNC_STATE_SIGNALLED))
+ {
+ /* Success - need to send signal */
+ signal_needed = true;
+ if (Trace_notify)
+ elog(DEBUG1, "SignalBackends: transitioned backend %d from IDLE to SIGNALLED", pid);
+ }
+ /* Another notifier already signaled - we're done */
+ break;
+
+ case ASYNC_STATE_SIGNALLED:
+ /* Backend is already signaled - nothing to do */
+ if (Trace_notify)
+ elog(DEBUG1, "SignalBackends: backend %d already in SIGNALLED state, skipping", pid);
+ break;
+
+ case ASYNC_STATE_PROCESSING:
+ /* Try to transition from PROCESSING to SIGNALLED */
+ expected = ASYNC_STATE_PROCESSING;
+ if (pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(procno),
+ &expected,
+ ASYNC_STATE_SIGNALLED))
+ {
+ /* Success - need to send signal for re-scan */
+ signal_needed = true;
+ if (Trace_notify)
+ elog(DEBUG1, "SignalBackends: transitioned backend %d from PROCESSING to SIGNALLED for re-scan", pid);
+ break;
+ }
+ /* Another notifier already signaled - we're done */
+ break;
+
+ default:
+ /* Should never happen */
+ elog(ERROR, "unexpected async state %u for backend %d",
+ current_state, pid);
}
- /*
- * Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
- */
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
- elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+ /* Send signal if needed */
+ if (signal_needed)
+ {
+ /*
+ * For our own process, no need to involve the kernel
+ */
+ if (pid == MyProcPid)
+ {
+ SetLatch(MyLatch);
+ }
+ else
+ {
+ /*
+ * Note: assuming things aren't broken, a signal failure here could
+ * only occur if the target backend exited since we released
+ * NotifyQueueLock; which is unlikely but certainly possible. So we
+ * just log a low-level debug message if it happens.
+ */
+ if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procno) < 0)
+ elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+ }
+ }
}
pfree(pids);
@@ -1805,20 +1880,43 @@ HandleNotifyInterrupt(void)
{
/*
* Note: this is called by a SIGNAL HANDLER. You must be very wary what
- * you do here.
+ * you do here. The actual state transition has already been done by
+ * the notifier before sending the signal, so we only need to set the
+ * latch to ensure the backend wakes up and processes the notification.
*/
- /* signal that work needs to be done */
- notifyInterruptPending = true;
-
/* make sure the event is processed in due course */
SetLatch(MyLatch);
}
+/*
+ * IsNotifyInterruptPending
+ *
+ * Check if there's a pending notify interrupt for this backend
+ */
+bool
+IsNotifyInterruptPending(void)
+{
+ uint32 state;
+
+ /* If not registered as a listener, no notifications are pending */
+ if (!amRegisteredListener)
+ return false;
+
+ /*
+ * Read the current state with a memory barrier to ensure we see
+ * the most recent value written by notifiers.
+ */
+ state = pg_atomic_read_membarrier_u32(&QUEUE_BACKEND_STATE(MyProcNumber));
+
+ /* Notification is pending if state is SIGNALLED */
+ return (state == ASYNC_STATE_SIGNALLED);
+}
+
/*
* ProcessNotifyInterrupt
*
- * This is called if we see notifyInterruptPending set, just before
+ * This is called if we see a notification interrupt is pending, just before
* transmitting ReadyForQuery at the end of a frontend command, and
* also if a notify signal occurs while reading from the frontend.
* HandleNotifyInterrupt() will cause the read to be interrupted
@@ -1837,7 +1935,7 @@ ProcessNotifyInterrupt(bool flush)
return; /* not really idle */
/* Loop in case another signal arrives while sending messages */
- while (notifyInterruptPending)
+ while (IsNotifyInterruptPending())
ProcessIncomingNotify(flush);
}
@@ -2182,28 +2280,81 @@ asyncQueueAdvanceTail(void)
static void
ProcessIncomingNotify(bool flush)
{
- /* We *must* reset the flag */
- notifyInterruptPending = false;
+ uint32 expected;
- /* Do nothing else if we aren't actively listening */
+ /* Do nothing if we aren't actively listening */
if (listenChannels == NIL)
return;
+ /*
+ * Perform state transition from SIGNALLED to PROCESSING.
+ * This is the "acquire lock" operation for the listener.
+ */
+ expected = ASYNC_STATE_SIGNALLED;
+ if (!pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(MyProcNumber),
+ &expected,
+ ASYNC_STATE_PROCESSING))
+ {
+ /*
+ * CAS failed - the state was not SIGNALLED. This should not happen
+ * as ProcessNotifyInterrupt only calls us when state is SIGNALLED.
+ */
+ elog(ERROR, "unexpected async state %u in ProcessIncomingNotify, expected SIGNALLED",
+ expected);
+ }
+
if (Trace_notify)
- elog(DEBUG1, "ProcessIncomingNotify");
+ elog(DEBUG1, "ProcessIncomingNotify: transitioned to PROCESSING");
set_ps_display("notify interrupt");
/*
- * We must run asyncQueueReadAllNotifications inside a transaction, else
- * bad things happen if it gets an error.
- */
+ * We must run asyncQueueReadAllNotifications inside a transaction, else
+ * bad things happen if it gets an error.
+ */
StartTransactionCommand();
asyncQueueReadAllNotifications();
CommitTransactionCommand();
+ /*
+ * Try to transition from PROCESSING back to IDLE.
+ * This is the "release lock" operation for the listener.
+ */
+ expected = ASYNC_STATE_PROCESSING;
+ if (pg_atomic_compare_exchange_u32(&QUEUE_BACKEND_STATE(MyProcNumber),
+ &expected,
+ ASYNC_STATE_IDLE))
+ {
+ /* Success - we're done, transitioned to IDLE */
+ if (Trace_notify)
+ elog(DEBUG1, "ProcessIncomingNotify: transitioned to IDLE");
+ }
+ else
+ {
+ /* CAS failed - check what the new state is */
+ if (expected == ASYNC_STATE_SIGNALLED)
+ {
+ /*
+ * A notifier set our state to SIGNALLED while we were processing.
+ * We are done with this batch of work, but we know there is more
+ * to do. Rather than loop here and risk starving other backend
+ * activity, we set our own latch to ensure we are woken up again
+ * to re-process, and then exit. The state is left as SIGNALLED.
+ */
+ if (Trace_notify)
+ elog(DEBUG1, "ProcessIncomingNotify: signalled while processing");
+ SetLatch(MyLatch);
+ }
+ else
+ {
+ /* Any other state is an error */
+ elog(ERROR, "unexpected async state %u when trying to return to IDLE",
+ expected);
+ }
+ }
+
/*
* If this isn't an end-of-command case, we must flush the notify messages
* to ensure frontend gets them promptly.
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a297606cdd7..e1d80cbefea 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -512,7 +512,7 @@ ProcessClientReadInterrupt(bool blocked)
ProcessCatchupInterrupt();
/* Process notify interrupts, if any */
- if (notifyInterruptPending)
+ if (IsNotifyInterruptPending())
ProcessNotifyInterrupt(true);
}
else if (ProcDiePending)
@@ -4604,7 +4604,7 @@ PostgresMain(const char *dbname, const char *username)
* were received during the just-finished transaction, they'll
* be seen by the client before ReadyForQuery is.
*/
- if (notifyInterruptPending)
+ if (IsNotifyInterruptPending())
ProcessNotifyInterrupt(false);
/*
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..7f2e0ac0b9f 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -17,7 +17,6 @@
extern PGDLLIMPORT bool Trace_notify;
extern PGDLLIMPORT int max_notify_queue_pages;
-extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
extern Size AsyncShmemSize(void);
extern void AsyncShmemInit(void);
@@ -46,4 +45,7 @@ extern void HandleNotifyInterrupt(void);
/* process interrupts */
extern void ProcessNotifyInterrupt(bool flush);
+/* check if notification interrupt is pending */
+extern bool IsNotifyInterruptPending(void);
+
#endif /* ASYNC_H */
--
2.47.1
[application/octet-stream] 0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch (3.3K, 5-0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch)
From 31e419747ab92dbc29d0d9db58d88ff2d2caf5c9 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Thu, 24 Jul 2025 21:17:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY wakeup by replacing signal with
direct SetLatch
Building upon the robust atomic state machine introduced in the previous
commit, this change completes the modernization of NOTIFY IPC by
replacing its wakeup mechanism. With inter-process state now managed
reliably, the heavyweight SIGUSR1 signal is no longer necessary and is
replaced with a much more efficient, direct "poke."
The async.c notifier now replaces its call to SendProcSignal with a
direct call to SetLatch on the target backend's procLatch. This is a
significant optimization because WaitLatch, which listeners already use
for blocking, is underpinned by the modern WaitEventSet abstraction
(kqueue, epoll, etc.). We now leverage this existing, highly efficient
infrastructure for the wakeup, completely bypassing the kill() syscall
and the SIGUSR1 signal handler for all NOTIFY events.
This demonstrates a powerful, two-step migration pattern:
1. First, solve a subsystem's state synchronization problem with a
lock-free, atomic FSM to eliminate redundant signaling.
2. Then, with state management handled, make the wakeup itself cheaper
by replacing the expensive signal with a direct SetLatch.
This staged approach allows us to modernize subsystems incrementally and
safely. By applying this pattern to async.c, we prove its viability and
simplicity, creating a clear template for other parts of the system to
follow in moving towards a more performant, signal-free IPC model.
---
src/backend/commands/async.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index ae20017af9b..c871774b72c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -142,6 +142,7 @@
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
+#include "storage/proc.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -1719,13 +1720,25 @@ SignalBackends(void)
else
{
/*
- * Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * Get the target backend's PGPROC and set its latch.
+ *
+ * Note: The target backend might exit after we released
+ * NotifyQueueLock but before we set the latch. We need to
+ * handle the race condition where the PGPROC slot might be
+ * recycled by a new process with a different PID.
*/
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procno) < 0)
- elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
+ PGPROC *proc = GetPGProcByNumber(procno);
+
+ /* Verify the PID hasn't changed (backend hasn't exited) */
+ if (proc->pid == pid)
+ {
+ SetLatch(&proc->procLatch);
+ }
+ else
+ {
+ /* Backend exited and slot was recycled */
+ elog(DEBUG3, "could not signal backend with PID %d: process no longer exists", pid);
+ }
}
}
}
--
2.47.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-08-07 00:16 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-08-07 00:16 UTC (permalink / raw)
To: Thomas Munro <[email protected]>; +Cc: pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Thu, Jul 24, 2025, at 23:03, Joel Jacobson wrote:
> * 0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
> * 0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch
I'm withdrawing the latest patches, since they won't fix the scalability
problems; they only provide some performance improvement by eliminating
redundant IPC signalling. That could also be improved outside of
async.c, either by optimizing ProcSignal [1] or by removing ProcSignal
altogether, which the "Interrupts vs Signals" thread [2] is working on.
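To illustrate what "eliminating redundant IPC signalling" means in the withdrawn patches, here is a minimal, self-contained sketch of the per-listener state machine, written with C11 atomics rather than PostgreSQL's pg_atomic_* API. All names below (LISTENER_*, notifier_should_wake, listener_begin, listener_finish) are hypothetical and do not appear in the patch. Unlike the patch, which makes a single compare-and-swap attempt per listener, this sketch retries when the CAS fails; that is a choice made for the sketch, not a description of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's IDLE / SIGNALLED / PROCESSING states */
enum
{
    LISTENER_IDLE = 0,          /* waiting for a wakeup */
    LISTENER_SIGNALLED = 1,     /* a wakeup is already on its way */
    LISTENER_PROCESSING = 2     /* currently reading the queue */
};

/*
 * Notifier side. Returns true if this caller moved the listener to
 * SIGNALLED and therefore must send the wakeup; false if some other
 * notifier already did, so the wakeup can be skipped.
 */
static bool
notifier_should_wake(_Atomic uint32_t *state)
{
    uint32_t    expected = atomic_load(state);

    while (expected != LISTENER_SIGNALLED)
    {
        assert(expected == LISTENER_IDLE || expected == LISTENER_PROCESSING);
        if (atomic_compare_exchange_weak(state, &expected, LISTENER_SIGNALLED))
            return true;
        /* CAS failure stored the current value in 'expected'; re-check it */
    }
    return false;
}

/* Listener side: SIGNALLED -> PROCESSING before reading the queue. */
static bool
listener_begin(_Atomic uint32_t *state)
{
    uint32_t    expected = LISTENER_SIGNALLED;

    return atomic_compare_exchange_strong(state, &expected,
                                          LISTENER_PROCESSING);
}

/*
 * Listener side: PROCESSING -> IDLE after reading the queue. Returns true
 * if a notifier re-signalled us meanwhile (state stays SIGNALLED), meaning
 * the listener has to run another round.
 */
static bool
listener_finish(_Atomic uint32_t *state)
{
    uint32_t    expected = LISTENER_PROCESSING;

    if (atomic_compare_exchange_strong(state, &expected, LISTENER_IDLE))
        return false;
    return expected == LISTENER_SIGNALLED;
}
```

With this shape, a second notifier that finds the state already SIGNALLED returns false and skips the wakeup, which is the only saving the withdrawn patches aimed for; every listener in the database is still visited.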
There seem to be two different scalability problems, which appear to be
orthogonal:
First, there is the thundering herd problem that I initially tried to
solve in this thread by introducing a hash table in shared memory that
keeps track of which backends listen on which channels, to avoid
immediately waking up all listening backends for every notification.
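A rough sketch of that lookup idea, assuming a small fixed-size array scanned linearly as a stand-in for the shared-memory hash table. The names (ChannelEntry, register_listener, notify_target, BROADCAST) are hypothetical, locking and UNLISTEN are left out, and any channel the table does not know about falls back to signalling everyone:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHANNELS 64
#define BROADCAST    (-1)       /* hypothetical: signal every listener */

typedef struct ChannelEntry
{
    bool        used;
    unsigned    dboid;          /* database OID */
    char        channel[64];    /* channel name */
    int         listener;       /* backend slot of the first listener */
    bool        multiple;       /* more than one backend listens here */
} ChannelEntry;

/* Zero-initialized, so every slot starts out unused */
static ChannelEntry channel_table[MAX_CHANNELS];

static ChannelEntry *
find_entry(unsigned dboid, const char *channel)
{
    assert(channel != NULL);

    for (int i = 0; i < MAX_CHANNELS; i++)
    {
        ChannelEntry *e = &channel_table[i];

        if (e->used && e->dboid == dboid && strcmp(e->channel, channel) == 0)
            return e;
    }
    return NULL;
}

/*
 * Record that 'backend' listens on (dboid, channel). Returns false if the
 * entry could not be stored; such channels are simply broadcast later.
 */
static bool
register_listener(unsigned dboid, const char *channel, int backend)
{
    ChannelEntry *e = find_entry(dboid, channel);

    if (e != NULL)
    {
        if (e->listener != backend)
            e->multiple = true;
        return true;
    }
    if (strlen(channel) >= sizeof(channel_table[0].channel))
        return false;

    for (int i = 0; i < MAX_CHANNELS; i++)
    {
        e = &channel_table[i];
        if (!e->used)
        {
            e->used = true;
            e->dboid = dboid;
            strcpy(e->channel, channel);
            e->listener = backend;
            e->multiple = false;
            return true;
        }
    }
    return false;               /* table full */
}

/* Which backend should a NOTIFY on (dboid, channel) wake up? */
static int
notify_target(unsigned dboid, const char *channel)
{
    ChannelEntry *e = find_entry(dboid, channel);

    if (e == NULL || e->multiple)
        return BROADCAST;
    return e->listener;
}
```

Falling back to BROADCAST whenever the table has no single-listener answer keeps the sketch safe when the table is full, at the cost of the old O(N) behaviour for those channels.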
Second, there is the heavyweight lock in PreCommit_Notify(), which
prevents parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.
My perf+pgbench experiments indicate that which of these two
scalability problems is the bottleneck depends on the workload.
I think the idea of keeping track of channels per backend has merit,
but I want to take a step back and see what others think about the idea first.
I guess my main question is whether we should fix one problem first and
then the other, both at the same time, or only one of them.
I've attached some benchmarks using pgbench and running postgres under
perf, which I hope can provide some insights.
/Joel
[1] https://www.postgresql.org/message-id/flat/a0b12a70-8200-4bd4-9e24-56796314bdce%40app.fastmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%4...
[3] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us
Attachments:
[text/markdown] listen_notify_pgbench_perf.md (18.4K, 2-listen_notify_pgbench_perf.md)
# LISTEN/NOTIFY scalability benchmark
## Table of Contents
- [Overview](#overview)
- [master (b5c53b4)](#master-b5c53b4)
  - [1 x NOTIFY channel_1 to 1 x LISTEN channel_1](#1-x-notify-channel_1-to-1-x-listen-channel_1)
  - [100 x NOTIFY channel_1 to 1 x LISTEN channel_1](#100-x-notify-channel_1-to-1-x-listen-channel_1)
  - [1 x NOTIFY channel_1 to 100 x LISTEN channel_1](#1-x-notify-channel_1-to-100-x-listen-channel_1)
  - [100 x NOTIFY channel_1 to 100 x LISTEN channel_1](#100-x-notify-channel_1-to-100-x-listen-channel_1)
  - [100 x NOTIFY channel_:client_id to 100 x LISTEN channel_:client_id](#100-x-notify-channel_client_id-to-100-x-listen-channel_client_id)
- [master (b5c53b4) without heavyweight lock](#master-b5c53b4-without-heavyweight-lock)
  - [1 x NOTIFY channel_1 to 1 x LISTEN channel_1](#1-x-notify-channel_1-to-1-x-listen-channel_1-1)
  - [100 x NOTIFY channel_1 to 1 x LISTEN channel_1](#100-x-notify-channel_1-to-1-x-listen-channel_1-1)
  - [1 x NOTIFY channel_1 to 100 x LISTEN channel_1](#1-x-notify-channel_1-to-100-x-listen-channel_1-1)
  - [100 x NOTIFY channel_1 to 100 x LISTEN channel_1](#100-x-notify-channel_1-to-100-x-listen-channel_1-1)
  - [100 x NOTIFY channel_:client_id to 100 x LISTEN channel_:client_id](#100-x-notify-channel_client_id-to-100-x-listen-channel_client_id-1)
- [Scripts](#scripts)
## Overview
The goal of this benchmark is to get a better understanding of
how {a single, a hundred} concurrent NOTIFY backends, in combination
with {a single, a hundred} concurrent LISTEN backends, affect the
pgbench tps, and to use perf to understand what the bottleneck is
in each workload scenario.
The benchmark was run on Ubuntu 24.04.2 LTS in a UTM
virtual machine on an Apple M3 Max with 128GB RAM.
In the perf results, a drill-down from PostgresMain is shown,
in which the largest branch is expanded down to the syscall,
to give an idea of what dominates.
## master (b5c53b4)
### 1 x NOTIFY channel_1 to 1 x LISTEN channel_1
```
$ ./listen_script 1
$ pgbench -f ~/notify_channel_1.sql -c 1 -j 1 -T 60 -n bench
tps = 11544.902331 (without initial connection time)
- 98.90% 0.25% postgres postgres [.] PostgresMain
- 98.64% PostgresMain
- 39.59% exec_simple_query
- 32.36% CommitTransactionCommand
- 32.20% CommitTransaction
- 23.15% AtCommit_Notify
+ 22.05% kill
+ 0.83% AllocSetAllocFromNewBlock
+ 3.74% PreCommit_Notify
+ 1.30% ResourceOwnerReleaseInternal
+ 0.67% XactLogCommitRecord
+ 2.83% pg_parse_query
+ 1.66% start_xact_command
0.53% PortalRun
+ 32.54% pq_getbyte
+ 24.39% socket_flush
0.54% SetCurrentStatementStartTimestamp
```
### 100 x NOTIFY channel_1 to 1 x LISTEN channel_1
```
$ ./listen_script 1
$ pgbench -f ~/notify_channel_1.sql -c 100 -j 100 -T 60 -n bench
tps = 7494.324353 (without initial connection time)
- 99.23% 0.21% postgres postgres [.] PostgresMain
- 99.02% PostgresMain
- 61.83% exec_simple_query
- 57.73% CommitTransactionCommand
- 57.61% CommitTransaction
- 27.40% ResourceOwnerReleaseInternal
- 27.17% ProcReleaseLocks
- 27.09% LockReleaseAll
- 21.82% ProcLockWakeup
+ 19.28% kill
0.89% LockCheckConflicts
+ 3.75% LWLockRelease
+ 14.03% AtCommit_Notify
+ 11.77% PreCommit_Notify
+ 2.31% XactLogCommitRecord
+ 1.76% pg_parse_query
+ 0.70% start_xact_command
+ 22.45% socket_flush
+ 13.29% pq_getbyte
```
### 1 x NOTIFY channel_1 to 100 x LISTEN channel_1
```
$ for n in `seq 1 100` ; do ./listen_script 1 ; done
$ pgbench -f ~/notify_channel_1.sql -c 1 -j 1 -T 60 -n bench
tps = 798.089837 (without initial connection time)
- 99.75% 0.02% postgres postgres [.] PostgresMain
- 99.73% PostgresMain
- 62.41% pq_getbyte
- pq_recvbuf
- 62.40% secure_read
- 42.07% ProcessClientReadInterrupt
- 41.90% ProcessNotifyInterrupt
- 34.08% socket_flush
- 34.03% internal_flush_buffer
- 34.02% secure_write
- 27.75% WaitEventSetWait
+ 18.20% epoll_pwait
+ 7.59% WaitEventSetWait
+ 1.79% drain
+ 5.81% __send
+ 3.02% CommitTransactionCommand
+ 2.90% asyncQueueReadAllNotifications
+ 1.44% StartTransactionCommand
+ 19.31% WaitEventSetWait
+ 0.82% recv
+ 36.48% exec_simple_query
+ 0.62% socket_flush
```
### 100 x NOTIFY channel_1 to 100 x LISTEN channel_1
```
$ for n in `seq 1 100` ; do ./listen_script 1 ; done
$ pgbench -f ~/notify_channel_1.sql -c 100 -j 100 -T 60 -n bench
tps = 1314.302478 (without initial connection time)
- 99.78% 0.02% postgres postgres [.] PostgresMain
- 99.76% PostgresMain
- 50.35% pq_getbyte
- 50.34% pq_recvbuf
- 50.34% secure_read
- 29.69% ProcessClientReadInterrupt
- 29.54% ProcessNotifyInterrupt
- 22.70% socket_flush
- 22.63% internal_flush_buffer
- 22.62% secure_write
- 17.92% WaitEventSetWait
+ 12.84% epoll_pwait
+ 3.88% WaitEventSetWait
+ 1.11% drain
+ 4.35% __send
+ 2.63% CommitTransactionCommand
+ 2.58% asyncQueueReadAllNotifications
+ 1.25% StartTransactionCommand
+ 19.25% WaitEventSetWait
+ 1.23% recv
+ 48.19% exec_simple_query
+ 0.94% socket_flush
```
### 100 x NOTIFY channel_:client_id to 100 x LISTEN channel_:client_id
```
$ for n in `seq 1 100` ; do ./listen_script $n ; done
$ pgbench -f ~/notify_channel_client_id.sql -c 100 -j 100 -T 60 -n bench
tps = 1419.322468 (without initial connection time)
- 99.81% 0.02% postgres postgres [.] PostgresMain
- 99.79% PostgresMain
- 50.42% pq_getbyte
- 50.41% pq_recvbuf
- 50.41% secure_read
- 35.41% WaitEventSetWait
+ 26.01% epoll_pwait
+ 7.12% WaitEventSetWait
+ 2.06% drain
+ 10.18% ProcessClientReadInterrupt
+ 4.39% recv
+ 48.15% exec_simple_query
+ 0.97% socket_flush
```
## master (b5c53b4) without heavyweight lock
This is just to give an idea of how the heavyweight lock affects the
scalability.
```diff
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb..47dfe42c9c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -919,8 +919,6 @@ PreCommit_Notify(void)
* (Historical note: before PG 9.0, a similar lock on "database 0" was
* used by the flatfiles mechanism.)
*/
- LockSharedObject(DatabaseRelationId, InvalidOid, 0,
- AccessExclusiveLock);
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
```
### 1 x NOTIFY channel_1 to 1 x LISTEN channel_1
```
$ ./listen_script 1
$ pgbench -f ~/notify_channel_1.sql -c 1 -j 1 -T 60 -n bench
tps = 11645.734899 (without initial connection time)
- 99.28% 0.24% postgres postgres [.] PostgresMain
- 99.04% PostgresMain
- 39.80% exec_simple_query
- 32.54% CommitTransactionCommand
- 32.38% CommitTransaction
- 23.53% AtCommit_Notify
+ 22.41% kill
+ 0.86% AllocSetAllocFromNewBlock
+ 3.57% PreCommit_Notify
+ 1.09% ResourceOwnerReleaseInternal
+ 0.73% XactLogCommitRecord
+ 2.66% pg_parse_query
+ 1.58% start_xact_command
0.59% PortalRun
0.57% CreatePortal
+ 32.42% pq_getbyte
+ 24.52% socket_flush
0.55% SetCurrentStatementStartTimestamp
```
### 100 x NOTIFY channel_1 to 1 x LISTEN channel_1
```
$ ./listen_script 1
$ pgbench -f ~/notify_channel_1.sql -c 100 -j 100 -T 60 -n bench
tps = 121615.034209 (without initial connection time)
- 99.81% 0.20% postgres postgres [.] PostgresMain
- 99.61% PostgresMain
- 67.21% exec_simple_query
- 57.33% CommitTransactionCommand
- 57.23% CommitTransaction
- 38.04% AtCommit_Notify
+ 34.87% kill
+ 1.22% LWLockRelease
+ 0.85% AllocSetAllocFromNewBlock
+ 0.76% LWLockAcquire
+ 9.06% PreCommit_Notify
+ 2.54% ResourceOwnerReleaseInternal
+ 1.53% XactLogCommitRecord
1.44% TransactionIdSetTreeStatus
+ 0.70% GetCurrentTransactionStopTimestamp
0.57% LWLockRelease
0.54% ProcArrayEndTransaction
+ 0.53% MemoryContextResetOnly
+ 4.81% pg_parse_query
+ 1.33% start_xact_command
0.85% CreatePortal
0.54% PortalRun
+ 16.14% socket_flush
+ 14.43% pq_getbyte
```
### 1 x NOTIFY channel_1 to 100 x LISTEN channel_1
```
$ for n in `seq 1 100` ; do ./listen_script 1 ; done
$ pgbench -f ~/notify_channel_1.sql -c 1 -j 1 -T 60 -n bench
tps = 801.370038 (without initial connection time)
- 99.79% 0.02% postgres postgres [.] PostgresMain
- 99.77% PostgresMain
- 62.94% pq_getbyte
- pq_recvbuf
- 62.94% secure_read
- 42.29% ProcessClientReadInterrupt
- 42.13% ProcessNotifyInterrupt
- 33.99% socket_flush
- 33.93% internal_flush_buffer
- 33.92% secure_write
- 27.73% WaitEventSetWait
+ 18.15% epoll_pwait
+ 7.59% WaitEventSetWait
+ 1.83% drain
+ 5.75% __send
+ 3.21% CommitTransactionCommand
+ 3.00% asyncQueueReadAllNotifications
+ 1.43% StartTransactionCommand
+ 19.59% WaitEventSetWait
+ 0.86% recv
+ 36.10% exec_simple_query
```
### 100 x NOTIFY channel_1 to 100 x LISTEN channel_1
```
$ for n in `seq 1 100` ; do ./listen_script 1 ; done
$ pgbench -f ~/notify_channel_1.sql -c 100 -j 100 -T 60 -n bench
tps = 4095.709407 (without initial connection time)
- 99.79% 0.05% postgres postgres [.] PostgresMain
- 99.73% PostgresMain
- 54.22% exec_simple_query
- 52.98% CommitTransactionCommand
- 52.95% CommitTransaction
- 50.49% AtCommit_Notify
+ 49.85% kill
1.17% PreCommit_Notify
+ 43.25% pq_getbyte
+ 1.75% socket_flush
```
### 100 x NOTIFY channel_:client_id to 100 x LISTEN channel_:client_id
```
$ for n in `seq 1 100` ; do ./listen_script $n ; done
$ pgbench -f ~/notify_channel_client_id.sql -c 100 -j 100 -T 60 -n bench
tps = 3354.541290 (without initial connection time)
- 99.87% 0.03% postgres postgres [.] PostgresMain
- 99.85% PostgresMain
- 62.30% exec_simple_query
- 61.72% CommitTransactionCommand
- 61.71% CommitTransaction
- 60.24% AtCommit_Notify
+ 59.83% kill
0.80% PreCommit_Notify
+ 36.09% pq_getbyte
+ 1.21% socket_flush
```
## Scripts
The following `expect` script was used to spawn LISTEN connections
that were kept open and ran SELECT 1 every second, so that they
received the async notifications, to make the test more realistic:
`listen_script`:
```
#!/usr/bin/expect -f
set timeout -1
log_user 0 ;# suppress stdout/stderr
if {$argc != 1} {
puts stderr "Usage: $argv0 <channel>"
exit 64 ;# EX_USAGE
}
set channel [lindex $argv 0]
if {[fork] != 0} { exit }
disconnect ;# stdio → /dev/null
spawn /home/joel/pg-debug/bin/psql -q bench
sleep 1
send "LISTEN channel_$channel;\r"
proc heartbeat {} {
if {[catch {send "SELECT 1;\r"}]} { exit 2 } ;# PTY gone → exit
after 1000 heartbeat
}
after 1000 heartbeat
while 1 {
expect {
eof { exit 0 }
"You are currently not connected to a database." { exit 1 }
-re {.*\r?\n} { exp_continue }
}
}
```
`pgbench` scripts:
notify_channel_1.sql:
```
NOTIFY channel_1;
```
notify_channel_client_id.sql:
```
NOTIFY channel_:client_id;
```
[application/pdf] listen_notify_pgbench_perf.pdf (170.5K, 3-listen_notify_pgbench_perf.pdf)
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-23 16:27 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-09-23 16:27 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
[ getting back to this... ]
"Joel Jacobson" <[email protected]> writes:
> I'm withdrawing the latest patches, since they won't fix the scalability
> problems, but only provide some performance improvements by eliminating
> redundant IPC signalling. This could also be improved outside of
> async.c, by optimizing ProcSignal [1] or removing ProcSignal as
> "Interrupts vs Signals" [2] is working on.
> There seem to be two different scalability problems that appear to be
> orthogonal:
> First, there is the thundering herd problem that I tried to solve initially
> in this thread, by introducing a hash table in shared memory, to keep
> track of what backends listen to what channels, to avoid immediate
> wakeup of all listening backends for every notification.
> Second, it's the heavyweight lock in PreCommit_Notify(), that prevents
> parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.
I concur that these are orthogonal issues, but I don't understand
why you withdrew your patches --- don't they constitute a solution
to the first scalability bottleneck?
> I guess my main question is if we think we should fix one problem first,
> then the other, both at the same time, or only one or the other?
I imagine we'd eventually want to fix both, but it doesn't have to
be done in the same patch.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-24 20:34 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-09-24 20:34 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Tue, Sep 23, 2025, at 18:27, Tom Lane wrote:
> I concur that these are orthogonal issues, but I don't understand
> why you withdrew your patches --- don't they constitute a solution
> to the first scalability bottleneck?
Thanks for getting back to this thread. I was unhappy with not finding a
solution that would improve all use-cases. I had a feeling it would be
possible to find one, and I think I've done so now.
>> I guess my main question is if we think we should fix one problem first,
>> then the other, both at the same time, or only one or the other?
>
> I imagine we'd eventually want to fix both, but it doesn't have to
> be done in the same patch.
I've attached a new patch with a new pragmatic approach that
specifically addresses the context switching cost.
The patch is based upon the assumption that some extra LISTEN/NOTIFY
latency would be acceptable to most users, as a trade-off, in order to
improve throughput.
One nice thing with this approach is that it has the potential to
improve throughput both for users with just a single listening backend,
and also for users with lots of listening backends.
More details in the commit message of the patch.
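To make the idea concrete, here is a minimal stand-alone sketch of the coalescing logic described in the commit message. It is not the patch itself: `ListenerSketch`, `sender_notify()` and `listener_wakeup()` are hypothetical names, time is passed in as plain milliseconds, and shared memory, locking and real signals are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one listener's state; not PostgreSQL's structs. */
typedef struct ListenerSketch
{
    bool    wakeup_pending;  /* set by a sender, cleared when the queue is read */
    int64_t last_start_ms;   /* start time of the last processing cycle */
    int64_t timer_due_ms;    /* -1 when no deferred wakeup is armed */
    int     signals_sent;    /* wakeups that senders actually issued */
    int     cycles;          /* number of times the queue was read */
} ListenerSketch;

void
listener_init(ListenerSketch *l)
{
    assert(l != NULL);
    l->wakeup_pending = false;
    l->last_start_ms = INT64_MIN / 2;   /* "long ago": first wakeup is never deferred */
    l->timer_due_ms = -1;
    l->signals_sent = 0;
    l->cycles = 0;
}

/* Sender side: returns true if a signal was issued, false if it was coalesced. */
bool
sender_notify(ListenerSketch *l)
{
    assert(l != NULL);
    if (l->wakeup_pending)
        return false;           /* a wakeup is already on its way */
    l->wakeup_pending = true;
    l->signals_sent++;
    return true;
}

/*
 * Listener side: returns true if the queue was read now, false if the work
 * was deferred and a timer armed instead. target_ms == 0 disables deferral.
 */
bool
listener_wakeup(ListenerSketch *l, int64_t now_ms, int64_t target_ms)
{
    assert(l != NULL);
    if (target_ms > 0 && now_ms - l->last_start_ms < target_ms)
    {
        /* Too soon: leave wakeup_pending set so senders keep coalescing. */
        l->timer_due_ms = now_ms + target_ms;
        return false;
    }
    l->last_start_ms = now_ms;
    l->wakeup_pending = false;
    l->timer_due_ms = -1;
    l->cycles++;
    return true;
}
```

With a 100 ms target, a wakeup at t=0 is processed immediately, a wakeup at t=10 is deferred to t=110, and every NOTIFY arriving in between is absorbed by the pending flag rather than causing another signal.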
Curious to hear thoughts on this approach.
/Joel
Attachments:
[application/octet-stream] 0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch (11.1K, 2-0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch)
download | inline diff:
From 5424a31351c83430eb6f93abd3dfcf936126e134 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 16 Aug 2025 19:28:18 +0200
Subject: [PATCH] LISTEN/NOTIFY: make the latency/throughput trade-off tunable
Background: Currently, listeners are signaled on every NOTIFY as soon as
possible. That minimizes perceived latency, but under bursty traffic it
leads to many redundant wakeups, heavy context switching, and degraded
throughput.
This patch adds listener-side wakeup coalescing controlled by a new GUC,
notify_latency_target. The setting defines the maximum additional
latency that is acceptable, allowing redundant wakeups to be coalesced
within the specified interval.
Each listener has a shared "wakeup pending" flag. Senders that observe
the flag is already set do nothing, effectively coalescing their NOTIFY
with the pending wakeup. The listener records the start time of each
processing cycle; if it is awakened again too soon, it defers work and
arms a timeout to re-awaken after the configured delay. The flag is
cleared when entering asyncQueueReadAllNotifications(). A new timeout
reason, NOTIFY_DEFERRED_WAKEUP_TIMEOUT, is registered at backend
startup.
This makes the inherent latency/throughput trade-off explicit and
administrator-controlled. Larger delays increase batching and reduce
wakeup churn, improving throughput at the cost of additional per-notify
latency; a delay of 0 preserves the previous behavior. Queue ordering,
visibility, and cross-database semantics are unchanged.
User-visible change: new GUC notify_latency_target (ms, default 0).
---
doc/src/sgml/config.sgml | 29 ++++++++++++
src/backend/commands/async.c | 47 ++++++++++++++++++-
src/backend/utils/init/postinit.c | 2 +
src/backend/utils/misc/guc_parameters.dat | 10 ++++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/async.h | 1 +
src/include/utils/timeout.h | 1 +
7 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e9b420f3ddb..f0156b52a0c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10267,6 +10267,35 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-notify-min-wakeup-delay" xreflabel="notify_latency_target">
+ <term><varname>notify_latency_target</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>notify_latency_target</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Sets the maximum acceptable additional latency for delivering
+ <command>LISTEN</command>/<command>NOTIFY</command>
+ notifications. During bursty periods, notifications that arrive
+ within this interval are coalesced and delivered together,
+ trading bounded extra latency for fewer wakeups and higher
+ throughput.
+ </para>
+
+ <para>
+ After a listening backend has been idle, the first
+ <command>NOTIFY</command> causes an immediate wakeup.
+ If additional notifications happen before
+ <varname>notify_latency_target</varname> has elapsed since the
+ start of that processing cycle, wakeup is deferred by one full
+ <varname>notify_latency_target</varname> interval from the point
+ of deferral. When that interval expires, the listening backend
+ wakes and catches up in a single wakeup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-bytea-output" xreflabel="bytea_output">
<term><varname>bytea_output</varname> (<type>enum</type>)
<indexterm>
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..3f4cef10bd9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -150,6 +150,7 @@
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "utils/timeout.h"
/*
@@ -246,6 +247,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending_flag; /* for listener wakeup throttling */
} QueueBackendStatus;
/*
@@ -293,6 +295,8 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+static TimestampTz last_wakeup_start_time = 0;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +305,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) \
+ (asyncQueueControl->backend[i].wakeup_pending_flag)
+
/*
* The SLRU buffer area through which we access the notification queue
@@ -423,6 +430,7 @@ static bool tryAdvanceTail = false;
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target;
/* For 8 KB pages this gives 8 GB of disk space */
int max_notify_queue_pages = 1048576;
@@ -527,6 +535,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) = false;
}
}
@@ -1603,7 +1612,18 @@ SignalBackends(void)
QueuePosition pos;
Assert(pid != InvalidPid);
+
+ /*
+ * If a wakeup is already pending for this listener, do nothing. The
+ * pending signal guarantees it will wake up and process all messages
+ * up to the current queue head, including the one we just wrote. This
+ * coalesces multiple wakeups into one.
+ */
+ if (QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i))
+ continue;
+
pos = QUEUE_BACKEND_POS(i);
+
if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
{
/*
@@ -1624,6 +1644,7 @@ SignalBackends(void)
continue;
}
/* OK, need to signal this one */
+ QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) = true;
pids[count] = pid;
procnos[count] = i;
count++;
@@ -1861,10 +1882,13 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ last_wakeup_start_time = GetCurrentTimestamp();
+
+ /* Fetch current state and clear wakeup-pending flag */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING_FLAG(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2189,6 +2213,27 @@ ProcessIncomingNotify(bool flush)
if (listenChannels == NIL)
return;
+ /*
+ * Throttling check: if we were last active too recently, defer. This
+ * check is safe without a lock because it's based on a backend-local
+ * timestamp.
+ */
+ if (notify_latency_target > 0 &&
+ !TimestampDifferenceExceeds(last_wakeup_start_time,
+ GetCurrentTimestamp(),
+ notify_latency_target))
+ {
+ /*
+ * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+ * or we wouldn't have been signaled) to tell senders we are
+ * intentionally delaying. Arm a timer to re-awaken and process the
+ * backlog later.
+ */
+ enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+ notify_latency_target);
+ return;
+ }
+
if (Trace_notify)
elog(DEBUG1, "ProcessIncomingNotify");
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 641e535a73c..4afd6eb7441 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -33,6 +33,7 @@
#include "catalog/pg_database.h"
#include "catalog/pg_db_role_setting.h"
#include "catalog/pg_tablespace.h"
+#include "commands/async.h"
#include "libpq/auth.h"
#include "libpq/libpq-be.h"
#include "mb/pg_wchar.h"
@@ -764,6 +765,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
RegisterTimeout(TRANSACTION_TIMEOUT, TransactionTimeoutHandler);
RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
+ RegisterTimeout(NOTIFY_DEFERRED_WAKEUP_TIMEOUT, HandleNotifyInterrupt);
RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
IdleStatsUpdateTimeoutHandler);
}
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..2b23a9520bf 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1567,6 +1567,16 @@
max => 'INT_MAX',
},
+{ name => 'notify_latency_target', type => 'int', context => 'PGC_SUSET', group => 'CLIENT_CONN_OTHER',
+ short_desc => 'Latency target for waking listeners to process NOTIFY.',
+ long_desc => 'First notify after idle wakes immediately; arrivals within the interval defer the next wakeup by one full interval and are coalesced. 0 disables.',
+ flags => 'GUC_UNIT_MS',
+ variable => 'notify_latency_target',
+ boot_val => '0',
+ min => '0',
+ max => 'INT_MAX',
+},
+
{ name => 'wal_decode_buffer_size', type => 'int', context => 'PGC_POSTMASTER', group => 'WAL_RECOVERY',
short_desc => 'Buffer size for reading ahead in the WAL during recovery.',
long_desc => 'Maximum distance to read ahead in the WAL to prefetch referenced data blocks.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c36fcb9ab61..ca8f6227b28 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#notify_latency_target = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..ed27456e487 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -16,6 +16,7 @@
#include <signal.h>
extern PGDLLIMPORT bool Trace_notify;
+extern PGDLLIMPORT int notify_latency_target;
extern PGDLLIMPORT int max_notify_queue_pages;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 7b19beafdc9..35cca7c06bf 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-25 08:25 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-09-25 08:25 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
Hi Joel,
Thanks for the patch. After reviewing it, I have a few comments.
> On Sep 25, 2025, at 04:34, Joel Jacobson <[email protected]> wrote:
>
>
> Curious to hear thoughts on this approach.
>
> /Joel
> <0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch>
1.
```
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
```
Can we define the new one after STARTUP_PROGRESS_TIMEOUT, to try to preserve the existing enum values?
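The concern can be illustrated with a small stand-alone sketch. The enums below are hypothetical and shortened; their member names only mirror the ones quoted above. They show that inserting a member in the middle renumbers everything after it, while inserting it later leaves the earlier values alone.

```c
#include <assert.h>

/* Hypothetical, shortened enums; only the relative order matters here. */
typedef enum BeforeSketch
{
    B_CLIENT_CONNECTION_CHECK_TIMEOUT,
    B_STARTUP_PROGRESS_TIMEOUT,
    B_USER_TIMEOUT
} BeforeSketch;

/* New member placed before STARTUP_PROGRESS_TIMEOUT, as in the first patch. */
typedef enum MiddleSketch
{
    M_CLIENT_CONNECTION_CHECK_TIMEOUT,
    M_NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
    M_STARTUP_PROGRESS_TIMEOUT,
    M_USER_TIMEOUT
} MiddleSketch;

/* New member placed after STARTUP_PROGRESS_TIMEOUT, as suggested. */
typedef enum AfterSketch
{
    A_CLIENT_CONNECTION_CHECK_TIMEOUT,
    A_STARTUP_PROGRESS_TIMEOUT,
    A_NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
    A_USER_TIMEOUT
} AfterSketch;

/* Placing the member later keeps STARTUP_PROGRESS_TIMEOUT at its old value. */
static_assert((int) A_STARTUP_PROGRESS_TIMEOUT == (int) B_STARTUP_PROGRESS_TIMEOUT,
              "value preserved when the new member comes later");

/* Placing it in the middle shifts STARTUP_PROGRESS_TIMEOUT by one. */
static_assert((int) M_STARTUP_PROGRESS_TIMEOUT == (int) B_STARTUP_PROGRESS_TIMEOUT + 1,
              "value shifted when the new member is inserted before it");
```

In this sketch the final member (the stand-in for USER_TIMEOUT) moves by one in both variants, so only the members ahead of the insertion point keep their values.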
2.
```
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#notify_latency_target = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
```
I think we should add one more tab so that the comment aligns with the comment on the previous line.
3.
```
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target;
```
I know the compiler will automatically initialize notify_latency_target to 0. But all the other global and static variables around it are explicitly initialized, so it would look better to assign 0 to it, which keeps the coding style consistent.
4.
```
+ /*
+ * Throttling check: if we were last active too recently, defer. This
+ * check is safe without a lock because it's based on a backend-local
+ * timestamp.
+ */
+ if (notify_latency_target > 0 &&
+ !TimestampDifferenceExceeds(last_wakeup_start_time,
+ GetCurrentTimestamp(),
+ notify_latency_target))
+ {
+ /*
+ * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+ * or we wouldn't have been signaled) to tell senders we are
+ * intentionally delaying. Arm a timer to re-awaken and process the
+ * backlog later.
+ */
+ enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+ notify_latency_target);
+ return;
+ }
+
```
Should we avoid enabling a duplicate timeout? Currently, whenever a duplicate notification is avoided, a new timeout is enabled. I think we can add another variable to remember whether a timeout has already been enabled.
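A rough sketch of what such a guard variable could look like, in isolation. Whether the patch actually needs it is a separate question; `DeferralSketch`, `defer_wakeup()` and `deferred_timeout_fired()` are hypothetical names, and the real `enable_timeout_after()` call is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend-local state for the suggested guard. */
typedef struct DeferralSketch
{
    bool timer_armed;   /* true while a deferred wakeup is outstanding */
    int  arm_calls;     /* stands in for calls to enable_timeout_after() */
} DeferralSketch;

/* Called at the point where the wakeup is deferred; arms at most one timer. */
bool
defer_wakeup(DeferralSketch *d)
{
    assert(d != NULL);
    if (d->timer_armed)
        return false;       /* a timer is already pending, do not arm another */
    d->timer_armed = true;
    d->arm_calls++;
    return true;
}

/* Called when the deferred-wakeup timeout fires. */
void
deferred_timeout_fired(DeferralSketch *d)
{
    assert(d != NULL);
    d->timer_armed = false;
}
```

The flag is purely backend-local in this sketch, so it needs no locking; it only has to be cleared on every path that consumes the timeout.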
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-25 21:13 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-09-25 21:13 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Thu, Sep 25, 2025, at 10:25, Chao Li wrote:
> Hi Joel,
>
> Thanks for the patch. After reviewing it, I got a few comments.
Thanks for reviewing!
>> On Sep 25, 2025, at 04:34, Joel Jacobson <[email protected]> wrote:
> 1.
...
> Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to
> preserve the existing enum value?
Fixed.
> 2.
...
> I think we should add one more tab so that the comment aligns with
> the comment on the previous line.
Fixed.
> 3.
...
> I know the compiler will automatically initialize notify_latency_target
> to 0. But all the other global and static variables around it are
> explicitly initialized, so it would look better to assign 0 to it, which
> keeps the coding style consistent.
Fixed.
> 4.
...
> Should we avoid enabling a duplicate timeout? Currently, whenever a
> duplicate notification is avoided, a new timeout is enabled. I think we
> can add another variable to remember whether a timeout has been enabled.
Hmm, I don't see how a duplicate timeout could happen.
Once we decide to defer the wakeup, wakeup_pending_flag remains set,
which prevents further signals from notifiers. So I don't see how we
could re-enter ProcessIncomingNotify(): notifyInterruptPending is reset
when ProcessIncomingNotify() is called, and it is only set when a signal
is received (or set directly when in the same process).
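The argument can be walked through with a tiny model. This is only a sketch under simplifying assumptions: `BackendSketch`, `notify()` and `service_interrupt()` are hypothetical names, the timeout firing and any other path that might set the interrupt flag are not modelled, and the "too soon" decision is passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two flags discussed above. */
typedef struct BackendSketch
{
    bool wakeup_pending_flag;   /* shared flag checked by notifiers */
    bool interrupt_pending;     /* stands in for notifyInterruptPending */
    int  timers_armed;          /* how often the deferral path armed a timer */
} BackendSketch;

/* A notifier only signals (sets interrupt_pending) if the flag is clear. */
void
notify(BackendSketch *b)
{
    assert(b != NULL);
    if (b->wakeup_pending_flag)
        return;
    b->wakeup_pending_flag = true;
    b->interrupt_pending = true;
}

/* Returns true if the handler ran; it runs only when an interrupt is pending. */
bool
service_interrupt(BackendSketch *b, bool too_soon)
{
    assert(b != NULL);
    if (!b->interrupt_pending)
        return false;
    b->interrupt_pending = false;
    if (too_soon)
    {
        b->timers_armed++;      /* defer: flag stays set, timer armed once */
        return true;
    }
    b->wakeup_pending_flag = false;     /* normal path: read queue, clear flag */
    return true;
}
```

In this model, once the deferral path has run, any number of further `notify()` calls leave `interrupt_pending` clear, so the handler is not entered again and `timers_armed` stays at 1 until something outside the model clears the flag.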
New patch attached with 1-3 fixed.
/Joel
Attachments:
[application/octet-stream] 0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-v2.patch (11.1K, 2-0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-v2.patch)
download | inline diff:
From 72a6252a504f0dc90aa1236a0bc8f560fb75a227 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 16 Aug 2025 19:28:18 +0200
Subject: [PATCH] LISTEN/NOTIFY: make the latency/throughput trade-off tunable
Background: Currently, listeners are signaled on every NOTIFY as soon as
possible. That minimizes perceived latency, but under bursty traffic it
leads to many redundant wakeups, heavy context switching, and degraded
throughput.
This patch adds listener-side wakeup coalescing controlled by a new GUC,
notify_latency_target. The setting defines the maximum additional
latency that is acceptable, allowing redundant wakeups to be coalesced
within the specified interval.
Each listener has a shared "wakeup pending" flag. Senders that observe
the flag is already set do nothing, effectively coalescing their NOTIFY
with the pending wakeup. The listener records the start time of each
processing cycle; if it is awakened again too soon, it defers work and
arms a timeout to re-awaken after the configured delay. The flag is
cleared when entering asyncQueueReadAllNotifications(). A new timeout
reason, NOTIFY_DEFERRED_WAKEUP_TIMEOUT, is registered at backend
startup.
This makes the inherent latency/throughput trade-off explicit and
administrator-controlled. Larger delays increase batching and reduce
wakeup churn, improving throughput at the cost of additional per-notify
latency; a delay of 0 preserves the previous behavior. Queue ordering,
visibility, and cross-database semantics are unchanged.
User-visible change: new GUC notify_latency_target (ms, default 0).
---
doc/src/sgml/config.sgml | 29 ++++++++++++
src/backend/commands/async.c | 47 ++++++++++++++++++-
src/backend/utils/init/postinit.c | 2 +
src/backend/utils/misc/guc_parameters.dat | 10 ++++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/commands/async.h | 1 +
src/include/utils/timeout.h | 1 +
7 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e9b420f3ddb..f0156b52a0c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10267,6 +10267,35 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-notify-min-wakeup-delay" xreflabel="notify_latency_target">
+ <term><varname>notify_latency_target</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>notify_latency_target</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Sets the maximum acceptable additional latency for delivering
+ <command>LISTEN</command>/<command>NOTIFY</command>
+ notifications. During bursty periods, notifications that arrive
+ within this interval are coalesced and delivered together,
+ trading bounded extra latency for fewer wakeups and higher
+ throughput.
+ </para>
+
+ <para>
+ After a listening backend has been idle, the first
+ <command>NOTIFY</command> causes an immediate wakeup.
+ If additional notifications happen before
+ <varname>notify_latency_target</varname> has elapsed since the
+ start of that processing cycle, wakeup is deferred by one full
+ <varname>notify_latency_target</varname> interval from the point
+ of deferral. When that interval expires, the listening backend
+ wakes and catches up in a single wakeup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-bytea-output" xreflabel="bytea_output">
<term><varname>bytea_output</varname> (<type>enum</type>)
<indexterm>
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..c2d97f731a7 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -150,6 +150,7 @@
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"
+#include "utils/timeout.h"
/*
@@ -246,6 +247,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending_flag; /* for listener wakeup throttling */
} QueueBackendStatus;
/*
@@ -293,6 +295,8 @@ typedef struct AsyncQueueControl
static AsyncQueueControl *asyncQueueControl;
+static TimestampTz last_wakeup_start_time = 0;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +305,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) \
+ (asyncQueueControl->backend[i].wakeup_pending_flag)
+
/*
* The SLRU buffer area through which we access the notification queue
@@ -423,6 +430,7 @@ static bool tryAdvanceTail = false;
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target = 0;
/* For 8 KB pages this gives 8 GB of disk space */
int max_notify_queue_pages = 1048576;
@@ -527,6 +535,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) = false;
}
}
@@ -1603,7 +1612,18 @@ SignalBackends(void)
QueuePosition pos;
Assert(pid != InvalidPid);
+
+ /*
+ * If a wakeup is already pending for this listener, do nothing. The
+ * pending signal guarantees it will wake up and process all messages
+ * up to the current queue head, including the one we just wrote. This
+ * coalesces multiple wakeups into one.
+ */
+ if (QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i))
+ continue;
+
pos = QUEUE_BACKEND_POS(i);
+
if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
{
/*
@@ -1624,6 +1644,7 @@ SignalBackends(void)
continue;
}
/* OK, need to signal this one */
+ QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) = true;
pids[count] = pid;
procnos[count] = i;
count++;
@@ -1861,10 +1882,13 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ last_wakeup_start_time = GetCurrentTimestamp();
+
+ /* Fetch current state and clear wakeup-pending flag */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING_FLAG(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2189,6 +2213,27 @@ ProcessIncomingNotify(bool flush)
if (listenChannels == NIL)
return;
+ /*
+ * Throttling check: if we were last active too recently, defer. This
+ * check is safe without a lock because it's based on a backend-local
+ * timestamp.
+ */
+ if (notify_latency_target > 0 &&
+ !TimestampDifferenceExceeds(last_wakeup_start_time,
+ GetCurrentTimestamp(),
+ notify_latency_target))
+ {
+ /*
+ * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+ * or we wouldn't have been signaled) to tell senders we are
+ * intentionally delaying. Arm a timer to re-awaken and process the
+ * backlog later.
+ */
+ enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+ notify_latency_target);
+ return;
+ }
+
if (Trace_notify)
elog(DEBUG1, "ProcessIncomingNotify");
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 641e535a73c..4afd6eb7441 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -33,6 +33,7 @@
#include "catalog/pg_database.h"
#include "catalog/pg_db_role_setting.h"
#include "catalog/pg_tablespace.h"
+#include "commands/async.h"
#include "libpq/auth.h"
#include "libpq/libpq-be.h"
#include "mb/pg_wchar.h"
@@ -764,6 +765,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
RegisterTimeout(TRANSACTION_TIMEOUT, TransactionTimeoutHandler);
RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
RegisterTimeout(CLIENT_CONNECTION_CHECK_TIMEOUT, ClientCheckTimeoutHandler);
+ RegisterTimeout(NOTIFY_DEFERRED_WAKEUP_TIMEOUT, HandleNotifyInterrupt);
RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
IdleStatsUpdateTimeoutHandler);
}
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..2b23a9520bf 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1567,6 +1567,16 @@
max => 'INT_MAX',
},
+{ name => 'notify_latency_target', type => 'int', context => 'PGC_SUSET', group => 'CLIENT_CONN_OTHER',
+ short_desc => 'Latency target for waking listeners to process NOTIFY.',
+ long_desc => 'First notify after idle wakes immediately; arrivals within the interval defer the next wakeup by one full interval and are coalesced. 0 disables.',
+ flags => 'GUC_UNIT_MS',
+ variable => 'notify_latency_target',
+ boot_val => '0',
+ min => '0',
+ max => 'INT_MAX',
+},
+
{ name => 'wal_decode_buffer_size', type => 'int', context => 'PGC_POSTMASTER', group => 'WAL_RECOVERY',
short_desc => 'Buffer size for reading ahead in the WAL during recovery.',
long_desc => 'Maximum distance to read ahead in the WAL to prefetch referenced data blocks.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c36fcb9ab61..fd2150b66f9 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#notify_latency_target = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/include/commands/async.h b/src/include/commands/async.h
index f75c3df9556..ed27456e487 100644
--- a/src/include/commands/async.h
+++ b/src/include/commands/async.h
@@ -16,6 +16,7 @@
#include <signal.h>
extern PGDLLIMPORT bool Trace_notify;
+extern PGDLLIMPORT int notify_latency_target;
extern PGDLLIMPORT int max_notify_queue_pages;
extern PGDLLIMPORT volatile sig_atomic_t notifyInterruptPending;
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 7b19beafdc9..ea720b05043 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -36,6 +36,7 @@ typedef enum TimeoutId
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
+ NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
/* First user-definable timeout reason */
USER_TIMEOUT,
/* Maximum number of timeout reasons */
--
2.50.1
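For readers following the throttling check that this patch adds to ProcessIncomingNotify(), the decision can be sketched in isolation. This is only an illustration, not code from the patch: timestamp_us and should_defer_wakeup are hypothetical names, and the comparison assumes that TimestampDifferenceExceeds() returns true once at least the given number of milliseconds has elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's TimestampTz (microseconds). */
typedef int64_t timestamp_us;

/*
 * Return true if the listener should defer processing: throttling is
 * enabled (target_ms > 0) and fewer than target_ms milliseconds have
 * passed since the last wakeup. Hypothetical helper, not patch code.
 */
static bool
should_defer_wakeup(timestamp_us last_wakeup, timestamp_us now, int target_ms)
{
    assert(target_ms >= 0);

    if (target_ms == 0)
        return false;           /* 0 disables throttling */

    return (now - last_wakeup) < (timestamp_us) target_ms * 1000;
}
```

When the helper returns true, the patch arms NOTIFY_DEFERRED_WAKEUP_TIMEOUT and returns early; otherwise it goes on to read the queue.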
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-26 02:26 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Chao Li @ 2025-09-26 02:26 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
> On Sep 26, 2025, at 05:13, Joel Jacobson <[email protected]> wrote:
>
> Hmm, I don't see how duplicate timeout could happen?
>
> Once we decide to defer the wakeup, wakeup_pending_flag remains set,
> which avoids further signals from notifiers, so I don't see how we could
> re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset
> when ProcessIncomingNotify() is called, and notifyInterruptPending is
> only set when a signal is received (or set directly when in same
> process).
>
I think your explanation is only partially correct.
Based on my understanding, any backend process may call SignalBackends(), so multiple backend processes may call it concurrently.
Looking at your code, there is a block of code (the “if-else”) between checking QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true. Multiple backend processes could therefore pass the QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple signals would be sent to the same process, and the receiver would enable the timeout more than once.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-26 09:32 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
From: Joel Jacobson @ 2025-09-26 09:32 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
> I think what you explained is partially correct.
>
> Based on my understanding, any backend process may call
> SignalBackends(), which means that it’s possible that multiple backend
> processes may call SignalBackends() concurrently.
>
> Looking at your code, between checking
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and set the flag to true, there is
> a block of code (the “if-else”) to run, so that it’s possible that
> multiple backend processes have passed the
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, then multiple signals will
> be sent to a process, which will lead to duplicate timeout enabled in
> the receiver process.
I don't see how that can happen; we're checking wakeup_pending_flag
while holding an exclusive lock, so I don't see how multiple backend
processes could be within the region where we check/set
wakeup_pending_flag, at the same time?
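To illustrate the point, here is a minimal sketch of a check-and-set performed inside one critical section. It is not the patch's code: a pthread mutex stands in for NotifyQueueLock held in exclusive mode, and claim_wakeup, clear_wakeup and MAX_LISTENERS are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_LISTENERS 8         /* hypothetical size, for the sketch only */

/* Stands in for NotifyQueueLock held in exclusive mode. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static bool wakeup_pending[MAX_LISTENERS];

/*
 * Decide whether the caller must signal listener i. The test and the set
 * happen in the same critical section, so among any number of concurrent
 * callers exactly one gets true until the listener clears the flag.
 */
static bool
claim_wakeup(int i)
{
    bool must_signal;

    assert(i >= 0 && i < MAX_LISTENERS);
    pthread_mutex_lock(&queue_lock);
    must_signal = !wakeup_pending[i];
    wakeup_pending[i] = true;
    pthread_mutex_unlock(&queue_lock);
    return must_signal;
}

/* Called by the listener when it starts handling its wakeup. */
static void
clear_wakeup(int i)
{
    assert(i >= 0 && i < MAX_LISTENERS);
    pthread_mutex_lock(&queue_lock);
    wakeup_pending[i] = false;
    pthread_mutex_unlock(&queue_lock);
}
```

A second claim_wakeup(i) before clear_wakeup(i) returns false, which is why a second signal, and hence a second timeout, should not occur as long as the check and the set are not separated by a lock release.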
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-26 09:44 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Chao Li @ 2025-09-26 09:44 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
> On Sep 26, 2025, at 17:32, Joel Jacobson <[email protected]> wrote:
>
> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>
>> I think what you explained is partially correct.
>>
>> Based on my understanding, any backend process may call
>> SignalBackends(), which means that it’s possible that multiple backend
>> processes may call SignalBackends() concurrently.
>>
>> Looking at your code, between checking
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and set the flag to true, there is
>> a block of code (the “if-else”) to run, so that it’s possible that
>> multiple backend processes have passed the
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, then multiple signals will
>> be sent to a process, which will lead to duplicate timeout enabled in
>> the receiver process.
>
> I don't see how that can happen; we're checking wakeup_pending_flag
> while holding an exclusive lock, so I don't see how multiple backend
> processes could be within the region where we check/set
> wakeup_pending_flag, at the same time?
>
> /Joel
I may have overlooked the fact that an exclusive lock is held. I will revisit that part.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-28 10:24 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
From: Joel Jacobson @ 2025-09-28 10:24 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
On Fri, Sep 26, 2025, at 11:44, Chao Li wrote:
>> On Sep 26, 2025, at 17:32, Joel Jacobson <[email protected]> wrote:
>>
>> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>>
>>> I think what you explained is partially correct.
>>>
>>> Based on my understanding, any backend process may call
>>> SignalBackends(), which means that it’s possible that multiple backend
>>> processes may call SignalBackends() concurrently.
>>>
>>> Looking at your code, between checking
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and set the flag to true, there is
>>> a block of code (the “if-else”) to run, so that it’s possible that
>>> multiple backend processes have passed the
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, then multiple signals will
>>> be sent to a process, which will lead to duplicate timeout enabled in
>>> the receiver process.
>>
>> I don't see how that can happen; we're checking wakeup_pending_flag
>> while holding an exclusive lock, so I don't see how multiple backend
>> processes could be within the region where we check/set
>> wakeup_pending_flag, at the same time?
>>
>> /Joel
>
> I might miss the factor of holding an exclusive lock. I will revisit
> that part again.
I've re-read this entire thread, and I actually think my original
approaches are more promising, that is, the
0001-optimize_listen_notify-v4.patch patch, which does targeted multicast
signaling.
Therefore, please consider the latest patch merely a PoC with some
possibly interesting ideas.
Before this patch, I had never used PostgreSQL's timeout mechanism,
so I didn't consider it when thinking about how to solve the
remaining problem with 0001-optimize_listen_notify-v4.patch, which
currently can't guarantee that all listening backends will eventually
catch up, since it only kicks one of the most lagging ones for each
notification. This could be a problem in practice if there is a long
period of time with no notifications coming in. Some listening
backends could then end up never being signaled and would stay behind,
preventing the queue tail from advancing.
Maybe we can somehow use the timeout mechanism here, but
I'm not sure how yet. Any ideas?
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-29 02:33 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Chao Li @ 2025-09-29 02:33 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Thomas Munro <[email protected]>; pgsql-hackers; Heikki Linnakangas <[email protected]>; Rishu Bagga <[email protected]>
> On Sep 28, 2025, at 18:24, Joel Jacobson <[email protected]> wrote:
>
>>
>> I might miss the factor of holding an exclusive lock. I will revisit
>> that part again.
>
> I've re-read this entire thread, and I actually think my original
> approaches are more promising, that is, the
> 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted
> signaling.
>
> Therefore, merely consider the latest patch as PoC with some possible
> interesting ideas.
>
> Before this patch, I had never used PostgreSQL's timeout mechanism
> before, so I didn't consider it when thinking about how to solve the
> remaining problems with 0001-optimize_listen_notify-v4.patch, which
> currently can't guarantee that all listening backends will eventually
> catch up, since it just kicks one of the most lagging ones, for each
> notification. This could be a problem in practise if there is a long
> period of time with no notifications coming in. Then some listening
> backends could end up not being signaled and would stay behind,
> preventing the queue tail from advancing.
>
> I'm thinking maybe somehow we can use the timeout mechanism here, but
> I'm not sure how yet. Any ideas?
>
> /Joel
Hi Joel,
I never had a concern about using the timeout mechanism. My comment was about enabling the timeout more than once.
I just revisited the code, and I now agree that I was overly worried because I failed to consider NotifyQueueLock. With the lock held, there is no race condition on a backend process’ QUEUE_BACKEND_WAKEUP_PENDING_FLAG, so duplicate signals should not be sent to the same backend process. In the backend process, “last_wakeup_start_time” then prevents a duplicate timeout within the configured period, and you reset last_wakeup_start_time in asyncQueueReadAllNotifications() together with clearing QUEUE_BACKEND_WAKEUP_PENDING_FLAG.
So, overall v2 looks good to me.
One last tiny comment is about the naming of last_wakeup_start_time. I think it could be renamed to “last_wakeup_time”, because the variable just records when asyncQueueReadAllNotifications() was last called; “start” does not seem to carry any meaning here.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-09-30 18:56 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
From: Joel Jacobson @ 2025-09-30 18:56 UTC (permalink / raw)
To: pgsql-hackers
On Mon, Sep 29, 2025, at 04:33, Chao Li wrote:
> I never had a concern about using the timeout mechanism. My comment was
> about enabling timeout duplicately.
Thanks for reviewing. However, as I said in my previous email, I'm
sorry, but I no longer believe in the throughput/latency approach I
suggested. I unfortunately managed to get derailed from the IMO more
promising approaches I worked on initially.
What I couldn't find a solution to back then was the problem of possibly
ending up in a situation where some lagging backends would never catch
up.
In this new patch, I've simply introduced a new bgworker, given the
specific task of kicking lagging backends. I wish of course we could do
without the bgworker, but I don't see how that would be possible.
---
optimize_listen_notify-v5.patch:
Fix LISTEN/NOTIFY so it scales with idle listening backends
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
To improve scalability with the number of idle listening backends, this
patch introduces a shared hash table to keep track of channels per
listening backend. This hash table is partitioned to reduce contention
on concurrent LISTEN/UNLISTEN operations.
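As a rough illustration of how a (dboid, channel) key can be mapped to one of a power-of-two number of partition locks, consider the sketch below. It makes assumptions: the FNV-1a hash is only a stand-in for whatever hash function the patch uses, and hash_channel and partition_for are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PARTITIONS 128      /* must be a power of two */

/* FNV-1a over the database OID bytes and the channel name (stand-in hash). */
static uint32_t
hash_channel(uint32_t dboid, const char *channel)
{
    uint32_t    h = 2166136261u;

    for (int shift = 0; shift < 32; shift += 8)
    {
        h ^= (dboid >> shift) & 0xff;
        h *= 16777619u;
    }
    for (const unsigned char *p = (const unsigned char *) channel; *p; p++)
    {
        h ^= *p;
        h *= 16777619u;
    }
    return h;
}

/* Pick the partition (and thus the lock) that guards this channel. */
static unsigned
partition_for(uint32_t dboid, const char *channel)
{
    /* Masking is only equivalent to modulo for a power-of-two count. */
    assert((NUM_PARTITIONS & (NUM_PARTITIONS - 1)) == 0);
    return hash_channel(dboid, channel) & (NUM_PARTITIONS - 1);
}
```

Two backends running LISTEN on channels that map to different partitions then never wait on the same lock, which is the contention reduction described above.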
We keep track of up to NOTIFY_MULTICAST_THRESHOLD (16) listeners per
channel. Benchmarks indicated diminishing gains above this level.
Setting it lower seems unnecessary, so a constant seemed fine; a GUC did
not seem warranted.
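The threshold behaviour can be sketched as follows. This is a simplified model and not the patch's code: ChannelSketch, channel_add_listener and channel_targets are hypothetical names, and the sketch assumes that up to 16 listeners are tracked and that the 17th flips the channel to broadcast, which is one possible reading of the boundary.

```c
#include <assert.h>
#include <stdbool.h>

#define MULTICAST_THRESHOLD 16

typedef struct ChannelSketch
{
    bool        is_broadcast;   /* stopped tracking individual listeners */
    int         num_listeners;  /* valid entries in listeners[] */
    int         listeners[MULTICAST_THRESHOLD];
} ChannelSketch;

/* Register a listener; switch to broadcast once the array is full. */
static void
channel_add_listener(ChannelSketch *ch, int procno)
{
    if (ch->is_broadcast)
        return;
    for (int i = 0; i < ch->num_listeners; i++)
        if (ch->listeners[i] == procno)
            return;             /* already registered */
    if (ch->num_listeners == MULTICAST_THRESHOLD)
    {
        ch->is_broadcast = true;    /* assumed boundary, see lead-in */
        ch->num_listeners = 0;
        return;
    }
    ch->listeners[ch->num_listeners++] = procno;
}

/*
 * Copy the backends to signal into out[] and return their count,
 * or return -1 to mean "signal every listening backend".
 */
static int
channel_targets(const ChannelSketch *ch, int *out)
{
    if (ch->is_broadcast)
        return -1;
    for (int i = 0; i < ch->num_listeners; i++)
        out[i] = ch->listeners[i];
    return ch->num_listeners;
}
```

A fixed-size array keeps every hash entry the same size, which fits the constant-threshold choice described above.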
This patch also adds a wakeup_pending flag to each backend's queue
status to avoid redundant signaling: a backend that already has a wakeup
pending is not signaled again. The flag is set when a backend is
signaled and cleared before processing the queue. This order is
important to ensure correctness.
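Why the order matters can be shown with a small single-threaded simulation of one interleaving. It is only a model under assumptions: locking is omitted, and ListenerSketch and lost_wakeup_after_race are hypothetical names. A sender appends a notification after the listener has taken its snapshot of the queue head but before the listener is done.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ListenerSketch
{
    bool        wakeup_pending;
    int         pos;            /* queue position read so far */
} ListenerSketch;

/*
 * Replay the race and report whether the listener ends up behind the
 * queue head with no wakeup pending, i.e. whether a wakeup was lost.
 */
static bool
lost_wakeup_after_race(bool clear_first)
{
    ListenerSketch l = {.wakeup_pending = true, .pos = 0};
    int         head = 1;       /* one notification queued, already signaled */
    int         snapshot;

    if (clear_first)
        l.wakeup_pending = false;   /* order described in the patch */

    snapshot = head;            /* listener decides how far to read */

    /* Concurrent sender: append, then signal only if nothing is pending. */
    head++;
    if (!l.wakeup_pending)
        l.wakeup_pending = true;

    l.pos = snapshot;           /* listener finishes reading */
    assert(l.pos <= head);

    if (!clear_first)
        l.wakeup_pending = false;   /* wrong order: clear after reading */

    return l.pos < head && !l.wakeup_pending;
}
```

With clear_first set, the late notification finds the flag cleared and triggers a fresh signal. With the opposite order, the sender skips the signal and the listener then clears the flag, leaving an unread notification that nobody will announce.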
It was also necessary to add a new bgworker, notify_bgworker, whose sole
responsibility is to wake up lagging listening backends, ensuring they
are kicked when they are about to fall too far behind. This bgworker is
always started at postmaster startup, but is only activated upon NOTIFY
by signaling it, unless it is already active. The notify_bgworker
staggers the signaling of lagging listening backends by sleeping 100 ms
between each signal, to prevent the thundering herd problem we would
otherwise get if all listening backends woke up at the same time. It
loops until there are no more lagging listening backends, and then
becomes inactive.
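One pass of such a staggered wakeup loop might look like the sketch below. It is a model, not the bgworker's code: BackendSketch and kick_lagging are hypothetical names, a signal is represented by setting wakeup_pending, and the delay is only accumulated in slept_ms rather than actually slept so that the example stays fast and portable.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct BackendSketch
{
    int         page;           /* queue page the backend has read up to */
    bool        wakeup_pending; /* already signaled, not yet processed */
} BackendSketch;

/*
 * Kick every backend that is at least lag_limit pages behind head_page
 * and has no wakeup pending, spacing consecutive kicks by stagger_ms.
 * Returns the number of backends kicked in this pass.
 */
static int
kick_lagging(BackendSketch *b, int n, int head_page, int lag_limit,
             long stagger_ms, long *slept_ms)
{
    int         kicked = 0;

    assert(lag_limit > 0 && stagger_ms >= 0);
    for (int i = 0; i < n; i++)
    {
        if (b[i].wakeup_pending || head_page - b[i].page < lag_limit)
            continue;
        if (kicked > 0)
            *slept_ms += stagger_ms;    /* real worker would sleep here */
        b[i].wakeup_pending = true;     /* real worker would send a signal */
        kicked++;
    }
    return kicked;
}
```

A caller would repeat the pass until it returns 0, matching the "loops until there are no more lagging listening backends" behaviour, with stagger_ms set to 100 and lag_limit playing the role of QUEUE_CLEANUP_DELAY.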
/Joel
Attachments:
[application/octet-stream] optimize_listen_notify-v5.patch (47.8K, 2-optimize_listen_notify-v5.patch)
From e5bd0959b756dd7e52ffcc1e0a7005ce27f9cabb Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 28 Sep 2025 14:53:57 +0200
Subject: [PATCH] Fix LISTEN/NOTIFY so it scales with idle listening backends
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
To improve scalability with the number of idle listening backends, this
patch introduces a shared hash table to keep track of channels per
listening backend. This hash table is partitioned to reduce contention
on concurrent LISTEN/UNLISTEN operations.
We keep track of up to NOTIFY_MULTICAST_THRESHOLD (16) listeners per
channel. Benchmarks indicated diminishing gains above this level.
Setting it lower seems unnecessary, so a constant seemed fine; a GUC did
not seem warranted.
This patch also adds a wakeup_pending flag to each backend's queue
status to avoid redundant signaling: a backend that already has a wakeup
pending is not signaled again. The flag is set when a backend is
signaled and cleared before processing the queue. This order is
important to ensure correctness.
It was also necessary to add a new bgworker, notify_bgworker, whose sole
responsibility is to wake up lagging listening backends, ensuring they
are kicked when they are about to fall too far behind. This bgworker is
always started at postmaster startup, but is only activated upon NOTIFY
by signaling it, unless it is already active. The notify_bgworker
staggers the signaling of lagging listening backends by sleeping 100 ms
between each signal, to prevent the thundering herd problem we would
otherwise get if all listening backends woke up at the same time. It
loops until there are no more lagging listening backends, and then
becomes inactive.
---
src/backend/commands/async.c | 882 +++++++++++++++++-
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/bgworker.c | 4 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/notify_bgworker.c | 225 +++++
src/backend/postmaster/postmaster.c | 6 +
.../utils/activity/wait_event_names.txt | 1 +
src/include/postmaster/notify_bgworker.h | 40 +
src/include/storage/lwlocklist.h | 1 +
9 files changed, 1122 insertions(+), 39 deletions(-)
create mode 100644 src/backend/postmaster/notify_bgworker.c
create mode 100644 src/include/postmaster/notify_bgworker.h
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..fd32e207408 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,12 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks listeners for each channel, up
+ * to NOTIFY_MULTICAST_THRESHOLD. When the number of listeners is below
+ * this threshold, we can perform a targeted "multicast" by signaling only
+ * those specific backends. If the number of listeners reaches or exceeds the
+ * threshold, we fall back to signaling all listening backends in the database.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +75,19 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which sends PROCSIG_NOTIFY_INTERRUPT signals to
+ * listening backends, and has two modes of operation:
+ * a) Multicast mode: For channels with a number of listeners not exceeding
+ * NOTIFY_MULTICAST_THRESHOLD, signals are sent only to those specific
+ * backends.
+ * b) Broadcast mode: If any channel being notified has more listeners than
+ * the threshold (or if the hash table runs out of shared memory for
+ * new entries), we signal every listening backend in the database.
+ *
+ * After sending immediate signals, SignalBackends() also triggers a deferred
+ * wakeup background worker (if not already active) that handles waking up
+ * backends that have fallen behind by QUEUE_CLEANUP_DELAY or more pages,
+ * using staggered delays to prevent thundering herd effects.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +138,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,6 +148,7 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "postmaster/notify_bgworker.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
@@ -146,6 +158,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +175,79 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Maximum number of listeners to track per channel for multicast signaling.
+ * When the number of listeners on a channel exceeds this threshold, NOTIFY
+ * will signal all listening backends rather than just those listening on the
+ * specific channel. Setting to 0 disables multicast signaling entirely.
+ */
+#define NOTIFY_MULTICAST_THRESHOLD 16
+
+/*
+ * Number of partitions for the channel hash table's locks.
+ * This must be a power of two.
+ */
+#define NUM_NOTIFY_PARTITIONS 128
+
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backends are
+ * listening on each channel, up to a certain threshold. Channels are
+ * identified by database OID and channel name, making them
+ * database-specific.
+ *
+ * To improve scalability of concurrent LISTEN/UNLISTEN operations, the hash
+ * table is partitioned, with each partition protected by its own LWLock.
+ * This avoids serializing all operations on a single global lock.
+ *
+ * When the number of backends listening on a channel is at or below
+ * NOTIFY_MULTICAST_THRESHOLD, we store their ProcNumbers and signal them
+ * directly (multicast).
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) More backends listen on the same channel than the threshold allows, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and an array
+ * of listening backend ProcNumbers, up to NOTIFY_MULTICAST_THRESHOLD. If the
+ * number of listeners exceeds the threshold, we mark the channel for
+ * broadcast and stop tracking individual listeners.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ bool is_broadcast; /* True if num_listeners >= threshold */
+ uint8 num_listeners; /* Number of listeners currently stored */
+ /* Listeners array follows, of size NOTIFY_MULTICAST_THRESHOLD */
+ ProcNumber listeners[FLEXIBLE_ARRAY_MEMBER];
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +313,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +332,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending;
} QueueBackendStatus;
/*
@@ -269,6 +356,11 @@ typedef struct QueueBackendStatus
* In order to avoid deadlocks, whenever we need multiple locks, we first get
* NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
+ * The channel hash table is protected by a separate set of partitioned
+ * locks. To prevent deadlocks between these and NotifyQueueLock, the global
+ * lock-ordering rule is: always acquire NotifyQueueLock *before* acquiring
+ * any channel hash partition lock.
+ *
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
*
@@ -288,11 +380,67 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ /* Deferred wakeup worker state */
+ bool deferredWakeupWorkerActive; /* is worker processing? */
+ pid_t deferredWakeupWorkerPid; /* PID of worker for signaling */
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+/* Locks for partitioned channel hash table */
+static LWLock *channelHashLocks;
+
+/* Channel hash table for multicast signalling */
+static HTAB *channelHash = NULL;
+
+/* Forward declaration needed by GetChannelHash */
+static uint32 channel_hash_func(const void *key, Size keysize);
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+ Size entrysize;
+
+ /*
+ * Set up to attach to the existing shared hash table. The hash
+ * control parameters must match those used in AsyncShmemInit.
+ */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+
+ /*
+ * The size of a channel entry is flexible. We must have enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(NOTIFY_MULTICAST_THRESHOLD, sizeof(ProcNumber)));
+ hash_ctl.entrysize = entrysize;
+
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +449,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -458,6 +607,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static LWLock *GetChannelHashLock(const char *channel);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -485,6 +642,7 @@ Size
AsyncShmemSize(void)
{
Size size;
+ Size entrysize;
/* This had better match AsyncShmemInit */
size = mul_size(MaxBackends, sizeof(QueueBackendStatus));
@@ -492,6 +650,18 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ /*
+ * The size of a channel entry is flexible. We must allocate enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(NOTIFY_MULTICAST_THRESHOLD, sizeof(ProcNumber)));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ entrysize));
+
+ /* Space for channel hash partition locks */
+ size = add_size(size, mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+
return size;
}
@@ -521,12 +691,15 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->deferredWakeupWorkerActive = false;
+ asyncQueueControl->deferredWakeupWorkerPid = 0;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -546,6 +719,48 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+ Size entrysize;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+
+ /*
+ * The size of a channel entry is flexible. We must have enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(NOTIFY_MULTICAST_THRESHOLD, sizeof(ProcNumber)));
+ hash_ctl.entrysize = entrysize;
+
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ /* Initialize locks for the partitioned hash table */
+ size = mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock));
+ channelHashLocks = (LWLock *)
+ ShmemInitStruct("Channel Hash Locks", size, &found);
+ if (!found)
+ {
+ /* First time through: initialize the locks */
+ for (int i = 0; i < NUM_NOTIFY_PARTITIONS; i++)
+ {
+ LWLockInitialize(&channelHashLocks[i],
+ LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ }
+ }
}
@@ -1152,6 +1367,8 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ ChannelHashAddListener(channel, MyProcNumber);
}
/*
@@ -1175,6 +1392,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel, MyProcNumber);
break;
}
}
@@ -1193,9 +1411,22 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /*
+ * Before freeing the local list, iterate through it and perform a
+ * targeted removal for each of our channels from the shared hash table.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1796,12 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Multicast mode: If all pending notification channels have listeners at or
+ * below NOTIFY_MULTICAST_THRESHOLD, we signal only those specific backends.
+ * 2. Broadcast mode: If any channel's listener count exceeds the threshold OR
+ * the hash table lacks memory for new entries, we signal all listening
+ * backends in our database.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1814,12 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool trigger_deferred_wakeup = false;
+ pid_t deferred_wakeup_pid = 0;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1831,149 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
+ /*
+ * To prevent deadlocks, we must always acquire locks in the same order:
+ * global NotifyQueueLock first, then individual partition locks.
+ */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+ /*
+ * Determine if we can use targeted signaling or must broadcast. This
+ * check must be done while holding NotifyQueueLock to prevent deadlocks
+ * against other backends that might be modifying the listener list and
+ * hash table simultaneously (e.g., asyncQueueUnregister).
+ */
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table. If the entry is
+ * marked for broadcast, we must use broadcast mode.
+ */
+ if (!entry || entry->is_broadcast)
+ {
+ broadcast_mode = true;
+ LWLockRelease(lock);
+ break;
+ }
+ LWLockRelease(lock);
+ }
+
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
/*
* Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * already caught up.
*/
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+ }
+ else
+ {
+ /*
+ * In multicast mode, signal specific listening backends. We must
+ * re-check the hash entries here inside the lock to avoid races.
+ */
+ foreach(p, channels)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ if (entry && !entry->is_broadcast)
+ {
+ for (int j = 0; j < entry->num_listeners; j++)
+ {
+ ProcNumber i = entry->listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+ LWLockRelease(lock);
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
+
+ /*
+ * Check if we should trigger the deferred wakeup worker after we're done
+ * sending immediate signals. We do this check while still holding the
+ * lock to avoid needing to reacquire it later.
+ */
+ if (!asyncQueueControl->deferredWakeupWorkerActive &&
+ asyncQueueControl->deferredWakeupWorkerPid != 0)
+ {
+ asyncQueueControl->deferredWakeupWorkerActive = true;
+ trigger_deferred_wakeup = true;
+ deferred_wakeup_pid = asyncQueueControl->deferredWakeupWorkerPid;
+ }
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1647,9 +1993,9 @@ SignalBackends(void)
/*
* Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * only occur if the target backend exited since we released the lock;
+ * which is unlikely but certainly possible. So we just log a
+ * low-level debug message if it happens.
*/
if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
@@ -1657,6 +2003,25 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
+
+ /*
+ * Trigger the deferred wakeup worker if needed. The worker will check for
+ * lagging backends and wake them up with staggered delays.
+ */
+ if (trigger_deferred_wakeup)
+ {
+ if (kill(deferred_wakeup_pid, SIGUSR1) < 0)
+ {
+ /* Worker might have died, clear the flags */
+ elog(WARNING, "could not signal deferred wakeup worker: %m");
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ asyncQueueControl->deferredWakeupWorkerActive = false;
+ asyncQueueControl->deferredWakeupWorkerPid = 0;
+ LWLockRelease(NotifyQueueLock);
+ }
+ }
}
/*
@@ -1865,6 +2230,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2761,441 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * channel_hash_func
+ * Custom hash function for the channel hash table. This function ensures
+ * that the low-order bits of the hash are well-distributed, which is
+ * critical for partitioned hash tables.
+ */
+static uint32
+channel_hash_func(const void *key, Size keysize)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ uint32 h;
+
+ /*
+ * Mix the dboid and the channel name to produce a good hash. hash_any()
+ * is a high-quality portable hash function. This prevents channels with
+ * the same name in different databases from always mapping to the same
+ * partition.
+ */
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * GetChannelHashLock
+ * Return the LWLock that protects the partition for the given channel name.
+ */
+static LWLock *
+GetChannelHashLock(const char *channel)
+{
+ ChannelHashKey key;
+ uint32 hash;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ hash = get_hash_value(GetChannelHash(), &key);
+
+ return &channelHashLocks[hash % NUM_NOTIFY_PARTITIONS];
+}
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel.
+ *
+ * This function uses an optimistic read-locking strategy to maximize
+ * concurrency. An exclusive lock is only taken when mutating the listener
+ * list.
+ *
+ * 1. It first takes a shared lock. If the channel is already in broadcast
+ * mode, or if the current backend is already in the listener list, no write
+ * is needed and we can return immediately.
+ *
+ * 2. If a write is needed, it releases the shared lock and acquires an
+ * exclusive lock.
+ *
+ * 3. CRUCIALLY, after acquiring the exclusive lock, it must re-check the
+ * state, as another backend may have modified the entry in the interim.
+ *
+ * 4. If the number of listeners is below NOTIFY_MULTICAST_THRESHOLD, the
+ * new listener is added. If adding it would exceed the threshold, the
+ * channel is converted to broadcast mode.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ /*
+ * If the threshold is zero, this optimization is disabled. All channels
+ * immediately use broadcast, so we don't need to track them.
+ */
+ if (NOTIFY_MULTICAST_THRESHOLD <= 0)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * FAST PATH: Optimistically take a shared lock. If the channel is already
+ * in broadcast mode, or if we are already listed, we are done.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (entry->is_broadcast)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ /* Check if we are already in the list */
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ }
+ }
+ LWLockRelease(lock);
+
+ /*
+ * SLOW PATH: We need to write. Acquire exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check state after acquiring exclusive lock, as it may have changed.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_ENTER_NULL, &found);
+
+ if (entry == NULL)
+ {
+ /* Out of memory in the hash partition. */
+ ereport(DEBUG1, (errmsg("too many notification channels are already being tracked")));
+ LWLockRelease(lock);
+ return;
+ }
+
+ if (!found)
+ {
+ /* First listener for this channel. */
+ entry->is_broadcast = false;
+ entry->num_listeners = 1;
+ entry->listeners[0] = procno;
+ }
+ else
+ {
+ /* Entry already exists, re-check everything. */
+ bool already_present = false;
+
+ if (entry->is_broadcast)
+ {
+ /* Another backend set it to broadcast mode. We're done. */
+ LWLockRelease(lock);
+ return;
+ }
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ already_present = true;
+ break;
+ }
+ }
+
+ if (!already_present)
+ {
+ if (entry->num_listeners < NOTIFY_MULTICAST_THRESHOLD)
+ {
+ /* Add ourselves to the list of listeners. */
+ entry->listeners[entry->num_listeners] = procno;
+ entry->num_listeners++;
+ }
+ else
+ {
+ /* We are the listener that exceeds the threshold. */
+ entry->is_broadcast = true;
+ entry->num_listeners = 0; /* Clear the list */
+ }
+ }
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * This function uses an optimistic read-lock strategy. An exclusive lock is
+ * only taken if we are in the listener list for a channel and need to remove
+ * ourselves. If a channel is in broadcast mode, we cannot safely modify it,
+ * as we can't know which backends are listening.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+ bool present = false;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Take a shared lock first to see if a removal is even possible. If the
+ * entry doesn't exist, is in broadcast mode, or we're not in its list, we
+ * have nothing to do. This is the fast path.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (!entry || entry->is_broadcast)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+
+ /* Check if we are in the list */
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ present = true;
+ break;
+ }
+ }
+ if (!present)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /* A removal is likely needed. Acquire an exclusive lock. */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check the state. Another backend might have changed it (e.g., to
+ * broadcast mode).
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && !entry->is_broadcast)
+ {
+ int i;
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ /*
+ * Found our procno. Remove it from the listener array.
+ *
+ * If this is the last listener, we remove the entire hash
+ * entry for the channel.
+ */
+ if (entry->num_listeners == 1)
+ {
+ (void) hash_search(GetChannelHash(), &key, HASH_REMOVE, NULL);
+ }
+ else
+ {
+ /*
+ * To remove an element from the array while keeping it
+ * contiguous, we first decrement the listener count.
+ * Then, we shift all subsequent elements one position to
+ * the left, overwriting the element we want to remove.
+ *
+ * The `if (i < entry->num_listeners)` condition
+ * explicitly handles the case where the last element in
+ * the array is being removed. In that scenario, `i`
+ * equals the new `num_listeners`, so no memory movement
+ * is necessary, and the `memmove` is correctly skipped.
+ */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ Size size_to_move;
+
+ size_to_move = mul_size(entry->num_listeners - i,
+ sizeof(ProcNumber));
+ memmove(&entry->listeners[i],
+ &entry->listeners[i + 1],
+ size_to_move);
+ }
+ }
+ break; /* Found and removed, exit loop. */
+ }
+ }
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if no hash entry exists for the channel. When an entry exists,
+ * the caller should check the is_broadcast field to determine if individual
+ * listeners are being tracked or if the channel uses broadcast mode.
+ *
+ * Caller must hold the appropriate partition lock (shared is sufficient).
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(GetChannelHashLock(channel)));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
+
+/*
+ * AsyncDeferredWakeupSetWorkerPid
+ * Store the PID of the deferred wakeup worker in shared memory
+ */
+void
+AsyncDeferredWakeupSetWorkerPid(pid_t pid)
+{
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ asyncQueueControl->deferredWakeupWorkerPid = pid;
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * AsyncDeferredWakeupClearActive
+ * Clear the active flag for the deferred wakeup worker
+ */
+void
+AsyncDeferredWakeupClearActive(void)
+{
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ asyncQueueControl->deferredWakeupWorkerActive = false;
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * AsyncGetLaggingBackends
+ * Get list of lagging listening backends that need to be woken up
+ *
+ * Returns a list of BackendWakeupInfo structs. The caller is responsible
+ * for freeing the list and its contents.
+ */
+List *
+AsyncGetLaggingBackends(void)
+{
+ List *lagging_backends = NIL;
+ QueuePosition head;
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ head = QUEUE_HEAD;
+
+ /* Iterate through all listening backends */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ QueuePosition pos;
+ int64 pageDiff;
+
+ /* Skip if wakeup is already pending */
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Calculate how far behind this backend is */
+ pageDiff = asyncQueuePageDiff(QUEUE_POS_PAGE(head), QUEUE_POS_PAGE(pos));
+
+ /* If backend is lagging by QUEUE_CLEANUP_DELAY or more pages */
+ if (pageDiff >= QUEUE_CLEANUP_DELAY)
+ {
+ BackendWakeupInfo *info;
+
+ info = (BackendWakeupInfo *) palloc(sizeof(BackendWakeupInfo));
+ info->pid = QUEUE_BACKEND_PID(i);
+ info->procno = i;
+
+ /* Mark as having wakeup pending */
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ lagging_backends = lappend(lagging_backends, info);
+ }
+ }
+
+ LWLockRelease(NotifyQueueLock);
+
+ return lagging_backends;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 0f4435d2d97..2ac4f3fd524 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -21,6 +21,7 @@ OBJS = \
fork_process.o \
interrupt.o \
launch_backend.o \
+ notify_bgworker.o \
pgarch.o \
pmchild.o \
postmaster.o \
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 1ad65c237c3..0946065895a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -18,6 +18,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/notify_bgworker.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/logicalworker.h"
@@ -132,6 +133,9 @@ static const struct
},
{
"TablesyncWorkerMain", TablesyncWorkerMain
+ },
+ {
+ "NotifyDeferredWakeupMain", NotifyDeferredWakeupMain
}
};
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 0008603cfee..c9d285570ae 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'fork_process.c',
'interrupt.c',
'launch_backend.c',
+ 'notify_bgworker.c',
'pgarch.c',
'pmchild.c',
'postmaster.c',
diff --git a/src/backend/postmaster/notify_bgworker.c b/src/backend/postmaster/notify_bgworker.c
new file mode 100644
index 00000000000..f0c5514cff7
--- /dev/null
+++ b/src/backend/postmaster/notify_bgworker.c
@@ -0,0 +1,225 @@
+/*-------------------------------------------------------------------------
+ *
+ * notify_bgworker.c
+ * Background worker for deferred wakeup of lagging LISTEN/NOTIFY backends
+ *
+ * This background worker is responsible for performing staggered wakeup of
+ * listening backends that have fallen behind in processing the notification
+ * queue. It runs continuously but only performs work when signaled by the
+ * main NOTIFY mechanism.
+ *
+ * The worker is triggered when SignalBackends() in async.c determines that
+ * there are lagging backends that need to be woken up. The worker then
+ * performs a staggered wakeup with delays between signals to avoid
+ * thundering herd effects.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/notify_bgworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "access/parallel.h"
+#include "commands/async.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/notify_bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shm_toc.h"
+#include "storage/shmem.h"
+#include "tcop/tcopprot.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+/* Configuration constants */
+#define NOTIFY_DEFERRED_WAKEUP_DELAY_MS 100 /* milliseconds between signals */
+
+/* Flag to indicate SIGUSR1 was received */
+static volatile sig_atomic_t got_sigusr1 = false;
+
+/* Forward declaration */
+static void ProcessDeferredWakeups(void);
+
+/* Signal handler for SIGUSR1 */
+static void
+notify_bgworker_sigusr1(SIGNAL_ARGS)
+{
+ int save_errno = errno;
+
+ got_sigusr1 = true;
+ SetLatch(MyLatch);
+
+ errno = save_errno;
+}
+
+/*
+ * NotifyDeferredWakeupMain
+ * Main entry point for the notify deferred wakeup background worker
+ */
+void
+NotifyDeferredWakeupMain(Datum main_arg)
+{
+ /* Establish signal handlers */
+ pqsignal(SIGUSR1, notify_bgworker_sigusr1);
+ pqsignal(SIGTERM, die);
+ BackgroundWorkerUnblockSignals();
+
+ /* Store our PID in shared memory for signaling */
+ AsyncDeferredWakeupSetWorkerPid(MyProcPid);
+
+ ereport(LOG,
+ (errmsg("notify deferred wakeup worker started")));
+
+ /* Main loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check for interrupts */
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Wait for signal to wake up. We use WL_LATCH_SET to wake on our
+ * latch being set, and WL_EXIT_ON_PM_DEATH to ensure we exit if the
+ * postmaster dies.
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+ -1,
+ WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP);
+
+ ResetLatch(MyLatch);
+
+ /* Emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ /* Process deferred wakeups if we were signaled */
+ if (got_sigusr1)
+ {
+ got_sigusr1 = false;
+ ProcessDeferredWakeups();
+ }
+ }
+}
+
+/*
+ * ProcessDeferredWakeups
+ * Wake up lagging listening backends with staggered delays
+ *
+ * This function continues processing until there are no more lagging
+ * backends, ensuring all backends eventually get woken up.
+ */
+static void
+ProcessDeferredWakeups(void)
+{
+ int total_wakeup_count = 0;
+
+ /*
+ * Continue processing until there are no more lagging backends. This
+ * ensures we handle all backends that need waking up, even if new ones
+ * become lagging while we're processing.
+ */
+ for (;;)
+ {
+ List *lagging_backends;
+ ListCell *lc;
+ int wakeup_count = 0;
+
+ /*
+ * Build list of lagging backends while holding the lock. We need to
+ * be quick here to avoid holding the lock for too long.
+ */
+ lagging_backends = AsyncGetLaggingBackends();
+
+ if (lagging_backends == NIL)
+ {
+ /* No more lagging backends, we're done */
+ break;
+ }
+
+ /* Now perform the staggered wakeup without holding the lock */
+ foreach(lc, lagging_backends)
+ {
+ BackendWakeupInfo *info = (BackendWakeupInfo *) lfirst(lc);
+
+ /* Send signal to the backend */
+ if (SendProcSignal(info->pid, PROCSIG_NOTIFY_INTERRUPT, info->procno) < 0)
+ {
+ /* Backend might have exited, just log and continue */
+ elog(WARNING, "could not signal backend with PID %d: %m", info->pid);
+ }
+ else
+ {
+ wakeup_count++;
+ total_wakeup_count++;
+ }
+
+ pfree(info);
+
+ /* Sleep between signals to avoid thundering herd */
+ if (lnext(lagging_backends, lc) != NULL)
+ {
+ pg_usleep(NOTIFY_DEFERRED_WAKEUP_DELAY_MS * 1000L);
+
+ /* Check for interrupts between wakeups */
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+
+ list_free(lagging_backends);
+
+ if (wakeup_count > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("notify deferred wakeup worker signaled %d lagging backends in this round",
+ wakeup_count)));
+ }
+ }
+
+ if (total_wakeup_count > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("notify deferred wakeup worker signaled %d lagging backends total",
+ total_wakeup_count)));
+ }
+
+ /* Clear the active flag to indicate we're done */
+ AsyncDeferredWakeupClearActive();
+}
+
+/*
+ * NotifyDeferredWakeupWorkerRegister
+ * Register the notify deferred wakeup background worker
+ */
+void
+NotifyDeferredWakeupWorkerRegister(void)
+{
+ BackgroundWorker worker;
+
+ memset(&worker, 0, sizeof(BackgroundWorker));
+ snprintf(worker.bgw_name, BGW_MAXLEN, "notify deferred wakeup");
+ snprintf(worker.bgw_type, BGW_MAXLEN, "notify deferred wakeup");
+ worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
+ worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+ worker.bgw_restart_time = BGW_DEFAULT_RESTART_INTERVAL;
+ snprintf(worker.bgw_library_name, MAXPGPATH, "postgres");
+ snprintf(worker.bgw_function_name, BGW_MAXLEN, "NotifyDeferredWakeupMain");
+ worker.bgw_main_arg = (Datum) 0;
+ worker.bgw_notify_pid = 0;
+
+ RegisterBackgroundWorker(&worker);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e1d643b013d..954c3b371c2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -102,6 +102,7 @@
#include "port/pg_bswap.h"
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/notify_bgworker.h"
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
@@ -929,6 +930,11 @@ PostmasterMain(int argc, char *argv[])
*/
ApplyLauncherRegister();
+ /*
+ * Register the notify deferred wakeup worker.
+ */
+ NotifyDeferredWakeupWorkerRegister();
+
/*
* process any libraries that should be preloaded at postmaster start
*/
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/postmaster/notify_bgworker.h b/src/include/postmaster/notify_bgworker.h
new file mode 100644
index 00000000000..5d8b98b82a6
--- /dev/null
+++ b/src/include/postmaster/notify_bgworker.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * notify_bgworker.h
+ * Deferred wakeup background worker for LISTEN/NOTIFY
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/postmaster/notify_bgworker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NOTIFY_BGWORKER_H
+#define NOTIFY_BGWORKER_H
+
+#include "storage/proc.h"
+
+/* Structure to hold information about a backend that needs to be woken up */
+typedef struct BackendWakeupInfo
+{
+ int32 pid;
+ ProcNumber procno;
+} BackendWakeupInfo;
+
+/* Wait event for the notify deferred wakeup worker */
+#define WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP PG_WAIT_EXTENSION
+
+
+/* Main entry point for the background worker */
+extern void NotifyDeferredWakeupMain(Datum main_arg);
+
+/* Registration function */
+extern void NotifyDeferredWakeupWorkerRegister(void);
+
+/* Functions to be implemented in async.c for worker interaction */
+extern void AsyncDeferredWakeupSetWorkerPid(pid_t pid);
+extern void AsyncDeferredWakeupClearActive(void);
+extern List *AsyncGetLaggingBackends(void);
+
+#endif /* NOTIFY_BGWORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-01 05:47 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-01 05:47 UTC (permalink / raw)
To: pgsql-hackers
On Tue, Sep 30, 2025, at 20:56, Joel Jacobson wrote:
> Attachments:
> * optimize_listen_notify-v5.patch
Changes since v5:
*) Added missing #include "nodes/pg_list.h" to fix List type error in headerscheck
*) Added NOTIFY_DEFERRED_WAKEUP_MAIN to wait_event_names.txt and renamed WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP to WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP_MAIN
/Joel
Attachments:
[application/octet-stream] optimize_listen_notify-v6.patch (48.4K, 2-optimize_listen_notify-v6.patch)
From 86c93ae51099567ffa712f8e7852263237c98e6c Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 28 Sep 2025 14:53:57 +0200
Subject: [PATCH] Fix LISTEN/NOTIFY so it scales with idle listening backends
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
To improve scalability with the number of idle listening backends, this
patch introduces a shared hash table to keep track of channels per
listening backend. This hash table is partitioned to reduce contention
on concurrent LISTEN/UNLISTEN operations.
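
A minimal sketch of how a partitioned table can map a (database OID, channel
name) key to one of a fixed number of locks. It is illustrative only: the
names sketch_fnv1a and sketch_partition_for are hypothetical, and FNV-1a is a
stand-in for the hash functions the patch actually uses (hash_uint32 and
hash_any). Only the constant 128 is taken from the patch
(NUM_NOTIFY_PARTITIONS).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_PARTITIONS 128	/* mirrors NUM_NOTIFY_PARTITIONS */

/* Masking with (N - 1) below is only valid when N is a power of two. */
static_assert((SKETCH_NUM_PARTITIONS & (SKETCH_NUM_PARTITIONS - 1)) == 0,
			  "partition count must be a power of two");

/* FNV-1a over a byte range; a stand-in hash, not what the patch uses. */
static uint32_t
sketch_fnv1a(uint32_t h, const unsigned char *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
	{
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Return the index of the partition lock covering this channel key. */
static unsigned
sketch_partition_for(uint32_t dboid, const char *channel)
{
	uint32_t	h = 2166136261u;	/* FNV offset basis */

	h = sketch_fnv1a(h, (const unsigned char *) &dboid, sizeof(dboid));
	h = sketch_fnv1a(h, (const unsigned char *) channel, strlen(channel));
	return h & (SKETCH_NUM_PARTITIONS - 1);
}
```

Two backends that LISTEN on channels falling into different partitions would
take different locks and so would not block each other.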
We keep track of up to NOTIFY_MULTICAST_THRESHOLD (16) listeners per
channel. Benchmarks indicated diminishing gains above this level, and
there seems to be no reason to set it lower, so a compile-time constant
is used rather than a GUC.
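
The per-channel bookkeeping described above can be sketched roughly as
follows. This is a simplified stand-in, not the patch's code: SketchEntry and
sketch_add_listener are hypothetical names, locking and the shared hash table
are omitted, and plain ints take the place of ProcNumber.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_THRESHOLD 16		/* mirrors NOTIFY_MULTICAST_THRESHOLD */

/* Hypothetical stand-in for a per-channel entry. */
typedef struct SketchEntry
{
	bool		is_broadcast;	/* listeners are no longer tracked */
	uint8_t		num_listeners;	/* valid entries in listeners[] */
	int			listeners[SKETCH_THRESHOLD];
} SketchEntry;

/*
 * Register procno as a listener. Returns true while the channel is still
 * tracked individually (multicast), false once it is in broadcast mode.
 */
static bool
sketch_add_listener(SketchEntry *e, int procno)
{
	assert(e->num_listeners <= SKETCH_THRESHOLD);

	if (e->is_broadcast)
		return false;

	/* Already registered: nothing to do. */
	for (int i = 0; i < e->num_listeners; i++)
		if (e->listeners[i] == procno)
			return true;

	if (e->num_listeners < SKETCH_THRESHOLD)
	{
		e->listeners[e->num_listeners++] = procno;
		return true;
	}

	/* The list is full: give up tracking and switch to broadcast. */
	e->is_broadcast = true;
	e->num_listeners = 0;
	return false;
}
```

With a zero-initialized entry, the first 16 distinct callers are stored and
the 17th flips the entry to broadcast mode.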
This patch also adds a wakeup_pending flag to each backend's queue
status, so that a backend is not signaled again while a wakeup is
already pending. The flag is set when a backend is signaled and cleared
before the backend starts processing the queue. This order is important
to ensure correctness.
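
A single-threaded sketch of the set-on-signal, clear-before-processing
protocol, under the assumption that both sides run while holding the same
lock (NotifyQueueLock in the patch; omitted here). SketchListener,
sketch_notify and sketch_process_queue are hypothetical names. The point of
the ordering, as far as this sketch goes, is that a notification appended
after the listener has read the head finds the flag already clear and so
triggers a fresh signal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one backend's queue status. */
typedef struct SketchListener
{
	bool		wakeup_pending; /* set by sender, cleared by listener */
	int			pos;			/* queue position read so far */
} SketchListener;

/*
 * Sender side: append one notification and decide whether a signal is
 * needed. Returns true if the caller should signal the listener.
 */
static bool
sketch_notify(SketchListener *l, int *queue_head)
{
	(*queue_head)++;
	if (l->wakeup_pending)
		return false;			/* a wakeup is already on its way */
	l->wakeup_pending = true;
	return true;
}

/*
 * Listener side: clear the flag first, then read up to the current head.
 */
static void
sketch_process_queue(SketchListener *l, const int *queue_head)
{
	assert(l->pos <= *queue_head);
	l->wakeup_pending = false;
	l->pos = *queue_head;
}
```

Two back-to-back notifications produce one signal; after the listener has
processed the queue, the next notification produces a signal again.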
It was also necessary to add a new bgworker, notify_bgworker, whose sole
responsibility is to wake up lagging listening backends, ensuring they
are kicked when they are about to fall too far behind. This bgworker is
always started at postmaster startup, but is only activated upon NOTIFY
by signaling it, unless it is already active. The notify_bgworker
staggers the signaling of lagging listening backends by sleeping 100 ms
between each signal, to prevent the thundering herd problem we would
otherwise get if all listening backends woke up at the same time. It
loops until there are no more lagging listening backends, and then
becomes inactive.
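
The staggering idea can be sketched as a single pass over a list of lagging
backends. This is an illustration under stated assumptions, not the worker's
code: sketch_staggered_wakeup, sketch_signal_fn and sketch_count_signal are
hypothetical, the signal is delivered through a callback instead of
SendProcSignal, and the real worker repeats the scan until no lagging
backends remain. The delay is a parameter here; the patch describes 100 ms.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Callback used to deliver one wakeup; hypothetical for this sketch. */
typedef void (*sketch_signal_fn) (int pid);

/* Counting callback, handy for trying the sketch without real signals. */
static int	sketch_signal_count = 0;

static void
sketch_count_signal(int pid)
{
	(void) pid;
	sketch_signal_count++;
}

/*
 * Wake each pid in turn, sleeping delay_ms between consecutive wakeups so
 * that the backends do not all become runnable at once. Returns the number
 * of wakeups sent.
 */
static int
sketch_staggered_wakeup(const int *pids, size_t n, long delay_ms,
						sketch_signal_fn signal_one)
{
	int			sent = 0;

	assert(signal_one != NULL);

	for (size_t i = 0; i < n; i++)
	{
		signal_one(pids[i]);
		sent++;

		/* No need to sleep after the last one. */
		if (i + 1 < n && delay_ms > 0)
		{
			struct timespec ts;

			ts.tv_sec = delay_ms / 1000;
			ts.tv_nsec = (delay_ms % 1000) * 1000000L;
			nanosleep(&ts, NULL);
		}
	}
	return sent;
}
```

Calling it with delay_ms set to 100 would spread three wakeups over roughly
200 ms; a delay of 0 sends them back to back.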
---
src/backend/commands/async.c | 882 +++++++++++++++++-
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/bgworker.c | 4 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/notify_bgworker.c | 225 +++++
src/backend/postmaster/postmaster.c | 6 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/postmaster/notify_bgworker.h | 37 +
src/include/storage/lwlocklist.h | 1 +
9 files changed, 1120 insertions(+), 39 deletions(-)
create mode 100644 src/backend/postmaster/notify_bgworker.c
create mode 100644 src/include/postmaster/notify_bgworker.h
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..fd32e207408 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,12 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * In addition to each backend maintaining its own list of channels, we also
+ * maintain a central hash table that tracks listeners for each channel, up
+ * to NOTIFY_MULTICAST_THRESHOLD. When the number of listeners does not
+ * exceed this threshold, we can perform a targeted "multicast" by signaling
+ * only those specific backends. If the number of listeners exceeds the
+ * threshold, we fall back to signaling all listening backends in the database.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +75,19 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which sends PROCSIG_NOTIFY_INTERRUPT signals to
+ * listening backends, and has two modes of operation:
+ * a) Multicast mode: For channels with a number of listeners not exceeding
+ * NOTIFY_MULTICAST_THRESHOLD, signals are sent only to those specific
+ * backends.
+ * b) Broadcast mode: If any channel being notified has more listeners than
+ * the threshold (or if the hash table runs out of shared memory for
+ * new entries), we signal every listening backend in the database.
+ *
+ * After sending immediate signals, SignalBackends() also triggers a deferred
+ * wakeup background worker (if not already active) that handles waking up
+ * backends that have fallen behind by QUEUE_CLEANUP_DELAY or more pages,
+ * using staggered delays to prevent thundering herd effects.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +138,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,6 +148,7 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "postmaster/notify_bgworker.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
@@ -146,6 +158,7 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc_hooks.h"
+#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/snapmgr.h"
@@ -162,6 +175,79 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Maximum number of listeners to track per channel for multicast signaling.
+ * When the number of listeners on a channel exceeds this threshold, NOTIFY
+ * will signal all listening backends rather than just those listening on the
+ * specific channel. Setting to 0 disables multicast signaling entirely.
+ */
+#define NOTIFY_MULTICAST_THRESHOLD 16
+
+/*
+ * Number of partitions for the channel hash table's locks.
+ * This must be a power of two.
+ */
+#define NUM_NOTIFY_PARTITIONS 128
+
+/*
+ * Channel hash table definitions
+ *
+ * This hash table provides an optimization by tracking which backends are
+ * listening on each channel, up to a certain threshold. Channels are
+ * identified by database OID and channel name, making them
+ * database-specific.
+ *
+ * To improve scalability of concurrent LISTEN/UNLISTEN operations, the hash
+ * table is partitioned, with each partition protected by its own LWLock.
+ * This avoids serializing all operations on a single global lock.
+ *
+ * When the number of backends listening on a channel is at or below
+ * NOTIFY_MULTICAST_THRESHOLD, we store their ProcNumbers and signal them
+ * directly (multicast).
+ *
+ * We fall back to broadcast mode and signal all listening backends when:
+ * 1) More backends listen on the same channel than the threshold allows, OR
+ * 2) The hash table runs out of shared memory for new entries
+ *
+ * Note that CHANNEL_HASH_MAX_SIZE is not a hard limit - the hash table can
+ * store more entries than this, but performance will degrade due to bucket
+ * overflow. The actual fallback to broadcast mode occurs only when shared
+ * memory is exhausted and we cannot allocate new hash entries.
+ *
+ * The maximum size (CHANNEL_HASH_MAX_SIZE) is based on the typical OS port
+ * range. This provides a reasonable upper bound for systems that use
+ * per-connection channels.
+ *
+ */
+#define CHANNEL_HASH_INIT_SIZE 256
+#define CHANNEL_HASH_MAX_SIZE 65535
+
+/*
+ * Key structure for the channel hash table.
+ * Channels are database-specific, so we need both the database OID
+ * and the channel name to uniquely identify a channel.
+ */
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Each entry contains a channel key (database OID + channel name) and an array
+ * of listening backend ProcNumbers, up to NOTIFY_MULTICAST_THRESHOLD. If the
+ * number of listeners exceeds the threshold, we mark the channel for
+ * broadcast and stop tracking individual listeners.
+ */
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ bool is_broadcast; /* True once listener count exceeded threshold */
+ uint8 num_listeners; /* Number of listeners currently stored */
+ /* Listeners array follows, of size NOTIFY_MULTICAST_THRESHOLD */
+ ProcNumber listeners[FLEXIBLE_ARRAY_MEMBER];
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +313,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +332,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending;
} QueueBackendStatus;
/*
@@ -269,6 +356,11 @@ typedef struct QueueBackendStatus
* In order to avoid deadlocks, whenever we need multiple locks, we first get
* NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
*
+ * The channel hash table is protected by a separate set of partitioned
+ * locks. To prevent deadlocks between these and NotifyQueueLock, the global
+ * lock-ordering rule is: always acquire NotifyQueueLock *before* acquiring
+ * any channel hash partition lock.
+ *
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
*
@@ -288,11 +380,67 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ /* Deferred wakeup worker state */
+ bool deferredWakeupWorkerActive; /* is worker processing? */
+ pid_t deferredWakeupWorkerPid; /* PID of worker for signaling */
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+/* Locks for partitioned channel hash table */
+static LWLock *channelHashLocks;
+
+/* Channel hash table for multicast signaling */
+static HTAB *channelHash = NULL;
+
+/* Forward declaration needed by GetChannelHash */
+static uint32 channel_hash_func(const void *key, Size keysize);
+
+/*
+ * GetChannelHash
+ * Get the channel hash table, initializing our backend's pointer if needed.
+ *
+ * This must be called before any access to the channel hash table.
+ * The hash table itself is created in shared memory during AsyncShmemInit,
+ * but each backend needs to get its own pointer to it.
+ */
+static HTAB *
+GetChannelHash(void)
+{
+ if (channelHash == NULL)
+ {
+ HASHCTL hash_ctl;
+ Size entrysize;
+
+ /*
+ * Set up to attach to the existing shared hash table. The hash
+ * control parameters must match those used in AsyncShmemInit.
+ */
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+
+ /*
+ * The size of a channel entry is flexible. We must have enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(NOTIFY_MULTICAST_THRESHOLD, sizeof(ProcNumber)));
+ hash_ctl.entrysize = entrysize;
+
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ return channelHash;
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +449,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -458,6 +607,14 @@ static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+/* Channel hash table management functions */
+static LWLock *GetChannelHashLock(const char *channel);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel, ProcNumber procno);
+static void ChannelHashRemoveListener(const char *channel, ProcNumber procno);
+static ChannelEntry *ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
+
/*
* Compute the difference between two queue page numbers.
* Previously this function accounted for a wraparound.
@@ -485,6 +642,7 @@ Size
AsyncShmemSize(void)
{
Size size;
+ Size entrysize;
/* This had better match AsyncShmemInit */
size = mul_size(MaxBackends, sizeof(QueueBackendStatus));
@@ -492,6 +650,18 @@ AsyncShmemSize(void)
size = add_size(size, SimpleLruShmemSize(notify_buffers, 0));
+ /*
+ * The size of a channel entry is flexible. We must allocate enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(NOTIFY_MULTICAST_THRESHOLD, sizeof(ProcNumber)));
+ size = add_size(size, hash_estimate_size(CHANNEL_HASH_MAX_SIZE,
+ entrysize));
+
+ /* Space for channel hash partition locks */
+ size = add_size(size, mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock)));
+
return size;
}
@@ -521,12 +691,15 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->deferredWakeupWorkerActive = false;
+ asyncQueueControl->deferredWakeupWorkerPid = 0;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -546,6 +719,48 @@ AsyncShmemInit(void)
*/
(void) SlruScanDirectory(NotifyCtl, SlruScanDirCbDeleteAll, NULL);
}
+
+ /*
+ * Create or attach to the channel hash table.
+ */
+ {
+ HASHCTL hash_ctl;
+ Size entrysize;
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ChannelHashKey);
+
+ /*
+ * The size of a channel entry is flexible. We must have enough space
+ * for the maximum number of listeners specified by the threshold.
+ */
+ entrysize = add_size(offsetof(ChannelEntry, listeners),
+ mul_size(NOTIFY_MULTICAST_THRESHOLD, sizeof(ProcNumber)));
+ hash_ctl.entrysize = entrysize;
+
+ hash_ctl.hash = channel_hash_func;
+ hash_ctl.num_partitions = NUM_NOTIFY_PARTITIONS;
+
+ channelHash = ShmemInitHash("Channel Hash",
+ CHANNEL_HASH_INIT_SIZE,
+ CHANNEL_HASH_MAX_SIZE,
+ &hash_ctl,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+ }
+
+ /* Initialize locks for the partitioned hash table */
+ size = mul_size(NUM_NOTIFY_PARTITIONS, sizeof(LWLock));
+ channelHashLocks = (LWLock *)
+ ShmemInitStruct("Channel Hash Locks", size, &found);
+ if (!found)
+ {
+ /* First time through: initialize the locks */
+ for (int i = 0; i < NUM_NOTIFY_PARTITIONS; i++)
+ {
+ LWLockInitialize(&channelHashLocks[i],
+ LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ }
+ }
}
@@ -1152,6 +1367,8 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ ChannelHashAddListener(channel, MyProcNumber);
}
/*
@@ -1175,6 +1392,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel, MyProcNumber);
break;
}
}
@@ -1193,9 +1411,22 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /*
+ * Before freeing the local list, iterate through it and perform a
+ * targeted removal for each of our channels from the shared hash table.
+ */
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel, MyProcNumber);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1796,12 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * This function operates in two modes:
+ * 1. Multicast mode: If all pending notification channels have listeners at or
+ * below NOTIFY_MULTICAST_THRESHOLD, we signal only those specific backends.
+ * 2. Broadcast mode: If any channel's listener count exceeds the threshold OR
+ * the hash table lacks memory for new entries, we signal all listening
+ * backends in our database.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1814,12 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ bool broadcast_mode = false;
+ bool trigger_deferred_wakeup = false;
+ pid_t deferred_wakeup_pid = 0;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,40 +1831,149 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ /* Get list of channels that have pending notifications */
+ channels = GetPendingNotifyChannels();
+
+ /*
+ * To prevent deadlocks, we must always acquire locks in the same order:
+ * global NotifyQueueLock first, then individual partition locks.
+ */
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+
+ /*
+ * Determine if we can use targeted signaling or must broadcast. This
+ * check must be done while holding NotifyQueueLock to prevent deadlocks
+ * against other backends that might be modifying the listener list and
+ * hash table simultaneously (e.g., asyncQueueUnregister).
+ */
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ /*
+ * If there is no entry, it could mean we ran out of shared memory
+ * when trying to add this channel to the hash table. If the entry is
+ * marked for broadcast, we must use broadcast mode.
+ */
+ if (!entry || entry->is_broadcast)
+ {
+ broadcast_mode = true;
+ LWLockRelease(lock);
+ break;
+ }
+ LWLockRelease(lock);
+ }
+
+ if (broadcast_mode)
+ {
+ /*
+ * In broadcast mode, we iterate over all listening backends and
+ * signal the ones in our database that are not already caught up.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
/*
* Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * already caught up.
*/
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+ }
+ else
+ {
+ /*
+ * In multicast mode, signal specific listening backends. We must
+ * re-check the hash entries here inside the lock to avoid races.
+ */
+ foreach(p, channels)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ LWLockAcquire(lock, LW_SHARED);
+ entry = ChannelHashLookup(channel);
+
+ if (entry && !entry->is_broadcast)
+ {
+ for (int j = 0; j < entry->num_listeners; j++)
+ {
+ ProcNumber i = entry->listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ if (signaled[i])
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ /* OK, need to signal this one */
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
+ }
+ LWLockRelease(lock);
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
+
+ /*
+ * Check if we should trigger the deferred wakeup worker after we're done
+ * sending immediate signals. We do this check while still holding the
+ * lock to avoid needing to reacquire it later.
+ */
+ if (!asyncQueueControl->deferredWakeupWorkerActive &&
+ asyncQueueControl->deferredWakeupWorkerPid != 0)
+ {
+ asyncQueueControl->deferredWakeupWorkerActive = true;
+ trigger_deferred_wakeup = true;
+ deferred_wakeup_pid = asyncQueueControl->deferredWakeupWorkerPid;
+ }
+
LWLockRelease(NotifyQueueLock);
/* Now send signals */
@@ -1647,9 +1993,9 @@ SignalBackends(void)
/*
* Note: assuming things aren't broken, a signal failure here could
- * only occur if the target backend exited since we released
- * NotifyQueueLock; which is unlikely but certainly possible. So we
- * just log a low-level debug message if it happens.
+ * only occur if the target backend exited since we released the lock;
+ * which is unlikely but certainly possible. So we just log a
+ * low-level debug message if it happens.
*/
if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
@@ -1657,6 +2003,25 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
+
+ /*
+ * Trigger the deferred wakeup worker if needed. The worker will check for
+ * lagging backends and wake them up with staggered delays.
+ */
+ if (trigger_deferred_wakeup)
+ {
+ if (kill(deferred_wakeup_pid, SIGUSR1) < 0)
+ {
+ /* Worker might have died, clear the flags */
+ elog(WARNING, "could not signal deferred wakeup worker: %m");
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ asyncQueueControl->deferredWakeupWorkerActive = false;
+ asyncQueueControl->deferredWakeupWorkerPid = 0;
+ LWLockRelease(NotifyQueueLock);
+ }
+ }
}
/*
@@ -1865,6 +2230,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2761,441 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * Channel hash table management functions
+ */
+
+/*
+ * channel_hash_func
+ * Custom hash function for the channel hash table. This function ensures
+ * that the low-order bits of the hash are well-distributed, which is
+ * critical for partitioned hash tables.
+ */
+static uint32
+channel_hash_func(const void *key, Size keysize)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ uint32 h;
+
+ /*
+ * Mix the dboid and the channel name to produce a good hash. hash_any()
+ * is a high-quality portable hash function. This prevents channels with
+ * the same name in different databases from always mapping to the same
+ * partition.
+ */
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * GetChannelHashLock
+ * Return the LWLock that protects the partition for the given channel name.
+ */
+static LWLock *
+GetChannelHashLock(const char *channel)
+{
+ ChannelHashKey key;
+ uint32 hash;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ hash = get_hash_value(GetChannelHash(), &key);
+
+ return &channelHashLocks[hash % NUM_NOTIFY_PARTITIONS];
+}
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key (database OID + channel name) for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register the given backend as a listener for the specified channel.
+ *
+ * This function uses an optimistic read-locking strategy to maximize
+ * concurrency. An exclusive lock is only taken when mutating the listener
+ * list.
+ *
+ * 1. It first takes a shared lock. If the channel is already in broadcast
+ * mode, or if the current backend is already in the listener list, no write
+ * is needed and we can return immediately.
+ *
+ * 2. If a write is needed, it releases the shared lock and acquires an
+ * exclusive lock.
+ *
+ * 3. CRUCIALLY, after acquiring the exclusive lock, it must re-check the
+ * state, as another backend may have modified the entry in the interim.
+ *
+ * 4. If the listener list is not yet full (fewer than
+ * NOTIFY_MULTICAST_THRESHOLD entries), the new listener is added. If the
+ * list is already full, the channel is converted to broadcast mode.
+ */
+static void
+ChannelHashAddListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ bool found;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+
+ /*
+ * If the threshold is zero, this optimization is disabled. All channels
+ * immediately use broadcast, so we don't need to track them.
+ */
+ if (NOTIFY_MULTICAST_THRESHOLD <= 0)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * FAST PATH: Optimistically take a shared lock. If the channel is already
+ * in broadcast mode, or if we are already listed, we are done.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (entry->is_broadcast)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ /* Check if we are already in the list */
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ }
+ }
+ LWLockRelease(lock);
+
+ /*
+ * SLOW PATH: We need to write. Acquire exclusive lock.
+ */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check state after acquiring exclusive lock, as it may have changed.
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_ENTER_NULL, &found);
+
+ if (entry == NULL)
+ {
+ /* Out of shared memory for a new hash entry. */
+ ereport(DEBUG1, (errmsg("too many notification channels are already being tracked")));
+ LWLockRelease(lock);
+ return;
+ }
+
+ if (!found)
+ {
+ /* First listener for this channel. */
+ entry->is_broadcast = false;
+ entry->num_listeners = 1;
+ entry->listeners[0] = procno;
+ }
+ else
+ {
+ /* Entry already exists, re-check everything. */
+ bool already_present = false;
+
+ if (entry->is_broadcast)
+ {
+ /* Another backend set it to broadcast mode. We're done. */
+ LWLockRelease(lock);
+ return;
+ }
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ already_present = true;
+ break;
+ }
+ }
+
+ if (!already_present)
+ {
+ if (entry->num_listeners < NOTIFY_MULTICAST_THRESHOLD)
+ {
+ /* Add ourselves to the list of listeners. */
+ entry->listeners[entry->num_listeners] = procno;
+ entry->num_listeners++;
+ }
+ else
+ {
+ /* We are the listener that exceeds the threshold. */
+ entry->is_broadcast = true;
+ entry->num_listeners = 0; /* Clear the list */
+ }
+ }
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Update the channel hash when a backend stops listening on a channel.
+ *
+ * This function uses an optimistic read-lock strategy. An exclusive lock is
+ * only taken if we are in the listener list for a channel and need to remove
+ * ourselves. If a channel is in broadcast mode, we cannot safely modify it,
+ * as we can't know which backends are listening.
+ */
+static void
+ChannelHashRemoveListener(const char *channel, ProcNumber procno)
+{
+ ChannelEntry *entry;
+ ChannelHashKey key;
+ LWLock *lock = GetChannelHashLock(channel);
+ bool present = false;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Take a shared lock first to see if a removal is even possible. If the
+ * entry doesn't exist, is in broadcast mode, or we're not in its list, we
+ * have nothing to do. This is the fast path.
+ */
+ LWLockAcquire(lock, LW_SHARED);
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (!entry || entry->is_broadcast)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+
+ /* Check if we are in the list */
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ present = true;
+ break;
+ }
+ }
+ if (!present)
+ {
+ LWLockRelease(lock);
+ return;
+ }
+ LWLockRelease(lock);
+
+ /* A removal is likely needed. Acquire an exclusive lock. */
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /*
+ * Re-check the state. Another backend might have changed it (e.g., to
+ * broadcast mode).
+ */
+ entry = (ChannelEntry *) hash_search(GetChannelHash(), &key, HASH_FIND, NULL);
+ if (entry && !entry->is_broadcast)
+ {
+ int i;
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (entry->listeners[i] == procno)
+ {
+ /*
+ * Found our procno. Remove it from the listener array.
+ *
+ * If this is the last listener, we remove the entire hash
+ * entry for the channel.
+ */
+ if (entry->num_listeners == 1)
+ {
+ (void) hash_search(GetChannelHash(), &key, HASH_REMOVE, NULL);
+ }
+ else
+ {
+ /*
+ * To remove an element from the array while keeping it
+ * contiguous, we first decrement the listener count.
+ * Then, we shift all subsequent elements one position to
+ * the left, overwriting the element we want to remove.
+ *
+ * The `if (i < entry->num_listeners)` condition
+ * explicitly handles the case where the last element in
+ * the array is being removed. In that scenario, `i`
+ * equals the new `num_listeners`, so no memory movement
+ * is necessary, and the `memmove` is correctly skipped.
+ */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ Size size_to_move;
+
+ size_to_move = mul_size(entry->num_listeners - i,
+ sizeof(ProcNumber));
+ memmove(&entry->listeners[i],
+ &entry->listeners[i + 1],
+ size_to_move);
+ }
+ }
+ break; /* Found and removed, exit loop. */
+ }
+ }
+ }
+ LWLockRelease(lock);
+}
+
+/*
+ * ChannelHashLookup
+ * Look up the channel hash entry for the given channel name in the
+ * current database.
+ *
+ * Returns NULL if no hash entry exists for the channel. When an entry exists,
+ * the caller should check the is_broadcast field to determine if individual
+ * listeners are being tracked or if the channel uses broadcast mode.
+ *
+ * Caller must hold the appropriate partition lock (shared is sufficient).
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+
+ Assert(LWLockHeldByMe(GetChannelHashLock(channel)));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ return (ChannelEntry *) hash_search(GetChannelHash(),
+ &key,
+ HASH_FIND,
+ NULL);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ /* Collect unique channel names from pending notifications */
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ /* Check if we already have this channel in our list */
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
+
+/*
+ * AsyncDeferredWakeupSetWorkerPid
+ * Store the PID of the deferred wakeup worker in shared memory
+ */
+void
+AsyncDeferredWakeupSetWorkerPid(pid_t pid)
+{
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ asyncQueueControl->deferredWakeupWorkerPid = pid;
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * AsyncDeferredWakeupClearActive
+ * Clear the active flag for the deferred wakeup worker
+ */
+void
+AsyncDeferredWakeupClearActive(void)
+{
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ asyncQueueControl->deferredWakeupWorkerActive = false;
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * AsyncGetLaggingBackends
+ * Get list of lagging listening backends that need to be woken up
+ *
+ * Returns a list of BackendWakeupInfo structs. The caller is responsible
+ * for freeing the list and its contents.
+ */
+List *
+AsyncGetLaggingBackends(void)
+{
+ List *lagging_backends = NIL;
+ QueuePosition head;
+
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ head = QUEUE_HEAD;
+
+ /* Iterate through all listening backends */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ QueuePosition pos;
+ int64 pageDiff;
+
+ /* Skip if wakeup is already pending */
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Calculate how far behind this backend is */
+ pageDiff = asyncQueuePageDiff(QUEUE_POS_PAGE(head), QUEUE_POS_PAGE(pos));
+
+ /* If backend is lagging by QUEUE_CLEANUP_DELAY or more pages */
+ if (pageDiff >= QUEUE_CLEANUP_DELAY)
+ {
+ BackendWakeupInfo *info;
+
+ info = (BackendWakeupInfo *) palloc(sizeof(BackendWakeupInfo));
+ info->pid = QUEUE_BACKEND_PID(i);
+ info->procno = i;
+
+ /* Mark as having wakeup pending */
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ lagging_backends = lappend(lagging_backends, info);
+ }
+ }
+
+ LWLockRelease(NotifyQueueLock);
+
+ return lagging_backends;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 0f4435d2d97..2ac4f3fd524 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -21,6 +21,7 @@ OBJS = \
fork_process.o \
interrupt.o \
launch_backend.o \
+ notify_bgworker.o \
pgarch.o \
pmchild.o \
postmaster.o \
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 1ad65c237c3..0946065895a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -18,6 +18,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/notify_bgworker.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/logicalworker.h"
@@ -132,6 +133,9 @@ static const struct
},
{
"TablesyncWorkerMain", TablesyncWorkerMain
+ },
+ {
+ "NotifyDeferredWakeupMain", NotifyDeferredWakeupMain
}
};
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index 0008603cfee..c9d285570ae 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'fork_process.c',
'interrupt.c',
'launch_backend.c',
+ 'notify_bgworker.c',
'pgarch.c',
'pmchild.c',
'postmaster.c',
diff --git a/src/backend/postmaster/notify_bgworker.c b/src/backend/postmaster/notify_bgworker.c
new file mode 100644
index 00000000000..08b135a05f2
--- /dev/null
+++ b/src/backend/postmaster/notify_bgworker.c
@@ -0,0 +1,225 @@
+/*-------------------------------------------------------------------------
+ *
+ * notify_bgworker.c
+ * Background worker for deferred wakeup of lagging LISTEN/NOTIFY backends
+ *
+ * This background worker is responsible for performing staggered wakeup of
+ * listening backends that have fallen behind in processing the notification
+ * queue. It runs continuously but only performs work when signaled by the
+ * main NOTIFY mechanism.
+ *
+ * The worker is triggered when SignalBackends() in async.c determines that
+ * there are lagging backends that need to be woken up. The worker then
+ * performs a staggered wakeup with delays between signals to avoid
+ * thundering herd effects.
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/notify_bgworker.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "access/parallel.h"
+#include "commands/async.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/notify_bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shm_toc.h"
+#include "storage/shmem.h"
+#include "tcop/tcopprot.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+/* Configuration constants */
+#define NOTIFY_DEFERRED_WAKEUP_DELAY_MS 100 /* milliseconds between signals */
+
+/* Flag to indicate SIGUSR1 was received */
+static volatile sig_atomic_t got_sigusr1 = false;
+
+/* Forward declaration */
+static void ProcessDeferredWakeups(void);
+
+/* Signal handler for SIGUSR1 */
+static void
+notify_bgworker_sigusr1(SIGNAL_ARGS)
+{
+ int save_errno = errno;
+
+ got_sigusr1 = true;
+ SetLatch(MyLatch);
+
+ errno = save_errno;
+}
+
+/*
+ * NotifyDeferredWakeupMain
+ * Main entry point for the notify deferred wakeup background worker
+ */
+void
+NotifyDeferredWakeupMain(Datum main_arg)
+{
+ /* Establish signal handlers */
+ pqsignal(SIGUSR1, notify_bgworker_sigusr1);
+ pqsignal(SIGTERM, die);
+ BackgroundWorkerUnblockSignals();
+
+ /* Store our PID in shared memory for signaling */
+ AsyncDeferredWakeupSetWorkerPid(MyProcPid);
+
+ ereport(LOG,
+ (errmsg("notify deferred wakeup worker started")));
+
+ /* Main loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check for interrupts */
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Wait for signal to wake up. We use WL_LATCH_SET to wake on our
+ * latch being set, and WL_EXIT_ON_PM_DEATH to ensure we exit if the
+ * postmaster dies.
+ */
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+ -1,
+ WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP_MAIN);
+
+ ResetLatch(MyLatch);
+
+ /* Emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
+
+ /* Process deferred wakeups if we were signaled */
+ if (got_sigusr1)
+ {
+ got_sigusr1 = false;
+ ProcessDeferredWakeups();
+ }
+ }
+}
+
+/*
+ * ProcessDeferredWakeups
+ * Wake up lagging listening backends with staggered delays
+ *
+ * This function continues processing until there are no more lagging
+ * backends, ensuring all backends eventually get woken up.
+ */
+static void
+ProcessDeferredWakeups(void)
+{
+ int total_wakeup_count = 0;
+
+ /*
+ * Continue processing until there are no more lagging backends. This
+ * ensures we handle all backends that need waking up, even if new ones
+ * become lagging while we're processing.
+ */
+ for (;;)
+ {
+ List *lagging_backends;
+ ListCell *lc;
+ int wakeup_count = 0;
+
+ /*
+ * Build list of lagging backends while holding the lock. We need to
+ * be quick here to avoid holding the lock for too long.
+ */
+ lagging_backends = AsyncGetLaggingBackends();
+
+ if (lagging_backends == NIL)
+ {
+ /* No more lagging backends, we're done */
+ break;
+ }
+
+ /* Now perform the staggered wakeup without holding the lock */
+ foreach(lc, lagging_backends)
+ {
+ BackendWakeupInfo *info = (BackendWakeupInfo *) lfirst(lc);
+
+ /* Send signal to the backend */
+ if (SendProcSignal(info->pid, PROCSIG_NOTIFY_INTERRUPT, info->procno) < 0)
+ {
+ /* Backend might have exited, just log and continue */
+ elog(WARNING, "could not signal backend with PID %d: %m", info->pid);
+ }
+ else
+ {
+ wakeup_count++;
+ total_wakeup_count++;
+ }
+
+ pfree(info);
+
+ /* Sleep between signals to avoid thundering herd */
+ if (lnext(lagging_backends, lc) != NULL)
+ {
+ pg_usleep(NOTIFY_DEFERRED_WAKEUP_DELAY_MS * 1000L);
+
+ /* Check for interrupts between wakeups */
+ CHECK_FOR_INTERRUPTS();
+ }
+ }
+
+ list_free(lagging_backends);
+
+ if (wakeup_count > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("notify deferred wakeup worker signaled %d lagging backends in this round",
+ wakeup_count)));
+ }
+ }
+
+ if (total_wakeup_count > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("notify deferred wakeup worker signaled %d lagging backends total",
+ total_wakeup_count)));
+ }
+
+ /* Clear the active flag to indicate we're done */
+ AsyncDeferredWakeupClearActive();
+}
+
+/*
+ * NotifyDeferredWakeupWorkerRegister
+ * Register the notify deferred wakeup background worker
+ */
+void
+NotifyDeferredWakeupWorkerRegister(void)
+{
+ BackgroundWorker worker;
+
+ memset(&worker, 0, sizeof(BackgroundWorker));
+ snprintf(worker.bgw_name, BGW_MAXLEN, "notify deferred wakeup");
+ snprintf(worker.bgw_type, BGW_MAXLEN, "notify deferred wakeup");
+ worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
+ worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+ worker.bgw_restart_time = BGW_DEFAULT_RESTART_INTERVAL;
+ snprintf(worker.bgw_library_name, MAXPGPATH, "postgres");
+ snprintf(worker.bgw_function_name, BGW_MAXLEN, "NotifyDeferredWakeupMain");
+ worker.bgw_main_arg = (Datum) 0;
+ worker.bgw_notify_pid = 0;
+
+ RegisterBackgroundWorker(&worker);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e1d643b013d..954c3b371c2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -102,6 +102,7 @@
#include "port/pg_bswap.h"
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
+#include "postmaster/notify_bgworker.h"
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
@@ -929,6 +930,11 @@ PostmasterMain(int argc, char *argv[])
*/
ApplyLauncherRegister();
+ /*
+ * Register the notify deferred wakeup worker.
+ */
+ NotifyDeferredWakeupWorkerRegister();
+
/*
* process any libraries that should be preloaded at postmaster start
*/
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..6c21c721835 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -61,6 +61,7 @@ IO_WORKER_MAIN "Waiting in main loop of IO Worker process."
LOGICAL_APPLY_MAIN "Waiting in main loop of logical replication apply process."
LOGICAL_LAUNCHER_MAIN "Waiting in main loop of logical replication launcher process."
LOGICAL_PARALLEL_APPLY_MAIN "Waiting in main loop of logical replication parallel apply process."
+NOTIFY_DEFERRED_WAKEUP_MAIN "Waiting in main loop of notify deferred wakeup process."
RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
REPLICATION_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker."
REPLICATION_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down."
@@ -366,6 +367,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/postmaster/notify_bgworker.h b/src/include/postmaster/notify_bgworker.h
new file mode 100644
index 00000000000..03f462d01b1
--- /dev/null
+++ b/src/include/postmaster/notify_bgworker.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * notify_bgworker.h
+ * Deferred wakeup background worker for LISTEN/NOTIFY
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/postmaster/notify_bgworker.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NOTIFY_BGWORKER_H
+#define NOTIFY_BGWORKER_H
+
+#include "nodes/pg_list.h"
+#include "storage/proc.h"
+
+/* Structure to hold information about a backend that needs to be woken up */
+typedef struct BackendWakeupInfo
+{
+ int32 pid;
+ ProcNumber procno;
+} BackendWakeupInfo;
+
+/* Main entry point for the background worker */
+extern void NotifyDeferredWakeupMain(Datum main_arg);
+
+/* Registration function */
+extern void NotifyDeferredWakeupWorkerRegister(void);
+
+/* Functions to be implemented in async.c for worker interaction */
+extern void AsyncDeferredWakeupSetWorkerPid(pid_t pid);
+extern void AsyncDeferredWakeupClearActive(void);
+extern List *AsyncGetLaggingBackends(void);
+
+#endif /* NOTIFY_BGWORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-02 16:39 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-02 16:39 UTC
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> Thanks for reviewing. However, like said in my previous email, I'm
> sorry, but don't believe in my suggested throughput/latency approach. I
> unfortunately managed to derail from the IMO more promising approaches I
> worked on initially.
> What I couldn't find a solution to then, was the problem of possibly
> ending up in a situation where some lagging backends would never catch
> up.
> In this new patch, I've simply introduced a new bgworker, given the
> specific task of kicking lagging backends. I wish of course we could do
> without the bgworker, but I don't see how that would be possible.
I don't understand why you feel you need a bgworker. The existing
code does not have any provision that guarantees a lost signal will
eventually be re-sent --- it will be if there is continuing NOTIFY
traffic, but not if all the senders suddenly go quiet. AFAIR
we've had zero complaints about that in 25+ years. So I'm perfectly
content to continue the approach of "check for laggards during
NOTIFY". (This could be gated behind an overall check on how long the
notify queue is, so that we don't expend the cycles when things are
performing as-expected.) If you feel that that's not robust enough,
you should split it out as a separate patch that's advertised as a
robustness improvement not a performance improvement, and see if you
can get anyone to bite.
The other thing I'm concerned about with this patch is the new shared
hash table. I don't think we have anywhere near a good enough fix on
how big it needs to be, and that is problematic because of the
frozen-at-startup size of main shared memory. We could imagine
inventing YA GUC to let the user tell us how big to make it,
but I think there is now a better way: use a dshash table
(src/backend/lib/dshash.c). That offers the additional win that we
don't have to create it at all in an installation that never uses
LISTEN/NOTIFY. We could also rethink whether we really need the
NOTIFY_MULTICAST_THRESHOLD limit: rather than having two code paths,
we could just say that all listeners are registered for every channel.
regards, tom lane
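
The "check for laggards during NOTIFY" idea, gated on overall queue length, can be sketched in isolation. This is a minimal illustration under stated assumptions, not the patch's code: `Listener`, `collect_laggards` and `LAG_THRESHOLD_PAGES` are hypothetical stand-ins (the threshold plays the role of QUEUE_CLEANUP_DELAY), page numbers are plain integers with no wraparound, and the tail is assumed to be the minimum of all listener positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold; stands in for QUEUE_CLEANUP_DELAY. */
#define LAG_THRESHOLD_PAGES 4

/* Hypothetical per-listener state; not the real QueueBackendStatus. */
typedef struct Listener
{
    int64_t read_page;      /* queue page this listener has read up to */
    bool    wakeup_pending; /* a signal was sent but not yet processed */
} Listener;

/*
 * Collect the indexes of listeners that lag the head by at least
 * LAG_THRESHOLD_PAGES and have no wakeup pending.  Selected listeners are
 * marked wakeup_pending so that a later call does not pick them again.
 *
 * The whole scan is skipped when the queue itself is short: if the tail is
 * the minimum of all read positions, no listener can lag by more than
 * head_page - tail_page.
 *
 * Returns the number of indexes written to out[], which must have room
 * for n entries.
 */
static size_t
collect_laggards(Listener *listeners, size_t n,
                 int64_t head_page, int64_t tail_page,
                 size_t *out)
{
    size_t count = 0;

    assert(head_page >= tail_page);

    /* Gate: spend no cycles while the queue is short. */
    if (head_page - tail_page < LAG_THRESHOLD_PAGES)
        return 0;

    for (size_t i = 0; i < n; i++)
    {
        if (listeners[i].wakeup_pending)
            continue;           /* already signaled, avoid a duplicate */

        if (head_page - listeners[i].read_page >= LAG_THRESHOLD_PAGES)
        {
            listeners[i].wakeup_pending = true;
            out[count++] = i;
        }
    }

    return count;
}
```

In this sketch the sender would call collect_laggards() while already holding the queue lock during NOTIFY, then signal the returned listeners after releasing it, so no separate worker process is involved.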
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-06 20:11 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-06 20:11 UTC
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Thu, Oct 2, 2025, at 18:39, Tom Lane wrote:
> I don't understand why you feel you need a bgworker. The existing
> code does not have any provision that guarantees a lost signal will
> eventually be re-sent --- it will be if there is continuing NOTIFY
> traffic, but not if all the senders suddenly go quiet. AFAIR
> we've had zero complaints about that in 25+ years. So I'm perfectly
> content to continue the approach of "check for laggards during
> NOTIFY". (This could be gated behind an overall check on how long the
> notify queue is, so that we don't expend the cycles when things are
> performing as-expected.) If you feel that that's not robust enough,
> you should split it out as a separate patch that's advertised as a
> robustness improvement not a performance improvement, and see if you
> can get anyone to bite.
Good point. I agree it's better to check for laggards during NOTIFY.
> The other thing I'm concerned about with this patch is the new shared
> hash table. I don't think we have anywhere near a good enough fix on
> how big it needs to be, and that is problematic because of the
> frozen-at-startup size of main shared memory. We could imagine
> inventing YA GUC to let the user tell us how big to make it,
> but I think there is now a better way: use a dshash table
> (src/backend/lib/dshash.c). That offers the additional win that we
> don't have to create it at all in an installation that never uses
> LISTEN/NOTIFY. We could also rethink whether we really need the
> NOTIFY_MULTICAST_THRESHOLD limit: rather than having two code paths,
> we could just say that all listeners are registered for every channel.
Thanks for the guidance; I didn't know about dshash.
The patch now uses dshash. I looked at the code in launcher.c while
implementing it. The function init_channel_hash() ended up being
very similar to launcher.c's logicalrep_launcher_attach_dshmem().
/Joel
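
The lazy create-or-attach pattern mentioned above can be sketched without any PostgreSQL headers. This is a single-process simulation under loud assumptions: `SharedControl`, `BackendLocal`, `lock_acquire`, `lock_release` and `init_table` are hypothetical names, the boolean "lock" only models mutual exclusion, and a plain pointer stands in for the DSA and dshash handles that the real code stores in shared memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the shared hash table itself. */
typedef struct SharedTable
{
    int nentries;
} SharedTable;

/* Stand-in for the shared control struct (one per cluster). */
typedef struct SharedControl
{
    bool         locked;        /* models an exclusive lock being held */
    SharedTable *handle;        /* NULL models an invalid table handle */
    int          create_calls;  /* how many times the table was created */
} SharedControl;

/* Stand-in for per-backend static variables. */
typedef struct BackendLocal
{
    SharedTable *table;         /* NULL until this backend has attached */
} BackendLocal;

static void
lock_acquire(SharedControl *ctl)
{
    assert(!ctl->locked);
    ctl->locked = true;
}

static void
lock_release(SharedControl *ctl)
{
    assert(ctl->locked);
    ctl->locked = false;
}

/*
 * Create the table on first use, or attach to the one another backend
 * already created.  Safe to call repeatedly.
 */
static void
init_table(SharedControl *ctl, BackendLocal *me)
{
    /* Quick exit if this backend is already attached. */
    if (me->table != NULL)
        return;

    /* The lock ensures that only one backend creates the table. */
    lock_acquire(ctl);

    if (ctl->handle == NULL)
    {
        /* First user: create the table and publish its handle. */
        me->table = calloc(1, sizeof(SharedTable));
        assert(me->table != NULL);
        ctl->handle = me->table;
        ctl->create_calls++;
    }
    else
    {
        /* Later users: attach through the published handle. */
        me->table = ctl->handle;
    }

    lock_release(ctl);
}
```

The point of the pattern is that an installation that never runs LISTEN never pays for the table, while every later backend converges on the same instance.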
Attachments:
[application/octet-stream] optimize_listen_notify-v7.patch (23.0K, 2-optimize_listen_notify-v7.patch)
From 15313fab42a9f9a3c80df146bf3c8443b7a8f7de Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 28 Sep 2025 14:53:57 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 501 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
3 files changed, 465 insertions(+), 38 deletions(-)
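
The targeted signaling described in the commit message can be illustrated with a small stand-alone sketch. It is not the patch's implementation: `Channel`, `pick_targets` and the fixed-size arrays are hypothetical, and a linear scan over channels stands in for the shared hash table lookup. It shows only the selection logic: visit the listeners of each channel with pending notifications, and use a per-backend wakeup_pending flag so that a backend listening on several notified channels is picked once.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_LISTENERS 4     /* hypothetical fixed capacity per channel */

/* Hypothetical channel record: a name plus the backends listening on it. */
typedef struct Channel
{
    const char *name;
    int         listeners[MAX_LISTENERS];  /* backend indexes */
    int         num_listeners;
} Channel;

/*
 * For every channel named in pending[], select its listeners for signaling.
 * A backend whose wakeup_pending flag is already set is skipped, which also
 * deduplicates backends that listen on more than one notified channel.
 *
 * Returns the number of backend indexes written to out[].
 */
static int
pick_targets(const Channel *channels, int nchannels,
             const char **pending, int npending,
             bool *wakeup_pending, int *out)
{
    int count = 0;

    for (int p = 0; p < npending; p++)
    {
        for (int c = 0; c < nchannels; c++)
        {
            /* Linear search; the real table would be a hash lookup. */
            if (strcmp(channels[c].name, pending[p]) != 0)
                continue;

            for (int j = 0; j < channels[c].num_listeners; j++)
            {
                int backend = channels[c].listeners[j];

                if (wakeup_pending[backend])
                    continue;   /* signal already sent or selected */

                wakeup_pending[backend] = true;
                out[count++] = backend;
            }
        }
    }

    return count;
}
```

Backends that listen only on channels without pending notifications are never touched, which is what keeps the cost independent of the number of idle listeners.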
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..b0819934d1e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +73,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listeners_array; /* DSA pointer to ProcNumber array */
+ int num_listeners; /* Number of listeners currently stored */
+ int allocated_listeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,92 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channel_dsa = NULL;
+static dshash_table *channel_hash = NULL;
+static dshash_hash channel_hash_func(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channel_dsh_params = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channel_hash_func,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channel_hash_func
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channel_hash_func(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * init_channel_hash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+init_channel_hash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channel_hash_dsh != DSHASH_HANDLE_INVALID &&
+ channel_hash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channel_hash_dsh == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channel_dsa = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channel_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_create(channel_dsa, &channel_dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channel_hash_dsa = dsa_get_handle(channel_dsa);
+ asyncQueueControl->channel_hash_dsh =
+ dshash_get_hash_table_handle(channel_hash);
+ }
+ else
+ {
+ /* Attach to existing dynamic shared hash table */
+ channel_dsa = dsa_attach(asyncQueueControl->channel_hash_dsa);
+ dsa_pin_mapping(channel_dsa);
+
+ channel_hash = dshash_attach(channel_dsa, &channel_dsh_params,
+ asyncQueueControl->channel_hash_dsh,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +416,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -457,6 +573,11 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel);
+static void ChannelHashRemoveListener(const char *channel);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +642,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channel_hash_dsa = DSA_HANDLE_INVALID;
+ asyncQueueControl->channel_hash_dsh = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1152,6 +1277,7 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+ ChannelHashAddListener(channel);
}
/*
@@ -1175,6 +1301,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel);
break;
}
}
@@ -1193,9 +1320,18 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1701,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1723,10 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,39 +1738,110 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ entry = ChannelHashLookup(channel);
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ /* Get the listener array from DSA */
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int j = 0; j < entry->num_listeners; j++)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+
+ dshash_release_lock(channel_hash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1657,6 +1872,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -1865,6 +2081,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2612,211 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register as a listener for the specified channel.
+ */
+static void
+ChannelHashAddListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listeners_array to InvalidDsaPointer as
+ * a marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channel_hash, &key, &found);
+
+ if (!found)
+ entry->listeners_array = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listeners_array))
+ {
+ /* First listener for this channel */
+ entry->listeners_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->num_listeners = 0;
+ entry->allocated_listeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channel_hash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
+
+ /* Add the new listener */
+ listeners[entry->num_listeners] = MyProcNumber;
+ entry->num_listeners++;
+
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Unregister as a listener for the specified channel.
+ */
+static void
+ChannelHashRemoveListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
+
+ if (channel_hash == NULL)
+ return;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channel_hash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ /* Found it, remove by shifting remaining elements */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->num_listeners - i));
+ }
+
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channel_hash, entry);
+ }
+ return;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashLookup
+ * Find the hash entry for a channel.
+ *
+ * Returns NULL if no hash entry exists for the channel.
+ * Caller must call dshash_release_lock() when done with the entry.
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *result;
+
+ /* Hash may not be initialized */
+ if (channel_hash == NULL)
+ return NULL;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ result = dshash_find(channel_hash, &key, false);
+
+ return result;
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
--
2.50.1
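A note for readers of the archive: the listener-array bookkeeping done by ChannelHashAddListener() and ChannelHashRemoveListener() in the patch above can be tried outside the server with the minimal sketch below. It is not part of the patch. ListenerSet, listener_set_add() and listener_set_remove() are hypothetical names, plain int ids stand in for ProcNumber, and malloc()/free() stand in for dsa_allocate()/dsa_free(); the dshash entry locking is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_LISTENERS 4		/* mirrors INITIAL_LISTENERS_ARRAY_SIZE */

/* Hypothetical stand-in for ChannelEntry, backed by malloc instead of DSA. */
typedef struct ListenerSet
{
	int		   *procs;			/* listener ids (ProcNumber in the patch) */
	int			num;			/* slots in use */
	int			allocated;		/* slots allocated */
} ListenerSet;

/* Returns 1 if added, 0 if already registered, -1 if out of memory. */
static int
listener_set_add(ListenerSet *set, int proc)
{
	if (set->procs == NULL)
	{
		/* First listener: allocate the initial array */
		set->procs = malloc(sizeof(int) * INITIAL_LISTENERS);
		if (set->procs == NULL)
			return -1;
		set->num = 0;
		set->allocated = INITIAL_LISTENERS;
	}

	for (int i = 0; i < set->num; i++)
	{
		if (set->procs[i] == proc)
			return 0;			/* already registered */
	}

	if (set->num >= set->allocated)
	{
		/* Grow by doubling, copy, then release the old array */
		int			new_size = set->allocated * 2;
		int		   *grown = malloc(sizeof(int) * new_size);

		if (grown == NULL)
			return -1;
		memcpy(grown, set->procs, sizeof(int) * set->num);
		free(set->procs);
		set->procs = grown;
		set->allocated = new_size;
	}

	set->procs[set->num++] = proc;
	return 1;
}

/* Returns 1 if removed, 0 if not present; frees the array when it empties. */
static int
listener_set_remove(ListenerSet *set, int proc)
{
	for (int i = 0; i < set->num; i++)
	{
		if (set->procs[i] != proc)
			continue;

		/* Close the gap by shifting the remaining elements down */
		set->num--;
		if (i < set->num)
			memmove(&set->procs[i], &set->procs[i + 1],
					sizeof(int) * (set->num - i));

		if (set->num == 0)
		{
			free(set->procs);
			set->procs = NULL;
			set->allocated = 0;
		}
		return 1;
	}
	return 0;
}
```

Removal is O(n) in the number of listeners on the channel, which seems acceptable as long as LISTEN/UNLISTEN are rare compared to NOTIFY; that trade-off is an assumption of this sketch, not a claim made in the patch.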
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-06 20:22 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-06 20:22 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Mon, Oct 6, 2025, at 22:11, Joel Jacobson wrote:
> The patch is now using dshash. I've been looking at code in launcher.c
> when implementing it. The function init_channel_hash() ended up being
> very similar to launcher.c's logicalrep_launcher_attach_dshmem().
Noticed a mistake on one line just after pressing send.
Sorry about that, new version attached.
/Joel
Attachments:
[application/octet-stream] optimize_listen_notify-v8.patch (23.0K, 2-optimize_listen_notify-v8.patch)
download | inline diff:
From 76a289dc25e2d0987bb2707a4d8392d2e29aa8a7 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 28 Sep 2025 14:53:57 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 500 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
3 files changed, 464 insertions(+), 38 deletions(-)
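Reader's note (not part of the patch): the targeted signaling and the wakeup_pending flag described in the commit message can be illustrated with the small stand-alone sketch below. Backend, Channel, collect_targets() and backend_process_wakeup() are hypothetical names; the real code works on QueueBackendStatus entries under NotifyQueueLock and sends PROCSIG_NOTIFY_INTERRUPT, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BACKENDS 8			/* illustrative; the patch sizes by MaxBackends */

/* Hypothetical per-backend state. */
typedef struct Backend
{
	bool		caught_up;		/* queue position already equals the head */
	bool		wakeup_pending; /* signaled earlier, not yet processed */
} Backend;

/* Hypothetical view of one hash entry: the listeners of a notified channel. */
typedef struct Channel
{
	const int  *listeners;
	int			num_listeners;
} Channel;

/*
 * Collect the backends to signal for the notified channels.  A backend is
 * picked at most once per call (signaled[]) and is skipped while an earlier
 * signal is still outstanding (wakeup_pending).  Returns the number of ids
 * written to out[].
 */
static int
collect_targets(Backend *backends, const Channel *channels, int n_channels,
				int *out)
{
	bool		signaled[MAX_BACKENDS];
	int			count = 0;

	memset(signaled, 0, sizeof(signaled));

	for (int c = 0; c < n_channels; c++)
	{
		for (int j = 0; j < channels[c].num_listeners; j++)
		{
			int			i = channels[c].listeners[j];

			if (signaled[i] || backends[i].wakeup_pending)
				continue;
			if (backends[i].caught_up)
				continue;

			backends[i].wakeup_pending = true;
			signaled[i] = true;
			out[count++] = i;
		}
	}
	return count;
}

/* What a listener does once it wakes up and reads the queue. */
static void
backend_process_wakeup(Backend *b)
{
	b->wakeup_pending = false;
	b->caught_up = true;
}
```

With two channels sharing a listener, the shared backend is collected once, and a second NOTIFY issued before anyone has woken up collects nobody, which is the redundant-signal saving the commit message refers to.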
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..681b951ce65 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +73,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listeners_array; /* DSA pointer to ProcNumber array */
+ int num_listeners; /* Number of listeners currently stored */
+ int allocated_listeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channel_dsa = NULL;
+static dshash_table *channel_hash = NULL;
+static dshash_hash channel_hash_func(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channel_dsh_params = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channel_hash_func,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channel_hash_func
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channel_hash_func(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * init_channel_hash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+init_channel_hash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channel_hash_dsh != DSHASH_HANDLE_INVALID &&
+ channel_hash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channel_hash_dsh == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channel_dsa = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channel_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_create(channel_dsa, &channel_dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channel_hash_dsa = dsa_get_handle(channel_dsa);
+ asyncQueueControl->channel_hash_dsh =
+ dshash_get_hash_table_handle(channel_hash);
+ }
+ else if (!channel_hash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channel_dsa = dsa_attach(asyncQueueControl->channel_hash_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_attach(channel_dsa, &channel_dsh_params,
+ asyncQueueControl->channel_hash_dsh,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -457,6 +572,11 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel);
+static void ChannelHashRemoveListener(const char *channel);
+static ChannelEntry * ChannelHashLookup(const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +641,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channel_hash_dsa = DSA_HANDLE_INVALID;
+ asyncQueueControl->channel_hash_dsh = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1152,6 +1276,7 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+ ChannelHashAddListener(channel);
}
/*
@@ -1175,6 +1300,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel);
break;
}
}
@@ -1193,9 +1319,18 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1700,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1722,10 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,39 +1737,110 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ entry = ChannelHashLookup(channel);
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ /* Get the listener array from DSA */
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int j = 0; j < entry->num_listeners; j++)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+
+ dshash_release_lock(channel_hash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1657,6 +1871,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -1865,6 +2080,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2611,211 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register as a listener for the specified channel.
+ */
+static void
+ChannelHashAddListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listeners_array to InvalidDsaPointer as
+ * a marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channel_hash, &key, &found);
+
+ if (!found)
+ entry->listeners_array = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listeners_array))
+ {
+ /* First listener for this channel */
+ entry->listeners_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->num_listeners = 0;
+ entry->allocated_listeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channel_hash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
+
+ /* Add the new listener */
+ listeners[entry->num_listeners] = MyProcNumber;
+ entry->num_listeners++;
+
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Unregister as a listener for the specified channel.
+ */
+static void
+ChannelHashRemoveListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
+
+ if (channel_hash == NULL)
+ return;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channel_hash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ /* Found it, remove by shifting remaining elements */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->num_listeners - i));
+ }
+
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channel_hash, entry);
+ }
+ return;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashLookup
+ * Find the hash entry for a channel.
+ *
+ * Returns NULL if no hash entry exists for the channel.
+ * Caller must call dshash_release_lock() when done with the entry.
+ */
+static ChannelEntry *
+ChannelHashLookup(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *result;
+
+ /* Hash may not be initialized */
+ if (channel_hash == NULL)
+ return NULL;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ result = dshash_find(channel_hash, &key, false);
+
+ return result;
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 05:39 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-07 05:39 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Mon, Oct 6, 2025, at 22:22, Joel Jacobson wrote:
> On Mon, Oct 6, 2025, at 22:11, Joel Jacobson wrote:
>> The patch is now using dshash. I've been looking at code in launcher.c
>> when implementing it. The function init_channel_hash() ended up being
>> very similar to launcher.c's logicalrep_launcher_attach_dshmem().
>
> Noticed a mistake on one line just after pressing send.
> Sorry about that, new version attached.
Trying to fix the NetBSD failure.
I don't understand why 001_constraint_validation, test 'list_parted2_def
scanned' and test 'part_5 verified by existing constraints' should be
affected by this patch. I guess I could have gotten something wrong with
the locking with dshash, that might somehow affect other tests?
I've changed the dshash_find() in SignalBackends from dshash_find(...,
false) to dshash_find(..., true), that is, to take an exclusive lock
instead. Not sure if this is necessary, since we're not modifying the
entry, but we're already holding an exclusive lock on NotifyQueueLock
here, so I don't think it should affect concurrency.
Any help on looking specifically at the dshash code would be much
appreciated, since I'm new to this interface.
/Joel
Attachments:
[application/octet-stream] optimize_listen_notify-v9.patch (22.5K, 2-optimize_listen_notify-v9.patch)
From 0bc4283121441a80c55a92c5b29fe8da9ed65ca8 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 28 Sep 2025 14:53:57 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 481 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
3 files changed, 445 insertions(+), 38 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..b619b01da62 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +73,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listeners_array; /* DSA pointer to ProcNumber array */
+ int num_listeners; /* Number of listeners currently stored */
+ int allocated_listeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channel_dsa = NULL;
+static dshash_table *channel_hash = NULL;
+static dshash_hash channel_hash_func(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channel_dsh_params = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channel_hash_func,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channel_hash_func
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channel_hash_func(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * init_channel_hash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+init_channel_hash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channel_hash_dsh != DSHASH_HANDLE_INVALID &&
+ channel_hash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channel_hash_dsh == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channel_dsa = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channel_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_create(channel_dsa, &channel_dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channel_hash_dsa = dsa_get_handle(channel_dsa);
+ asyncQueueControl->channel_hash_dsh =
+ dshash_get_hash_table_handle(channel_hash);
+ }
+ else if (!channel_hash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channel_dsa = dsa_attach(asyncQueueControl->channel_hash_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_attach(channel_dsa, &channel_dsh_params,
+ asyncQueueControl->channel_hash_dsh,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -457,6 +572,10 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel);
+static void ChannelHashRemoveListener(const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +640,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channel_hash_dsa = DSA_HANDLE_INVALID;
+ asyncQueueControl->channel_hash_dsh = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1152,6 +1275,7 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+ ChannelHashAddListener(channel);
}
/*
@@ -1175,6 +1299,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel);
break;
}
}
@@ -1193,9 +1318,18 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *p;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ foreach(p, listenChannels)
+ {
+ char *channel = (char *) lfirst(p);
+
+ ChannelHashRemoveListener(channel);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1699,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1721,10 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *p;
+ bool *signaled;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,39 +1736,118 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(p, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(p);
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ ChannelHashKey key;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channel_hash == NULL)
+ entry = NULL;
+ else
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ init_channel_hash();
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channel_hash, &key, true);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int j = 0; j < entry->num_listeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+
+ dshash_release_lock(channel_hash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far beind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1657,6 +1878,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2618,185 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey * key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register as a listener for the specified channel.
+ */
+static void
+ChannelHashAddListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listeners_array to InvalidDsaPointer as
+ * a marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channel_hash, &key, &found);
+
+ if (!found)
+ entry->listeners_array = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listeners_array))
+ {
+ /* First listener for this channel */
+ entry->listeners_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->num_listeners = 0;
+ entry->allocated_listeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channel_hash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
+
+ /* Add the new listener */
+ listeners[entry->num_listeners] = MyProcNumber;
+ entry->num_listeners++;
+
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Unregister as a listener for the specified channel.
+ */
+static void
+ChannelHashRemoveListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
+
+ if (channel_hash == NULL)
+ return;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channel_hash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ /* Found it, remove by shifting remaining elements */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->num_listeners - i));
+ }
+
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channel_hash, entry);
+ }
+ return;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 05:43 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-07 05:43 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> Trying to fix the NetBSD failure.
> I don't understand why 001_constraint_validation, test 'list_parted2_def
> scanned' and test 'part_5 verified by existing constraints' should be
> affected by this patch. I guess I could have gotten something wrong with
> the locking with dshash, that might somehow affect other tests?
Our CI infrastructure is not as stable as one could wish. You
sure this is related at all?
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 06:16 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-07 06:16 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Tue, Oct 7, 2025, at 07:43, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> Trying to fix the NetBSD failure.
>> I don't understand why 001_constraint_validation, test 'list_parted2_def
>> scanned' and test 'part_5 verified by existing constraints' should be
>> affected by this patch. I guess I could have gotten something wrong with
>> the locking with dshash, that might somehow affect other tests?
>
> Our CI infrastructure is not as stable as one could wish. You
> sure this is related at all?
No, not sure at all. OK, then going forward, I guess I should ignore
errors coming from just a single farm animal if the error seems
unrelated to my patch.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 12:40 Matheus Alcantara <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 2 replies; 120+ messages in thread
From: Matheus Alcantara @ 2025-10-07 12:40 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Mon Oct 6, 2025 at 5:11 PM -03, Joel Jacobson wrote:
> The patch is now using dshash. I've been looking at code in launcher.c
> when implementing it. The function init_channel_hash() ended up being
> very similar to launcher.c's logicalrep_launcher_attach_dshmem().
>
Hi,
This is not a complete review, I just read the v9 patch and summarized
some points.
1. You may want to add ChannelEntry and ChannelHashKey to typedefs.list
so that pgindent indents them correctly.
2. The ListCell* variables are normally named lc:
+ ListCell *p;
3. This block in ChannelHashRemoveListener() seems contradictory. You
return early if channel_hash == NULL and then call init_channel_hash(),
which itself returns early if channel_hash != NULL. So once we know that
channel_hash != NULL, I don't think we need to call init_channel_hash()?
+ if (channel_hash == NULL)
+ return;
+
+ init_channel_hash();
A similar check also exists in SignalBackends():
if (channel_hash == NULL)
...
else
{
// channel_hash is != NULL, so init_channel_hash will early
// return.
init_channel_hash();
...
}
4. The ChannelHashRemoveListener() release lock logic could be
simplified to something like the following, what do you think?
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ break;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
5. You may want to use list_member() in GetPendingNotifyChannels() to
avoid the inner loop that checks for duplicate channel names.
6. s/beind/behind
+ /* Need to signal if a backend has fallen too far beind */
7. I'm wondering if we could add some TAP tests for this? I think it
would be worth adding a case to ensure that we can grow the dshash
correctly and that we handle multiple backends listening on the same
channel properly. This CF entry [1] has some examples of how TAP tests
can be created to test LISTEN/NOTIFY.
[1] https://commitfest.postgresql.org/patch/6095/
--
Matheus Alcantara
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 16:51 Tom Lane <[email protected]>
parent: Matheus Alcantara <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-07 16:51 UTC (permalink / raw)
To: Matheus Alcantara <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
Matheus Alcantara <[email protected]> writes:
> 7. I'm wondering if we could add some TAP tests for this?
async.c already seems moderately well covered by existing tests:
src/test/regress/sql/async.sql
src/test/isolation/specs/async-notify.spec
Do we need more? If there's something not covered, can we extend
those test cases instead of spinning up a whole new installation
for a TAP test?
Also, I don't think it's the job of this patch to provide test
coverage for dshash. That should be quite well covered already.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 17:28 Joel Jacobson <[email protected]>
parent: Matheus Alcantara <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-07 17:28 UTC (permalink / raw)
To: Matheus Alcantara <[email protected]>; Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Tue, Oct 7, 2025, at 14:40, Matheus Alcantara wrote:
> This is not a complete review, I just read the v9 patch and summarized
> some points.
Many thanks for the review!
> 1. You may want to add ChannelEntry and ChannelHashKey to typedefs.list
> so that pgindent indents them correctly.
Fixed.
> 2. The ListCell* variables are normally named lc:
> + ListCell *p;
I agree, better to be consistent. I renamed the variables this patch
adds, but I didn't change the existing ListCell *p variables in async.c.
Would we want to harmonize it to just *lc everywhere in async.c?
I notice we also use ListCell *l in some places in async.c.
> 3. This block in ChannelHashRemoveListener() seems contradictory. You
> return early if channel_hash == NULL and then call init_channel_hash(),
> which itself returns early if channel_hash != NULL. So once we know that
> channel_hash != NULL, I don't think we need to call init_channel_hash()?
> + if (channel_hash == NULL)
> + return;
> +
> + init_channel_hash();
>
> A similar check also exists in SignalBackends():
> if (channel_hash == NULL)
> ...
> else
> {
> // channel_hash is != NULL, so init_channel_hash will early
> // return.
> init_channel_hash();
> ...
> }
Ahh, right, I agree. I've removed the unnecessary init_channel_hash()
calls.
> 4. The ChannelHashRemoveListener() release lock logic could be
> simplified to something like the following, what do you think?
> + if (entry->num_listeners == 0)
> + {
> + dsa_free(channel_dsa, entry->listeners_array);
> + dshash_delete_entry(channel_hash, entry);
> + }
> + break;
> + }
> + }
> +
> + /* Not found in list */
> + dshash_release_lock(channel_hash, entry);
That would be nicer, but I noted that dshash_delete_entry() releases the
lock just like dshash_release_lock(), so then I think we would need to
return; after dshash_delete_entry(), to prevent attempting to release
the lock twice?
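
To make the double-release concern concrete, here is a minimal standalone
sketch. It does not use the real dshash API: MockEntry, mock_lock(),
mock_release_lock(), mock_delete_entry() and mock_remove_listener() are
hypothetical stand-ins that only model the one property discussed above,
namely that deleting an entry also releases its lock, so the caller has to
return right after the delete instead of falling through to a release.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dshash entry plus its partition lock. */
typedef struct MockEntry
{
    int  num_listeners;
    bool lock_held;     /* true while the caller holds the lock */
    bool deleted;
} MockEntry;

/* Models a locked lookup: the caller gets the entry back with the lock held. */
static void
mock_lock(MockEntry *entry)
{
    assert(!entry->lock_held);
    entry->lock_held = true;
}

/* Models dshash_release_lock(): only legal while the lock is held. */
static void
mock_release_lock(MockEntry *entry)
{
    assert(entry->lock_held);   /* a second release would trip this */
    entry->lock_held = false;
}

/* Models dshash_delete_entry(): removes the entry AND releases the lock. */
static void
mock_delete_entry(MockEntry *entry)
{
    assert(entry->lock_held);
    entry->deleted = true;
    entry->lock_held = false;
}

/*
 * Remove one listener.  Returns true if the entry was deleted.
 * Every path gives up the lock exactly once.
 */
static bool
mock_remove_listener(MockEntry *entry)
{
    mock_lock(entry);

    if (entry->num_listeners > 0)
        entry->num_listeners--;

    if (entry->num_listeners == 0)
    {
        mock_delete_entry(entry);
        return true;            /* lock already gone: do not fall through */
    }

    mock_release_lock(entry);
    return false;
}
```

In this model, dropping the early return would hit the assertion in
mock_release_lock(), which is the same hazard as releasing the dshash
partition lock twice.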
> 5. You may want to use list_member() on GetPendingNotifyChannels() to
> avoid the inner loop to check for duplicate channel names.
Ahh, much nicer! Fixed.
> 6. s/beind/behind
> + /* Need to signal if a backend has fallen too far beind */
Fixed.
> 7. I'm wondering if we could add some TAP tests for this? I think that
> adding a case to ensure that we can grown the dshash correctly and also
> we manage multiple backends to the same channel properly. This CF [1]
> has some examples of how TAP tests can be created to test LISTEN/NOTIFY
I will look over the tests. Maybe we should add some elog DEBUG at the
new code paths, and ensure the tests at least cover all of them?
/Joel
Attachments:
[application/octet-stream] optimize_listen_notify-v10.patch (22.7K, 2-optimize_listen_notify-v10.patch)
From ed451ff79e16ea4991957e120421ca66c920b576 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 28 Sep 2025 14:53:57 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 464 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 430 insertions(+), 38 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..0fda6c02b4e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +73,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listeners_array; /* DSA pointer to ProcNumber array */
+ int num_listeners; /* Number of listeners currently stored */
+ int allocated_listeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channel_dsa = NULL;
+static dshash_table *channel_hash = NULL;
+static dshash_hash channel_hash_func(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channel_dsh_params = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channel_hash_func,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channel_hash_func
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channel_hash_func(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * init_channel_hash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+init_channel_hash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channel_hash_dsh != DSHASH_HANDLE_INVALID &&
+ channel_hash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channel_hash_dsh == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channel_dsa = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channel_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_create(channel_dsa, &channel_dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channel_hash_dsa = dsa_get_handle(channel_dsa);
+ asyncQueueControl->channel_hash_dsh =
+ dshash_get_hash_table_handle(channel_hash);
+ }
+ else if (!channel_hash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channel_dsa = dsa_attach(asyncQueueControl->channel_hash_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_attach(channel_dsa, &channel_dsh_params,
+ asyncQueueControl->channel_hash_dsh,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -457,6 +572,10 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel);
+static void ChannelHashRemoveListener(const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +640,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channel_hash_dsa = DSA_HANDLE_INVALID;
+ asyncQueueControl->channel_hash_dsh = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1152,6 +1275,7 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+ ChannelHashAddListener(channel);
}
/*
@@ -1175,6 +1299,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel);
break;
}
}
@@ -1193,9 +1318,18 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *lc;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ foreach(lc, listenChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+
+ ChannelHashRemoveListener(channel);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1699,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1721,10 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ bool *signaled;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,39 +1736,117 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ ChannelHashKey key;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channel_hash == NULL)
+ entry = NULL;
+ else
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channel_hash, &key, true);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int j = 0; j < entry->num_listeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+
+ dshash_release_lock(channel_hash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1657,6 +1877,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -1865,6 +2086,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2617,169 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register as a listener for the specified channel.
+ */
+static void
+ChannelHashAddListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listeners_array to InvalidDsaPointer as
+ * a marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channel_hash, &key, &found);
+
+ if (!found)
+ entry->listeners_array = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listeners_array))
+ {
+ /* First listener for this channel */
+ entry->listeners_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->num_listeners = 0;
+ entry->allocated_listeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channel_hash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
+
+ /* Add the new listener */
+ listeners[entry->num_listeners] = MyProcNumber;
+ entry->num_listeners++;
+
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Unregister as a listener for the specified channel.
+ */
+static void
+ChannelHashRemoveListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
+
+ if (channel_hash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channel_hash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ /* Found it, remove by shifting remaining elements */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->num_listeners - i));
+ }
+
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channel_hash, entry);
+ }
+ return;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *lc;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(lc, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(lc);
+ char *channel = n->data;
+
+ if (!list_member(channels, channel))
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3c80d49b67e..a0ef639d8ea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -411,6 +411,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 18:14 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-07 18:14 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
>> 7. I'm wondering if we could add some TAP tests for this? I think that
>> adding a case to ensure that we can grown the dshash correctly and also
>> we manage multiple backends to the same channel properly. This CF [1]
>> has some examples of how TAP tests can be created to test LISTEN/NOTIFY
> I will look over the tests. Maybe we should add some elog DEBUG at the
> new code paths, and ensure the tests at least cover all of them?
I went to do a coverage test on v10, and found that it does not get
through the existing async-notify isolation test: it panics with
"cannot abort transaction %u, it was already committed". It's a bit
premature to worry about adding new tests if you're not passing the
ones that are there.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 19:26 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-07 19:26 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
On Tue, Oct 7, 2025, at 20:14, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>>> 7. I'm wondering if we could add some TAP tests for this? I think that
>>> adding a case to ensure that we can grown the dshash correctly and also
>>> we manage multiple backends to the same channel properly. This CF [1]
>>> has some examples of how TAP tests can be created to test LISTEN/NOTIFY
>
>> I will look over the tests. Maybe we should add some elog DEBUG at the
>> new code paths, and ensure the tests at least cover all of them?
>
> I went to do a coverage test on v10, and found that it does not get
> through the existing async-notify isolation test: it panics with
> "cannot abort transaction %u, it was already committed". It's a bit
> premature to worry about adding new tests if you're not passing the
> ones that are there.
Oops, I see I got the list_member() code wrong. I've changed it to
create String nodes and then use strVal().
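
For context, list_member() compares elements with equal(), which expects
Node pointers, so it cannot be handed raw char * channel names. What
GetPendingNotifyChannels() is after is simply "each distinct channel name
once", compared by content rather than by pointer. The standalone sketch
below models that intent with plain strcmp(); ChannelSet, MAX_CHANNELS,
channel_set_contains(), channel_set_add() and collect_unique_channels() are
hypothetical names for illustration and are not part of PostgreSQL's List
API or of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHANNELS 16         /* arbitrary bound for this sketch */

/* Hypothetical fixed-size set of channel names (pointers are borrowed). */
typedef struct ChannelSet
{
    const char *names[MAX_CHANNELS];
    size_t      count;
} ChannelSet;

/* Membership test by string content, not by pointer identity. */
static int
channel_set_contains(const ChannelSet *set, const char *name)
{
    for (size_t i = 0; i < set->count; i++)
    {
        if (strcmp(set->names[i], name) == 0)
            return 1;
    }
    return 0;
}

/* Add a name unless an equal one is already present; returns 1 if added. */
static int
channel_set_add(ChannelSet *set, const char *name)
{
    if (channel_set_contains(set, name))
        return 0;
    assert(set->count < MAX_CHANNELS);
    set->names[set->count++] = name;
    return 1;
}

/* Collect the distinct names from a list of pending notifications. */
static size_t
collect_unique_channels(const char *const *pending, size_t n, ChannelSet *out)
{
    out->count = 0;
    for (size_t i = 0; i < n; i++)
        channel_set_add(out, pending[i]);
    return out->count;
}
```

Two notifications on the same channel typically carry separate copies of the
name, so a pointer comparison would treat them as different channels; the
content comparison is what makes the deduplication work.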
I also changed back to dshash_find(..., false) in SignalBackends(),
which makes more sense to me because we're not modifying the entry.
(The switch to true was only made because I was fooled by the false
alarm from the NetBSD animal.)
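
The third argument of dshash_find() selects an exclusive or a shared lock on
the entry's partition. The reason a shared lock is enough for a read-only
lookup can be sketched with a small standalone model. MockPartitionLock,
mock_try_lock() and mock_unlock() are hypothetical and use try-lock
semantics to keep the example single-threaded; the real lock blocks instead
of failing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one partition lock: many readers or one writer. */
typedef struct MockPartitionLock
{
    int  shared_holders;
    bool exclusive_held;
} MockPartitionLock;

/* Try to take the lock; returns false where a real lock would wait. */
static bool
mock_try_lock(MockPartitionLock *lock, bool exclusive)
{
    if (lock->exclusive_held)
        return false;           /* a writer excludes everyone */

    if (exclusive)
    {
        if (lock->shared_holders > 0)
            return false;       /* readers exclude a writer */
        lock->exclusive_held = true;
        return true;
    }

    lock->shared_holders++;     /* readers can coexist */
    return true;
}

/* Release a lock previously taken in the given mode. */
static void
mock_unlock(MockPartitionLock *lock, bool exclusive)
{
    if (exclusive)
    {
        assert(lock->exclusive_held);
        lock->exclusive_held = false;
    }
    else
    {
        assert(lock->shared_holders > 0);
        lock->shared_holders--;
    }
}
```

Under this model, several notifiers that only read a channel's listener
array can hold the lock at the same time, while LISTEN and UNLISTEN, which
modify the entry, still need it exclusively.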
/Joel
Attachments:
[application/octet-stream] optimize_listen_notify-v11.patch (22.8K, 2-optimize_listen_notify-v11.patch)
From 3b0cc568ae5ea92f6528b00620170e88759dde39 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 7 Oct 2025 20:56:47 +0200
Subject: [PATCH] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 466 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 432 insertions(+), 38 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..be5a394fc4f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +73,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,18 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "nodes/value.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +173,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listeners_array; /* DSA pointer to ProcNumber array */
+ int num_listeners; /* Number of listeners currently stored */
+ int allocated_listeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +261,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +280,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +323,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channel_dsa = NULL;
+static dshash_table *channel_hash = NULL;
+static dshash_hash channel_hash_func(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channel_dsh_params = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channel_hash_func,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channel_hash_func
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channel_hash_func(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * init_channel_hash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+init_channel_hash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channel_hash_dsh != DSHASH_HANDLE_INVALID &&
+ channel_hash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channel_hash_dsh == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channel_dsa = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channel_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_create(channel_dsa, &channel_dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channel_hash_dsa = dsa_get_handle(channel_dsa);
+ asyncQueueControl->channel_hash_dsh =
+ dshash_get_hash_table_handle(channel_hash);
+ }
+ else if (!channel_hash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channel_dsa = dsa_attach(asyncQueueControl->channel_hash_dsa);
+ dsa_pin_mapping(channel_dsa);
+ channel_hash = dshash_attach(channel_dsa, &channel_dsh_params,
+ asyncQueueControl->channel_hash_dsh,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +416,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeup_pending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -457,6 +573,10 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel);
+static void ChannelHashRemoveListener(const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +641,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channel_hash_dsa = DSA_HANDLE_INVALID;
+ asyncQueueControl->channel_hash_dsh = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1152,6 +1276,7 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+ ChannelHashAddListener(channel);
}
/*
@@ -1175,6 +1300,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel);
break;
}
}
@@ -1193,9 +1319,18 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *lc;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ foreach(lc, listenChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+
+ ChannelHashRemoveListener(channel);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1700,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1722,10 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ bool *signaled;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1594,39 +1737,117 @@ SignalBackends(void)
*/
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ signaled = (bool *) palloc0(MaxBackends * sizeof(bool));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = strVal(lfirst(lc));
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ ChannelHashKey key;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channel_hash == NULL)
+ entry = NULL;
+ else
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channel_hash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int j = 0; j < entry->num_listeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
}
- else
+
+ dshash_release_lock(channel_hash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ signaled[i] = true;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1657,6 +1878,7 @@ SignalBackends(void)
pfree(pids);
pfree(procnos);
+ pfree(signaled);
}
/*
@@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2618,170 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register as a listener for the specified channel.
+ */
+static void
+ChannelHashAddListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
+
+ init_channel_hash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listeners_array to InvalidDsaPointer as
+ * a marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channel_hash, &key, &found);
+
+ if (!found)
+ entry->listeners_array = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listeners_array))
+ {
+ /* First listener for this channel */
+ entry->listeners_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->num_listeners = 0;
+ entry->allocated_listeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (int i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channel_hash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
+
+ /* Add the new listener */
+ listeners[entry->num_listeners] = MyProcNumber;
+ entry->num_listeners++;
+
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Unregister as a listener for the specified channel.
+ */
+static void
+ChannelHashRemoveListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
+
+ if (channel_hash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channel_hash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ entry->listeners_array);
+
+ for (i = 0; i < entry->num_listeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ /* Found it, remove by shifting remaining elements */
+ entry->num_listeners--;
+ if (i < entry->num_listeners)
+ {
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->num_listeners - i));
+ }
+
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channel_hash, entry);
+ }
+ return;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *lc;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(lc, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(lc);
+ char *channel = n->data;
+ String *channel_str = makeString(channel);
+
+ if (!list_member(channels, channel_str))
+ channels = lappend(channels, channel_str);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 37f26f6c6b7..2d9e2ae2b02 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -411,6 +411,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 20:15 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-07 20:15 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> Oops, I see I got the list_member() code wrong. I've changed it to now
> create String nodes, and then use strVal().
Might be better to revert to the previous coding. Using String
nodes is going to roughly double the space eaten for the list,
and it seems like it's not buying you a lot.
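For illustration only (the previous coding is not shown in this message, so this is not it): a minimal standalone sketch of deduplicating channel names as plain C strings, where each entry costs just one pointer. ChannelNameList and both helpers are hypothetical names, and the fixed capacity is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHANNELS 64			/* hypothetical fixed capacity for this sketch */

/* Hypothetical stand-in for a list of plain char * channel names. */
typedef struct ChannelNameList
{
	const char *names[MAX_CHANNELS];
	size_t		count;
} ChannelNameList;

/* Return true if "channel" is already present, comparing string contents. */
static bool
channel_list_contains(const ChannelNameList *list, const char *channel)
{
	for (size_t i = 0; i < list->count; i++)
	{
		if (strcmp(list->names[i], channel) == 0)
			return true;
	}
	return false;
}

/* Append "channel" unless it is a duplicate; return true if it was appended. */
static bool
channel_list_add_unique(ChannelNameList *list, const char *channel)
{
	assert(list != NULL && channel != NULL);

	if (channel_list_contains(list, channel))
		return false;
	if (list->count == MAX_CHANNELS)
		return false;			/* sketch only: no growth beyond the capacity */

	list->names[list->count++] = channel;
	return true;
}
```

The list stores the caller's pointers without copying, so the strings must outlive the list; that matches a list that is built and consumed within a single function call.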
> I also changed back to dshash_find(..., false) in SignalBackends(),
> since that makes more sense to me, since we're not modifying entry.
Agreed.
I did a code coverage run and it seems like things are in pretty
good shape already. async.c is about 88% covered and a lot of the
omissions are either Trace_notify or unreached error reports, which
I'm not especially concerned about. The visible coverage gaps are:
1. asyncQueueFillWarning. This wasn't covered before either, because
it doesn't seem very practical to exercise it in an everyday
regression test. Since your patch doesn't touch that code nor the
queue contents, I'm not concerned here.
2. AtSubCommit_Notify's reparenting stanza. This is a pre-existing
omission too, but maybe worth doing something about?
3. AtSubAbort_Notify's pendingActions cleanup loop; same comments.
4. notification_match is not called at all. Again, pre-existing
coverage gap.
5. ChannelHashAddListener: "already registered" case is not reached,
which surprises me a bit, and neither is the "grow the array" stanza.
Since this is new code it might be worth adding coverage.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 21:14 Matheus Alcantara <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Matheus Alcantara @ 2025-10-07 21:14 UTC (permalink / raw)
To: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
On Tue Oct 7, 2025 at 1:51 PM -03, Tom Lane wrote:
> Matheus Alcantara <[email protected]> writes:
>> 7. I'm wondering if we could add some TAP tests for this?
>
> async.c seems already moderately well covered by existing tests
> src/test/regress/sql/async.sql
> src/test/isolation/specs/async-notify.spec
>
> Do we need more? If there's something not covered, can we extend
> those test cases instead of spinning up a whole new installation
> for a TAP test?
>
I ran the coverage report on v9 and it seems that we still have good
code coverage. I had expected the new branches to affect the coverage,
but that is not the case. Only a small piece of the newly added code is
not covered.
> Also, I don't think it's the job of this patch to provide test
> coverage for dshash. That should be quite well covered already.
>
When I mentioned testing that we can grow the dshash correctly, it was
because the v9 patch has logic to grow the array stored in the dshash
entry value, and that logic is currently not covered by the tests. I'm
not suggesting that we test the dshash internals, which I agree is not the
job of this patch. Sorry for the confusion.
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
--
Matheus Alcantara
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 21:17 Tom Lane <[email protected]>
parent: Matheus Alcantara <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-07 21:17 UTC (permalink / raw)
To: Matheus Alcantara <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
"Matheus Alcantara" <[email protected]> writes:
> On Tue Oct 7, 2025 at 1:51 PM -03, Tom Lane wrote:
>> Also, I don't think it's the job of this patch to provide test
>> coverage for dshash. That should be quite well covered already.
> When I mentioned testing that we can grow the dshash correctly, it was
> because the v9 patch has logic to grow the array stored in the dshash
> entry value, and that logic is currently not covered by the tests. I'm
> not suggesting that we test the dshash internals, which I agree is not the
> job of this patch. Sorry for the confusion.
Ah, yeah, I misunderstood what you meant. I agree that covering that
"Grow the array" stanza is a good idea, in fact I said the same thing
a little bit ago.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-07 21:22 Matheus Alcantara <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Matheus Alcantara @ 2025-10-07 21:22 UTC (permalink / raw)
To: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
On Tue Oct 7, 2025 at 6:17 PM -03, Tom Lane wrote:
> "Matheus Alcantara" <[email protected]> writes:
>> On Tue Oct 7, 2025 at 1:51 PM -03, Tom Lane wrote:
>>> Also, I don't think it's the job of this patch to provide test
>>> coverage for dshash. That should be quite well covered already.
>
>> When I mentioned testing that we can grow the dshash correctly, it was
>> because the v9 patch has logic to grow the array stored in the dshash
>> entry value, and that logic is currently not covered by the tests. I'm
>> not suggesting that we test the dshash internals, which I agree is not the
>> job of this patch. Sorry for the confusion.
>
> Ah, yeah, I misunderstood what you meant. I agree that covering that
> "Grow the array" stanza is a good idea, in fact I said the same thing
> a little bit ago.
>
Yeah, I only saw your response after I sent my email, and I agree with
all of its points. So I think we are on the same page now.
--
Matheus Alcantara
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-08 03:43 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 2 replies; 120+ messages in thread
From: Chao Li @ 2025-10-08 03:43 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; pgsql-hackers
After several rounds of review, the code is already in very good shape. I just have a few small comments:
> On Oct 8, 2025, at 03:26, Joel Jacobson <[email protected]> wrote:
>
>
> /Joel<optimize_listen_notify-v11.patch>
1
```
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
```
I don’t see where “channels” is freed. GetPendingNotifyChannels() creates a list of Nodes; both the list and the Nodes it points to should be freed.
2
```
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = strVal(lfirst(lc));
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ ChannelHashKey key;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channel_hash == NULL)
+ entry = NULL;
+ else
```
I wonder whether “channel_hash” can be NULL here. Maybe that is possible if a channel is unlistened while the event is pending?
If so, maybe add a comment here to explain the logic.
3
The same piece of code as 2.
I think the code can be simplified a little. First, we can initialize entry to NULL, so that we don’t need the if-else. Second, “key” is only used for dshash_find(), so it can be defined where it is used.
    foreach(lc, channels)
    {
        char *channel = strVal(lfirst(lc));
        ChannelEntry *entry = NULL;
        ProcNumber *listeners;
        //ChannelHashKey key;
        if (channel_hash != NULL)
        {
            ChannelHashKey key;
            ChannelHashPrepareKey(&key, MyDatabaseId, channel);
            entry = dshash_find(channel_hash, &key, false);
        }
        if (entry == NULL)
            continue; /* No listeners registered for this channel */
4
```
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
```
I wonder if “signaled[i]” is a duplicate flag of “QUEUE_BACKEND_WAKEUP_PENDING(i)”?
I understand that signaled is local, while QUEUE_BACKEND_WAKEUP_PENDING is in shared memory and may be set by other processes. But locally, whenever signaled[i] is set, QUEUE_BACKEND_WAKEUP_PENDING(i) is also set, and because we hold NotifyQueueLock, no other process should be able to clear the flag.
But if “signaled” is really needed, maybe we can use a Bitmapset (src/backend/nodes/bitmapset.c), which would use 1/8 of the memory of the bool array.
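To make the memory comparison concrete, here is a minimal standalone sketch of a one-bit-per-backend flag set. It is not PostgreSQL's Bitmapset, whose API differs; SignaledBitmap and its helpers are hypothetical names, and the 1/8 figure assumes sizeof(bool) is 1 and CHAR_BIT is 8.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical minimal bitmap; PostgreSQL's Bitmapset has a different API. */
typedef struct SignaledBitmap
{
	size_t		nbits;
	unsigned char *bytes;
} SignaledBitmap;

/* Number of bytes needed to hold nbits flags at one bit each. */
static size_t
signaled_bitmap_bytes(size_t nbits)
{
	return (nbits + CHAR_BIT - 1) / CHAR_BIT;
}

/* Allocate a zeroed bitmap for nbits flags; returns NULL on allocation failure. */
static SignaledBitmap *
signaled_bitmap_create(size_t nbits)
{
	SignaledBitmap *bm;

	assert(nbits > 0);
	bm = malloc(sizeof(SignaledBitmap));
	if (bm == NULL)
		return NULL;
	bm->nbits = nbits;
	bm->bytes = calloc(signaled_bitmap_bytes(nbits), 1);
	if (bm->bytes == NULL)
	{
		free(bm);
		return NULL;
	}
	return bm;
}

/* Mark flag i as set. */
static void
signaled_bitmap_set(SignaledBitmap *bm, size_t i)
{
	assert(i < bm->nbits);
	bm->bytes[i / CHAR_BIT] |= (unsigned char) (1u << (i % CHAR_BIT));
}

/* Return true if flag i is set. */
static bool
signaled_bitmap_test(const SignaledBitmap *bm, size_t i)
{
	assert(i < bm->nbits);
	return ((bm->bytes[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1u) != 0;
}

/* Release the bitmap and its storage. */
static void
signaled_bitmap_free(SignaledBitmap *bm)
{
	free(bm->bytes);
	free(bm);
}
```

Under those assumptions, 1000 backends need 125 bytes here versus 1000 bytes for the bool array, at the price of a shift and mask on every access.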
5
```
/*
@@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
```
This piece of code originally only read shared memory, so it could use LW_SHARED lock mode. Now that it also writes to shared memory, do we need to change the lock mode to “exclusive”?
6
```
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
```
Do we really need the memset()? If “channel” is of length NAMEDATALEN, then it still results in a key->channel that is not zero-terminated; if channel is shorter than NAMEDATALEN, strlcpy will automatically add a trailing ‘\0’. I think the previous code should already have ensured that the length of channel is less than NAMEDATALEN.
7
```
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +280,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
```
In the same structure, the rest of the fields are all in camel case, so I think it’s better to rename the new field to “wakeupPending”.
8
```
@@ -288,11 +323,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
```
Same as 7, but in this case, type names are not camel case, maybe okay for field names. I don’t have a strong opinion here.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-08 04:36 Chao Li <[email protected]>
parent: Chao Li <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Chao Li @ 2025-10-08 04:36 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; pgsql-hackers
> On Oct 8, 2025, at 11:43, Chao Li <[email protected]> wrote:
>
> 6
> ```
> +static inline void
> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
> +{
> + memset(key, 0, sizeof(ChannelHashKey));
> + key->dboid = dboid;
> + strlcpy(key->channel, channel, NAMEDATALEN);
> +}
> ```
>
> Do we really need the memset()? If “channel” is of length NAMEDATALEN, then it still results in a key->channel that is not zero-terminated; if channel is shorter than NAMEDATALEN, strlcpy will automatically add a trailing ‘\0’. I think the previous code should already have ensured that the length of channel is less than NAMEDATALEN.
For comment 6, my conclusion is the same: I don’t think memset() is needed. However, my previous explanation of strlcpy() was wrong; it describes strncpy() instead. strlcpy() always writes a terminating ‘\0’.
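One consideration that may bear on this question: the patch's channel_dsh_params uses dshash_memcmp, which compares all sizeof(ChannelHashKey) bytes, including those after the terminating ‘\0’. The standalone sketch below shows the effect under that assumption. DemoChannelKey and both helpers are hypothetical names, and the copy is written out by hand to mimic strlcpy() so the example does not depend on a particular libc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NAMEDATALEN 64		/* assumption: mirrors NAMEDATALEN's usual value */

/* Hypothetical stand-in for ChannelHashKey. */
typedef struct DemoChannelKey
{
	unsigned int dboid;
	char		channel[DEMO_NAMEDATALEN];
} DemoChannelKey;

/* Fill in a key, optionally zeroing it first as ChannelHashPrepareKey does. */
static void
demo_prepare_key(DemoChannelKey *key, unsigned int dboid,
				 const char *channel, bool zero_first)
{
	size_t		len;

	assert(key != NULL && channel != NULL);

	if (zero_first)
		memset(key, 0, sizeof(DemoChannelKey));
	key->dboid = dboid;

	/* Like strlcpy(): truncate if needed, always terminate, touch nothing else. */
	len = strlen(channel);
	if (len >= DEMO_NAMEDATALEN)
		len = DEMO_NAMEDATALEN - 1;
	memcpy(key->channel, channel, len);
	key->channel[len] = '\0';
}

/* Byte-wise equality over the whole struct, as a memcmp-based compare sees it. */
static bool
demo_keys_equal(const DemoChannelKey *a, const DemoChannelKey *b)
{
	return memcmp(a, b, sizeof(DemoChannelKey)) == 0;
}
```

Without the zeroing step, two keys for the same channel compare as equal strings but can differ in the bytes past the terminator, so a whole-struct memcmp may treat them as different keys.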
And one more nit comment:
9
```
+ int allocated_listeners; /* Allocated size of array */
```
For “size” here, I guess you meant “length”. “Size” also works, but it usually means the number of bytes occupied by an array, while “length” means its number of elements, so “length” would be clearer here.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-08 14:31 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-08 14:31 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> Oops, I see I got the list_member() code wrong. I've changed it to now
>> create String nodes, and then use strVal().
>
> Might be better to revert to the previous coding. Using String
> nodes is going to roughly double the space eaten for the list,
> and it seems like it's not buying you a lot.
>
>> I also changed back to dshash_find(..., false) in SignalBackends(),
>> since that makes more sense to me, since we're not modifying entry.
>
> Agreed.
>
> I did a code coverage run and it seems like things are in pretty
> good shape already. async.c is about 88% covered and a lot of the
> omissions are either Trace_notify or unreached error reports, which
> I'm not especially concerned about. The visible coverage gaps are:
>
> 1. asyncQueueFillWarning. This wasn't covered before either, because
> it doesn't seem very practical to exercise it in an everyday
> regression test. Since your patch doesn't touch that code nor the
> queue contents, I'm not concerned here.
I agree.
> 2. AtSubCommit_Notify's reparenting stanza. This is a pre-existing
> omission too, but maybe worth doing something about?
>
> 3. AtSubAbort_Notify's pendingActions cleanup loop; same comments.
>
> 4. notification_match is not called at all. Again, pre-existing
> coverage gap.
I've added test coverage for all three items above.
> 5. ChannelHashAddListener: "already registered" case is not reached,
> which surprises me a bit, and neither is the "grow the array" stanza.
> Since this is new code it might be worth adding coverage.
I've added a test for the "grow the array" stanza.
The "already registered" case seems impossible to reach, since the
caller, Exec_ListenCommit, returns early if IsListeningOn.
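To illustrate why five listeners on one channel are enough to reach the
"grow the array" stanza, here is a minimal standalone sketch of the
doubling logic. It is not the patch code: it uses malloc() instead of DSA
and plain ints instead of ProcNumber, and the names ListenerArray,
listener_array_init, listener_array_add and growCount are made up for this
example. Only INITIAL_LISTENERS_ARRAY_SIZE (4) is taken from the 0002 patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_LISTENERS_ARRAY_SIZE 4	/* same value as in the 0002 patch */

/* Simplified stand-in for ChannelEntry, using heap memory instead of DSA. */
typedef struct ListenerArray
{
	int		   *listeners;			/* stand-in for the ProcNumber array */
	int			numListeners;		/* slots currently in use */
	int			allocatedListeners; /* slots currently allocated */
	int			growCount;			/* hypothetical counter, illustration only */
} ListenerArray;

static void
listener_array_init(ListenerArray *a)
{
	a->listeners = malloc(sizeof(int) * INITIAL_LISTENERS_ARRAY_SIZE);
	assert(a->listeners != NULL);
	a->numListeners = 0;
	a->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
	a->growCount = 0;
}

static void
listener_array_add(ListenerArray *a, int procno)
{
	if (a->numListeners >= a->allocatedListeners)
	{
		/* Array is full: allocate double the size and copy the old contents */
		int			new_size = a->allocatedListeners * 2;
		int		   *new_listeners = malloc(sizeof(int) * new_size);

		assert(new_listeners != NULL);
		memcpy(new_listeners, a->listeners,
			   sizeof(int) * a->numListeners);
		free(a->listeners);
		a->listeners = new_listeners;
		a->allocatedListeners = new_size;
		a->growCount++;
	}

	a->listeners[a->numListeners++] = procno;
}
```

With an initial capacity of 4, the first four calls to listener_array_add()
leave the allocation untouched and the fifth doubles it to 8. That is the
situation the new permutation aims for, with five sessions each doing
LISTEN c1.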
Patches:
0001-optimize_listen_notify-v12.patch:
Improve LISTEN/NOTIFY test coverage
0002-optimize_listen_notify-v12.patch:
Optimize LISTEN/NOTIFY with channel-specific listener tracking
I split this into two patches, to make it easier to verify that the
second patch doesn't affect the tests added by the first patch. The 0001
patch also includes the "grow the array" test, which is pointless
without the 0002 patch, but it felt better to add it first anyway.
I've also made changes in v12 based on feedback from Chao Li, to which I
will reply shortly.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v12.patch (7.8K, 2-0001-optimize_listen_notify-v12.patch)
From 960f8aba7d76c35ba4049f6e94a11a4118e5a438 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 103 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 52 ++++++++++
2 files changed, 154 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..9c19843d2d7 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 5 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..942b09d5735 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,26 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +94,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v12.patch (22.4K, 3-0002-optimize_listen_notify-v12.patch)
From cd483d06907b0879e96983f2663b3b5b75a79eb5 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 7 Oct 2025 20:56:47 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 470 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 436 insertions(+), 38 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..efa25740c9c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -71,13 +73,17 @@
* make any actual updates to the effective listen state (listenChannels).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -457,6 +572,10 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static void ChannelHashAddListener(const char *channel);
+static void ChannelHashRemoveListener(const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +640,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1152,6 +1275,7 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+ ChannelHashAddListener(channel);
}
/*
@@ -1175,6 +1299,7 @@ Exec_UnlistenCommit(const char *channel)
{
listenChannels = foreach_delete_current(listenChannels, q);
pfree(lchan);
+ ChannelHashRemoveListener(channel);
break;
}
}
@@ -1193,9 +1318,18 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ ListCell *lc;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ foreach(lc, listenChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+
+ ChannelHashRemoveListener(channel);
+ }
+
list_free_deep(listenChannels);
listenChannels = NIL;
}
@@ -1565,12 +1699,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1721,9 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1596,37 +1737,109 @@ SignalBackends(void)
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1865,6 +2078,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2395,3 +2609,183 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * ChannelHashAddListener
+ * Register as a listener for the specified channel.
+ */
+static void
+ChannelHashAddListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
+
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as
+ * a marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ /* Need to add this listener */
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ /* Free old array and update entry */
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ /* Add the new listener */
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
+}
+
+/*
+ * ChannelHashRemoveListener
+ * Unregister as a listener for the specified channel.
+ */
+static void
+ChannelHashRemoveListener(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
+
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ /* Found it, remove by shifting remaining elements */
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ {
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+ }
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+ return;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channelHash, entry);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 37f26f6c6b7..2d9e2ae2b02 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -411,6 +411,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
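A note on ChannelHashPrepareKey() in the 0002 patch above: the table is set
up with dshash_memcmp, so keys are compared byte by byte over the whole
struct, including whatever follows the channel name's terminator. The
following standalone sketch (not the patch code) shows why zeroing the key
first matters. NAMEDATALEN is assumed to be 64 here, Oid is a stand-in
typedef, and copy_channel, prepare_key and prepare_key_no_zero are
hypothetical helpers written only for this example.

```c
#include <assert.h>
#include <string.h>

#define NAMEDATALEN 64			/* assumed value for this sketch */

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

typedef struct ChannelHashKey
{
	Oid			dboid;
	char		channel[NAMEDATALEN];
} ChannelHashKey;

/* Copy at most NAMEDATALEN - 1 bytes and terminate; does not pad the rest. */
static void
copy_channel(char *dst, const char *src)
{
	size_t		len = strlen(src);

	if (len >= NAMEDATALEN)
		len = NAMEDATALEN - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
}

/* Zero the whole key first, so bytes after the terminator are always 0. */
static void
prepare_key(ChannelHashKey *key, Oid dboid, const char *channel)
{
	memset(key, 0, sizeof(ChannelHashKey));
	key->dboid = dboid;
	copy_channel(key->channel, channel);
}

/* Same, but without the memset: trailing bytes keep their old contents. */
static void
prepare_key_no_zero(ChannelHashKey *key, Oid dboid, const char *channel)
{
	key->dboid = dboid;
	copy_channel(key->channel, channel);
}
```

Two keys built by prepare_key() for the same (dboid, channel) compare equal
under memcmp() regardless of what the memory held before. With
prepare_key_no_zero(), leftover bytes after the terminator can make two
logically identical keys compare as different.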
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-08 14:53 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-08 14:53 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; pgsql-hackers
On Wed, Oct 8, 2025, at 05:43, Chao Li wrote:
> After several rounds of reviewing, the code is already very good. I
> just got a few small comments:
Thanks for the feedback!
The below changes have been incorporated into the v12 version
sent in my previous email.
>> On Oct 8, 2025, at 03:26, Joel Jacobson <[email protected]> wrote:
>>
>>
>> /Joel<optimize_listen_notify-v11.patch>
>
>
> 1
> ```
> + channels = GetPendingNotifyChannels();
> +
> LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
> - for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i
> = QUEUE_NEXT_LISTENER(i))
> + foreach(lc, channels)
> ```
>
> I don’t see where “channels” is freed. GetPendingNotifyChannels()
> creates a list of Nodes; both the list and the Nodes it points to
> should be freed.
Per Tom Lane's suggestion, I reverted GetPendingNotifyChannels(),
so this comment is no longer applicable.
> 2
> ```
> + foreach(lc, channels)
> {
> - int32 pid = QUEUE_BACKEND_PID(i);
> - QueuePosition pos;
> + char *channel = strVal(lfirst(lc));
> + ChannelEntry *entry;
> + ProcNumber *listeners;
> + ChannelHashKey key;
>
> - Assert(pid != InvalidPid);
> - pos = QUEUE_BACKEND_POS(i);
> - if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
> + if (channel_hash == NULL)
> + entry = NULL;
> + else
> ```
>
> I wonder whether or not “channel_hash” can be NULL here? Maybe possible
> if a channel is un-listened while the event is pending?
Yes, I think channelHash can be NULL here if doing a NOTIFY
when there hasn't been a LISTEN yet.
> So, maybe add a comment here to explain the logic.
Not sure I think that's necessary.
What do you suggest that comment would say?
> 3
> The same piece of code as 2.
>
> I think the code can be optimized a little bit. First, we can
> initialize entry to NULL, then we don’t need the if-else. Second, “key” is
> only used for dshash_find(), so it can be defined where it is used.
>
> foreach(lc, channels)
> {
> char *channel = strVal(lfirst(lc));
> ChannelEntry *entry = NULL;
> ProcNumber *listeners;
> //ChannelHashKey key;
>
> if (channel_hash != NULL)
> {
> ChannelHashKey key;
> ChannelHashPrepareKey(&key, MyDatabaseId, channel);
> entry = dshash_find(channel_hash, &key, false);
> }
>
> if (entry == NULL)
> continue; /* No listeners registered for this channel */
Nice, I agree that's more readable, I changed it like that.
> 4
> ```
> + if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
> + continue;
> ```
>
> I wonder if “signaled[i]” is a duplicate flag of
> "QUEUE_BACKEND_WAKEUP_PENDING(i)”?
>
> I understand signaled is local, and QUEUE_BACKEND_WAKEUP_PENDING is in
> shared memory and may be set by other processes, but in local, when
> signaled[i] is set, QUEUE_BACKEND_WAKEUP_PENDING(i) is also set. And
> because of NotifyQueueLock, other processes should not be able to clean up
> the flag.
>
> But if “signaled” is really needed, maybe we can use Bitmapset
> (src/backend/nodes/bitmapset.c), which would use 1/8 of the memory
> compared to the bool array.
I agree, since we're holding an exclusive lock, the signaled array is redundant.
I've removed it, so that we rely only on the wakeupPending flag.
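A minimal sketch of why the flag alone is enough, not code from the patch:
BackendSlot and collect_pids_to_signal() are hypothetical names, and the
caller is assumed to hold the queue lock exclusively while this runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-backend slot kept in shared memory. */
typedef struct BackendSlot
{
	int			pid;			/* process to signal */
	bool		wakeupPending;	/* signal sent but not yet processed */
} BackendSlot;

/*
 * Collect the pids that still need a signal.  Because the caller is assumed
 * to hold the lock exclusively, the flag cannot change underneath us, so it
 * doubles as the "already handled in this pass" marker.
 * Returns the number of pids written to out_pids.
 */
static int
collect_pids_to_signal(BackendSlot *slots, const int *listeners,
					   int nlisteners, int *out_pids)
{
	int			count = 0;

	assert(slots != NULL && listeners != NULL && out_pids != NULL);

	for (int k = 0; k < nlisteners; k++)
	{
		BackendSlot *slot = &slots[listeners[k]];

		if (slot->wakeupPending)
			continue;			/* pending from earlier, or seen above */

		slot->wakeupPending = true;
		out_pids[count++] = slot->pid;
	}
	return count;
}
```

A backend that appears twice in the listener list, or that was already
flagged by an earlier NOTIFY, is skipped without any separate local array.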
> 5
> ```
> /*
> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
> LWLockAcquire(NotifyQueueLock, LW_SHARED);
> /* Assert checks that we have a valid state entry */
> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
> ```
>
> This piece of code originally only read the shared memory, so it can
> use LW_SHARED lock mode, but now it writes to the shared memory, do we
> need to change the lock mode to “exclusive”?
No, LW_SHARED is sufficient here, since the backend only modifies its own state,
and no other backend could do that, without holding an exclusive lock.
> 6
> ```
> +static inline void
> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
> +{
> + memset(key, 0, sizeof(ChannelHashKey));
> + key->dboid = dboid;
> + strlcpy(key->channel, channel, NAMEDATALEN);
> +}
> ```
>
> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
> then it still results in a non-0 terminated key->channel; if channel is
> shorter than NAMEDATALEN, strlcpy will auto add a trailing ‘\0’. I think
> previous code should have ensured length of channel should be less than
> NAMEDATALEN.
Yes, I think we need memset, since I fear that when the hash table keys
are compared, every byte of the struct might be inspected, so without
zero-initializing it, there could be unused bytes after the null
terminator, that could then cause logically identical keys to be wrongly
considered different.
I haven't checked the implementation, but my gut feeling says
it's better to be a bit paranoid here.
> 7
> ```
> *
> * Resist the temptation to make this really large. While that would save
> * work in some places, it would add cost in others. In particular, this
> @@ -246,6 +280,7 @@ typedef struct QueueBackendStatus
> Oid dboid; /* backend's database OID, or InvalidOid */
> ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
> QueuePosition pos; /* backend has read queue up to here */
> + bool wakeup_pending; /* signal sent but not yet processed */
> } QueueBackendStatus;
> ```
>
> In the same structure, rest of fields are all in camel case, I think
> it’s better to rename the new field to “wakeupPending”.
>
> 8
> ```
> @@ -288,11 +323,91 @@ typedef struct AsyncQueueControl
> ProcNumber firstListener; /* id of first listener, or
> * INVALID_PROC_NUMBER */
> TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
> + dsa_handle channel_hash_dsa;
> + dshash_table_handle channel_hash_dsh;
> QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
> ```
>
> Same as 7, but in this case, type names are not camel case, maybe okay
> for field names. I don’t have a strong opinion here.
I've done a major renaming of all new code, to better match the casing style.
It seems like helper functions and fields areNamedLikeThis, while
API-functions AreNamedLikeThis.
If we don't like this naming, I'm happy to change it again, please advise.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-08 18:46 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-10-08 18:46 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>> 5. ChannelHashAddListener: "already registered" case is not reached,
>> which surprises me a bit, and neither is the "grow the array" stanza.
> I've added a test for the "grow the array" stanza.
> The "already registered" case seems impossible to reach, since the
> caller, Exec_ListenCommit, returns early if IsListeningOn.
Maybe we should remove the check for "already registered" then,
or reduce it to an Assert? Seems pointless to check twice.
Or thinking a little bigger: why are we maintaining the set of
channels-listened-to both as a list and a hash? Could we remove
the list form?
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-09 01:11 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-09 01:11 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; pgsql-hackers
> On Oct 8, 2025, at 22:53, Joel Jacobson <[email protected]> wrote:
>
>> 1
>> ```
>> + channels = GetPendingNotifyChannels();
>> +
>> LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
>> - for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i
>> = QUEUE_NEXT_LISTENER(i))
>> + foreach(lc, channels)
>> ```
>>
>> I don’t see where “channels” is freed. GetPendingNotifyChannels()
>> creates a list of Nodes, both the list and Nodes the list points to
>> should be freed.
>
> Per suggestion from Tom Lane I reverted back GetPendingNotifyChannels(),
> so this comment is not applicable any longer.
I think you just reverted the usage of list_member() and makeNode(), but the returned “channels” is still built by “lappend()”, which allocates memory for the List structure. So you need to use “list_free(channels)” to free the memory.
>> 5
>> ```
>> /*
>> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
>> LWLockAcquire(NotifyQueueLock, LW_SHARED);
>> /* Assert checks that we have a valid state entry */
>> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>> ```
>>
>> This piece of code originally only read the shared memory, so it can
>> use LW_SHARED lock mode, but now it writes to the shared memory, do we
>> need to change the lock mode to “exclusive”?
>
> No, LW_SHARED is sufficient here, since the backend only modifies its own state,
> and no other backend could do that, without holding an exclusive lock.
Yes, the backend only modifies its own state to “false”, but other backends may set its state to “true”, which is a race condition. So I still think an exclusive lock is needed.
>
>> 6
>> ```
>> +static inline void
>> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
>> +{
>> + memset(key, 0, sizeof(ChannelHashKey));
>> + key->dboid = dboid;
>> + strlcpy(key->channel, channel, NAMEDATALEN);
>> +}
>> ```
>>
>> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
>> then it still results in a non-0 terminated key->channel; if channel is
>> shorter than NAMEDATALEN, strlcpy will auto add a trailing ‘\0’. I think
>> previous code should have ensured length of channel should be less than
>> NAMEDATALEN.
>
> Yes, I think we need memset, since I fear that when the hash table keys
> are compared, every byte of the struct might be inspected, so without
> zero-initializing it, there could be unused bytes after the null
> terminator, that could then cause logically identical keys to be wrongly
> considered different.
>
> I haven't checked the implementation though, but my gut feeling says
> it's better to be a bit paranoid here.
The hash function channel_hash_func() is defined by your own code; it uses strnlen() to get the length of the channel name, so bytes after ‘\0’ won’t be used.
And I guess you missed comment 9:
9
```
+ int allocated_listeners; /* Allocated size of array */
```
For “size” here, I guess you meant “length”, though “size” also works, but usually “size” means bytes occupied by an array and “length” means number of elements of an array. So, “length” would be clearer here.
And I got a new comment for v12:
10
```
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
```
Might be safer to do “strncmp(existing, channel, NAMEDATALEN)”.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-09 08:07 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-09 08:07 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; pgsql-hackers
On Thu, Oct 9, 2025, at 03:11, Chao Li wrote:
> I think you just reverted the usage of list_member() and makeNode(),
> but returned “channels” is still built by “lappend()” that allocates
> memory for the List structure. So you need to use “list_free(channels)”
> to free the memory.
Right. However, I'll first see whether I can make Tom's idea of removing the list form work instead.
>>> ```
>>> /*
>>> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
>>> LWLockAcquire(NotifyQueueLock, LW_SHARED);
>>> /* Assert checks that we have a valid state entry */
>>> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
>>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>>> ```
>>>
>>> This piece of code originally only read the shared memory, so it can
>>> use LW_SHARED lock mode, but now it writes to the shared memory, do we
>>> need to change the lock mode to “exclusive”?
>>
>> No, LW_SHARED is sufficient here, since the backend only modifies its own state,
>> and no other backend could do that, without holding an exclusive lock.
>
> Yes, the backend only modifies its own state to “false”, but other
> backends may set its state to “true”, that is a race condition. So I
> still think an exclusive lock is needed.
No, other backends cannot alter our state without holding an exclusive lock,
and they cannot obtain an exclusive lock on our backend until we've released
the shared lock we're holding.
>>> 6
>>> ```
>>> +static inline void
>>> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
>>> +{
>>> + memset(key, 0, sizeof(ChannelHashKey));
>>> + key->dboid = dboid;
>>> + strlcpy(key->channel, channel, NAMEDATALEN);
>>> +}
>>> ```
>>>
>>> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
>>> then it still results in a non-0 terminated key->channel; if channel is
>>> shorter than NAMEDATALEN, strlcpy will auto add a trailing ‘\0’. I think
>>> previous code should have ensured length of channel should be less than
>>> NAMEDATALEN.
>>
>> Yes, I think we need memset, since I fear that when the hash table keys
>> are compared, every byte of the struct might be inspected, so without
>> zero-initializing it, there could be unused bytes after the null
>> terminator, that could then cause logically identical keys to be wrongly
>> considered different.
>>
>> I haven't checked the implementation though, but my gut feeling says
>> it's better to be a bit paranoid here.
>
> The hash function channel_hash_func() is defined by your own code, it
> uses strnlen() to get the length of the channel name, so that bytes after ‘\0’
> won’t be used.
No, the hash function is not used for comparison.
We're using the default dshash_memcmp for comparison:
```
/* parameters for the channel hash table */
static const dshash_parameters channelDSHParams = {
sizeof(ChannelHashKey),
sizeof(ChannelEntry),
dshash_memcmp,
channelHashFunc,
dshash_memcpy,
LWTRANCHE_NOTIFY_CHANNEL_HASH
};
```
Looking at its implementation, we can see it's using memcmp under the hood:
```
/*
* A compare function that forwards to memcmp.
*/
int
dshash_memcmp(const void *a, const void *b, size_t size, void *arg)
{
return memcmp(a, b, size);
}
```
Here, the input parameter `size` comes from `sizeof(ChannelHashKey)`,
so it will include all bytes in the comparison.
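A small standalone sketch of that effect, not code from the patch: the
struct mirrors ChannelHashKey, but Oid is replaced by unsigned int,
NAMEDATALEN is assumed to be 64, and fill_key_without_clearing() is a
hypothetical variant that skips the memset.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAMEDATALEN 64			/* assumed default build value */

typedef struct ChannelHashKey
{
	unsigned int dboid;			/* stand-in for Oid */
	char		channel[NAMEDATALEN];
} ChannelHashKey;

/*
 * Hypothetical variant without the memset: like strlcpy, it writes the name
 * and one terminator, leaving the bytes after the terminator untouched.
 */
static void
fill_key_without_clearing(ChannelHashKey *key, unsigned int dboid,
						  const char *channel)
{
	size_t		len = strlen(channel);

	assert(len < NAMEDATALEN);
	key->dboid = dboid;
	memcpy(key->channel, channel, len);
	key->channel[len] = '\0';
}

/* Same as above, but zeroes the whole key first, as the patch does. */
static void
prepare_key(ChannelHashKey *key, unsigned int dboid, const char *channel)
{
	memset(key, 0, sizeof(ChannelHashKey));
	fill_key_without_clearing(key, dboid, channel);
}

/* Byte-wise comparison over the full key, as dshash_memcmp does. */
static bool
keys_equal(const ChannelHashKey *a, const ChannelHashKey *b)
{
	return memcmp(a, b, sizeof(ChannelHashKey)) == 0;
}
```

If two keys start out with different leftover bytes and are filled with the
same dboid and channel name via fill_key_without_clearing(), strcmp() on the
channel fields reports a match while keys_equal() does not. With
prepare_key() both comparisons agree.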
> And I guess you missed comment 9:
>
> 9
> ```
> + int allocated_listeners; /* Allocated size of array */
> ```
>
> For “size” here, I guess you meant “length”, though “size” also works,
> but usually “size” means bytes occupied by an array and “length” means
> number of elements of an array. So, “length” would be clearer here.
Agreed, will change.
> And I got a new comment for v12:
>
> 10
> ```
> + found = false;
> + foreach(q, channels)
> + {
> + char *existing = (char *) lfirst(q);
> +
> + if (strcmp(existing, channel) == 0)
> + {
> ```
>
> Might be safer to do “strncmp(existing, channel, NAMEDATALEN)”.
Good idea, will change.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-09 08:39 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Chao Li @ 2025-10-09 08:39 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Matheus Alcantara <[email protected]>; pgsql-hackers
> On Oct 9, 2025, at 16:07, Joel Jacobson <[email protected]> wrote:
>
>>>> ```
>>>> /*
>>>> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
>>>> LWLockAcquire(NotifyQueueLock, LW_SHARED);
>>>> /* Assert checks that we have a valid state entry */
>>>> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
>>>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>>>> ```
>>>>
>>>> This piece of code originally only read the shared memory, so it can
>>>> use LW_SHARED lock mode, but now it writes to the shared memory, do we
>>>> need to change the lock mode to “exclusive”?
>>>
>>> No, LW_SHARED is sufficient here, since the backend only modifies its own state,
>>> and no other backend could do that, without holding an exclusive lock.
>>
>> Yes, the backend only modifies its own state to “false”, but other
>> backends may set its state to “true”, that is a race condition. So I
>> still think an exclusive lock is needed.
>
> No, other backends cannot alter our state without holding an exclusive lock,
> and they cannot obtain an exclusive lock on our backend until we've released
> the shared lock we're holding.
>
Ah… That’s true. This comment is resolved.
>>>>
>>
>> The hash function channel_hash_func() is defined by your own code, it
>> uses strnlen() to get the length of the channel name, so that bytes after ‘\0’
>> won’t be used.
>
> No, the hash function is not used for comparison.
> We're using the default dshash_memcmp for comparison:
>
> ```
> /* parameters for the channel hash table */
> static const dshash_parameters channelDSHParams = {
> sizeof(ChannelHashKey),
> sizeof(ChannelEntry),
> dshash_memcmp,
> channelHashFunc,
> dshash_memcpy,
> LWTRANCHE_NOTIFY_CHANNEL_HASH
> };
> ```
>
> Looking at its implementation, we can see it's using memcmp under the hood:
>
> ```
> /*
> * A compare function that forwards to memcmp.
> */
> int
> dshash_memcmp(const void *a, const void *b, size_t size, void *arg)
> {
> return memcmp(a, b, size);
> }
> ```
>
> Here, the input parameter `size` comes from `sizeof(ChannelHashKey)`,
> so it will include all bytes in the comparison.
>
Okay, I think I misunderstood hash_function. So, this comment is also resolved.
I am thinking out loud. When a hash key is created, it has been memset to 0, meaning that in key->channel, all bytes after ‘\0’ are also 0, so there should not be any random bytes in the hash key. That means in channelHashFunc() we don’t need to use strnlen() anymore, which improves performance a little bit. Like this:
h = DatumGetUInt32(hash_uint32(k->dboid));
h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
sizeof(k->channel)));
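To illustrate the idea outside PostgreSQL, here is a hedged sketch that
hashes the whole fixed-width buffer. FNV-1a is only a stand-in for
hash_any(), NAMEDATALEN is assumed to be 64, and set_channel() and
hash_channel_full() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAMEDATALEN 64			/* assumed default build value */

/* FNV-1a over a byte range; a simple stand-in for hash_any(). */
static uint32_t
fnv1a(const unsigned char *data, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Zero the buffer first, then copy the name, so the padding is all zeros. */
static void
set_channel(char buf[NAMEDATALEN], const char *name)
{
	size_t		len = strlen(name);

	assert(len < NAMEDATALEN);
	memset(buf, 0, NAMEDATALEN);
	memcpy(buf, name, len);
}

/* Hash all NAMEDATALEN bytes; no length scan of the name is needed. */
static uint32_t
hash_channel_full(const char buf[NAMEDATALEN])
{
	return fnv1a((const unsigned char *) buf, NAMEDATALEN);
}
```

Because the padding is always zero, equal names give equal hashes no matter
what the buffer held before. The trade-off is that every lookup hashes all
NAMEDATALEN bytes even for short names.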
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-10 18:46 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-10 18:46 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>> which surprises me a bit, and neither is the "grow the array" stanza.
>
>> I've added a test for the "grow the array" stanza.
>
>> The "already registered" case seems impossible to reach, since the
>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>
> Maybe we should remove the check for "already registered" then,
> or reduce it to an Assert? Seems pointless to check twice.
>
> Or thinking a little bigger: why are we maintaining the set of
> channels-listened-to both as a list and a hash? Could we remove
> the list form?
Yes, it was indeed possible to remove the list form.
Some functions got a bit more complex, but I think it's worth it since a
single source of truth seems like an important design goal.
This also made LISTEN faster when a backend is listening on plenty of
channels, since we can now look up the channel in the hash, instead of
having to go through the list as before. The additional linear scan of
the listenersArray didn't add any noticeable extra cost even with
thousands of listening backends for the channel.
I also tried to keep listenersArray sorted and binary-search it, but
even with thousands of listening backends, I couldn't measure any
overall latency difference of LISTEN, so I kept the linear scan to keep
it simple.
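For readers following along, a minimal sketch of the linear-scan approach
with array growth. It is not the patch code: the real array lives in DSA
memory, whereas this uses realloc(), and the doubling policy,
ListenerArray, and listener_array_add() are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define INITIAL_LISTENERS_ARRAY_SIZE 4

typedef int ProcNumber;			/* stand-in for PostgreSQL's ProcNumber */

typedef struct ListenerArray
{
	ProcNumber *listeners;		/* heap array here; DSA in the patch */
	int			numListeners;	/* number of listeners currently stored */
	int			allocatedListeners; /* allocated length of the array */
} ListenerArray;

/*
 * Add proc unless it is already present.  Returns true if it was added,
 * false if it was already registered.
 */
static bool
listener_array_add(ListenerArray *arr, ProcNumber proc)
{
	assert(arr != NULL);

	/* Linear scan: fine in practice for the sizes discussed above. */
	for (int i = 0; i < arr->numListeners; i++)
	{
		if (arr->listeners[i] == proc)
			return false;
	}

	/* Grow the array when full (doubling is an assumed policy). */
	if (arr->numListeners == arr->allocatedListeners)
	{
		int			newAlloc = (arr->allocatedListeners == 0)
			? INITIAL_LISTENERS_ARRAY_SIZE
			: arr->allocatedListeners * 2;
		ProcNumber *grown = realloc(arr->listeners,
									(size_t) newAlloc * sizeof(ProcNumber));

		if (grown == NULL)
			abort();			/* out of memory; keep the sketch simple */
		arr->listeners = grown;
		arr->allocatedListeners = newAlloc;
	}

	arr->listeners[arr->numListeners++] = proc;
	return true;
}
```

Starting from an empty array, the fifth distinct listener triggers the
growth path, which is the case the new isolation test with five sessions
is meant to reach.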
In Exec_ListenCommit, I've now inlined code that is similar to
IsListeningOn. I didn't want to use IsListeningOn since it felt wasteful
having to do dshash_find, when we instead can just use
dshash_find_or_insert, to handle both cases.
I also added a static int numChannelsListeningOn variable, to avoid the
possibly expensive operation of going through the entire hash, to be
able to check `numChannelsListeningOn == 0` instead of the now removed
`listenChannels == NIL`. It's of course critical to keep
numChannelsListeningOn in sync, but I think it should be safe? It would of
course be better to avoid this variable. Maybe the extra cycles that
avoiding it would cost are worth it?
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v13.patch (7.8K, 2-0001-optimize_listen_notify-v13.patch)
From 53991adb8dc5a8a96a39c4eacaf85be06db4879f Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 103 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 52 ++++++++++
2 files changed, 154 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..9c19843d2d7 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 5 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..942b09d5735 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,26 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +94,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v13.patch (29.8K, 3-0002-optimize_listen_notify-v13.patch)
From f9319393dd97ea00ef637fda0a92f9d7b4e2fa19 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 7 Oct 2025 20:56:47 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 591 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 505 insertions(+), 90 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..bb5ebfab26d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,20 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -312,17 +427,11 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
-/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
- */
-static List *listenChannels = NIL; /* list of C strings */
-
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change the shared channelHash until we reach transaction
+ * commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -418,6 +527,9 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/* Count of channels we're currently listening on */
+static int numChannelsListeningOn = 0;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +569,8 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +635,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -683,7 +801,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the shared channelHash happens during transaction
* commit.
*/
static void
@@ -782,24 +900,60 @@ Async_UnlistenAll(void)
/*
* SQL function: return a set of the channel names this backend is actively
* listening to.
- *
- * Note: this coding relies on the fact that the listenChannels list cannot
- * change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* get channels from channelHash and store in function context */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ listenChannels = NIL;
+
+ if (channelHash != NULL)
+ {
+ dshash_seq_init(&status, channelHash, false);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ listenChannels = lappend(listenChannels, pstrdup(entry->key.channel));
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ }
+
+ funcctx->user_fctx = listenChannels;
+ MemoryContextSwitchTo(oldcontext);
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ listenChannels = (List *) funcctx->user_fctx;
if (funcctx->call_cntr < list_length(listenChannels))
{
@@ -957,7 +1111,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update channelHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1156,7 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/*
@@ -1130,55 +1284,131 @@ Exec_ListenPreCommit(void)
/*
* Exec_ListenCommit --- subroutine for AtCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Add the channel to the shared channelHash.
*/
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
- return;
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
/*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+ numChannelsListeningOn++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from channelHash.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ numChannelsListeningOn--;
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,33 +1423,82 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ numChannelsListeningOn = 0;
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
- foreach(p, listenChannels)
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channelHash, &key, false);
+ if (entry == NULL)
+ return false; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
return true;
+ }
}
+
+ dshash_release_lock(channelHash, entry);
return false;
}
@@ -1230,7 +1509,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(numChannelsListeningOn == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1565,12 +1844,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1866,9 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1596,37 +1882,109 @@ SignalBackends(void)
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2031,9 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * channelHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/* And clean up */
@@ -1865,6 +2223,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2186,7 +2545,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (numChannelsListeningOn == 0)
return;
if (Trace_notify)
@@ -2395,3 +2754,55 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-11 06:43 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-11 06:43 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
On Fri, Oct 10, 2025, at 20:46, Joel Jacobson wrote:
> On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
>> "Joel Jacobson" <[email protected]> writes:
>>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>>> which surprises me a bit, and neither is the "grow the array" stanza.
>>
>>> I've added a test for the "grow the array" stanza.
>>
>>> The "already registered" case seems impossible to reach, since the
>>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>>
>> Maybe we should remove the check for "already registered" then,
>> or reduce it to an Assert? Seems pointless to check twice.
>>
>> Or thinking a little bigger: why are we maintaining the set of
>> channels-listened-to both as a list and a hash? Could we remove
>> the list form?
>
> Yes, it was indeed possible to remove the list form.
>
> Some functions got a bit more complex, but I think it's worth it since a
> single source of truth seems like an important design goal.
>
> This also made LISTEN faster when a backend is listening on plenty of
> channels, since we can now lookup the channel in the hash, instead of
> having to go through the list as before. The additional linear scan of
> the listenersArray didn't add any noticeable extra cost even with
> thousands of listening backends for the channel.
>
> I also tried to keep listenersArray sorted and binary-search it, but
> even with thousands of listening backends, I couldn't measure any
> overall latency difference of LISTEN, so I kept the linear scan to keep
> it simple.
>
> In Exec_ListenCommit, I've now inlined code that is similar to
> IsListeningOn. I didn't want to use IsListeningOn since it felt wasteful
> having to do dshash_find, when we instead can just use
> dshash_find_or_insert, to handle both cases.
>
> I also added a static int numChannelsListeningOn variable, to avoid the
> possibly expensive operation of going through the entire hash, to be
> able to check `numChannelsListeningOn == 0` instead of the now removed
> `listenChannels == NIL`. It's of course critical to keep
> numChannelsListeningOn in sync, but I think it should be safe? Would of
> course be better to avoid this variable. Maybe the extra cycles that
> would cost would be worth it?
In addition to the previously suggested optimization, there is another
major one that seems doable. It would be a great improvement for
workloads with large traffic differences between channels, i.e. some
low-traffic channels and some high-traffic ones.
I'm not entirely sure this approach is correct, though; I might have
misunderstood the guarantees of the heavyweight lock. My assumption is
that only one backend at a time can be running the code in
PreCommit_Notify after having acquired the heavyweight lock. If this is
not true, then the approach doesn't work. What worries me is the
exclusive lock we also take inside the same function. I don't see the
point of it since we're already holding the heavyweight lock, but maybe
this is just because it "allows deadlocks to be detected", as the
comment says?
---
Patches:
* 0001-optimize_listen_notify-v14.patch:
Just adds additional test coverage of async.c
* 0002-optimize_listen_notify-v14.patch:
Adds the shared channel hash.
Unchanged since 0002-optimize_listen_notify-v13.patch.
* 0003-optimize_listen_notify-v14.patch:
Optimize LISTEN/NOTIFY by advancing idle backends directly
Building on the previous channel-specific listener tracking
optimization, this patch further reduces context switching by detecting
idle listening backends that don't listen to any of the channels being
notified and advancing their queue positions directly without waking
them up.
When a backend commits notifications, it now saves the queue head
position both before and after writing. In SignalBackends(), backends
that are at the old queue head and weren't marked for wakeup (meaning
they don't listen to any of the notified channels) are advanced directly
to the new queue head. This eliminates unnecessary wakeups for these
backends, which would otherwise wake up, scan through all the
notifications, skip each one, and end up at the same position anyway.
The implementation carefully handles the race condition where other
backends may write notifications after the heavyweight lock is released
but before SignalBackends() is called. By saving queueHeadAfterWrite
immediately after writing (before releasing the lock), we ensure
backends are only advanced over the exact notifications we wrote, not
notifications from other concurrent backends.
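To make the idea concrete, here is a minimal, self-contained sketch of
the direct-advance step. It is not the code from 0003: the names
ListenerState and advance_idle_listeners are hypothetical, queue
positions are modelled as plain integers rather than QueuePosition
(page, offset) values, and all locking is omitted. It only illustrates
the rule described above: a listener is moved to the new head only if it
was not marked for wakeup and sits exactly at the head we saw before
writing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for one listener's shared state.
 * The real patch keeps a QueuePosition (page, offset) per backend.
 */
typedef struct ListenerState
{
	int			pos;			/* queue position read up to */
	bool		wakeup_pending; /* chosen for a signal by the notifier */
} ListenerState;

/*
 * Advance listeners that are not interested in our notifications.
 *
 * old_head is the queue head saved before we wrote our notifications,
 * new_head the head saved right after writing them.  Returns the number
 * of listeners that were advanced without being woken.
 */
size_t
advance_idle_listeners(ListenerState *listeners, size_t n,
					   int old_head, int new_head)
{
	size_t		advanced = 0;

	assert(old_head <= new_head);

	for (size_t i = 0; i < n; i++)
	{
		/* Marked for wakeup: it listens on a notified channel. */
		if (listeners[i].wakeup_pending)
			continue;

		/*
		 * Not exactly at the old head: there are entries it has not
		 * read that we did not write, so it must not be skipped ahead.
		 */
		if (listeners[i].pos != old_head)
			continue;

		/* Only our own entries lie in between; skip over them. */
		listeners[i].pos = new_head;
		advanced++;
	}

	return advanced;
}
```

With three listeners at positions 5, 5 and 3, the second one marked for
wakeup, old head 5 and new head 8, only the first listener is advanced
to 8. The second is left for the signal, and the third stays where it
is because it is still behind the old head.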
---
Benchmark:
% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=10
pgbench_patched: starting LISTEN/NOTIFY round-trip benchmark
pgbench_patched: round-trips per iteration: 10000
pgbench_patched: idle listeners added per iteration: 10
master:
idle_listeners  round_trips_per_sec  max_latency_usec
0               33592.9              2278
10              14251.1              1041
20              9258.7               1367
30              6144.2               2277
40              4653.1               1690
50              3780.7               2869
60              3234.9               3215
70              2818.9               3652
80              2458.7               3219
90              2203.1               3505
100             1951.9               1739
0002-optimize_listen_notify-v14.patch:
idle_listeners  round_trips_per_sec  max_latency_usec
0               33936.2              889
10              30631.9              1233
20              22404.7              7862
30              19446.2              9539
40              16013.3              13963
50              14310.1              16983
60              12827.0              21363
70              11271.9              24775
80              10764.4              28703
90              9568.1               31693
100             9241.3               32724
0003-optimize_listen_notify-v14.patch:
idle_listeners  round_trips_per_sec  max_latency_usec
0               33236.8              1090
10              34681.0              1338
20              34530.4              1372
30              34061.6              1339
40              33084.5              913
50              33847.5              955
60              33675.8              1239
70              28857.4              20443
80              33324.9              786
90              33612.3              758
100             31259.2              7706
As we can see, with 0002 the ping-pong round-trips per second degrade
much more slowly than on master. However, the idle listening backends
still need to be woken up at some point. There are far fewer wakeups,
and they are staggered over time, but they still bring the rate down
from 33k to 9k with 100 idle listening backends. With 0003, the
round-trips per second are sustained, largely unaffected by additional
idle listening backends.
I've also attached the pgbench patch as a .txt file,
pgbench-listen-notify-benchmark-patch.txt. It's not part of this patch;
it's just provided to help others verify the results.
/Joel
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..b462dcc8348 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -35,6 +35,7 @@
#include <ctype.h>
#include <float.h>
+#include <inttypes.h>
#include <limits.h>
#include <math.h>
#include <signal.h>
@@ -237,6 +238,11 @@ static const char *const PARTITION_METHOD[] = {"none", "range", "hash"};
/* random seed used to initialize base_random_sequence */
static int64 random_seed = -1;
+/* LISTEN/NOTIFY benchmark mode parameters */
+static bool listen_notify_mode = false; /* enable LISTEN/NOTIFY benchmark */
+static int notify_round_trips = 100; /* number of round-trips per iteration */
+static int notify_idle_step = 10; /* idle listeners to add per iteration */
+
/*
* end of configurable parameters
*********************************************************************/
@@ -930,6 +936,10 @@ usage(void)
" (same as \"-b simple-update\")\n"
" -S, --select-only perform SELECT-only transactions\n"
" (same as \"-b select-only\")\n"
+ " --listen-notify-benchmark\n"
+ " run LISTEN/NOTIFY round-trip benchmark\n"
+ " --notify-round-trips=NUM number of round-trips per iteration (default: 100)\n"
+ " --notify-idle-step=NUM idle listeners to add per iteration (default: 10)\n"
"\nBenchmarking options:\n"
" -c, --client=NUM number of concurrent database clients (default: 1)\n"
" -C, --connect establish new connection for each transaction\n"
@@ -6689,6 +6699,216 @@ set_random_seed(const char *seed)
return true;
}
+/*
+ * Run LISTEN/NOTIFY round-trip benchmark
+ *
+ * This benchmark measures the round-trip time between two processes that
+ * ping-pong NOTIFY messages while adding idle listening connections.
+ */
+static void
+runListenNotifyBenchmark(void)
+{
+ PGconn *conn1 = NULL;
+ PGconn *conn2 = NULL;
+ PGconn **idle_conns = NULL;
+ int num_idle = 0;
+ int max_idle = 10000; /* reasonable upper limit */
+ PGresult *res;
+ char channel1[] = "pgbench_channel_1";
+ char channel2[] = "pgbench_channel_2";
+ char notify_cmd[256];
+ bool first_failure = false;
+
+ pg_log_info("starting LISTEN/NOTIFY round-trip benchmark");
+ pg_log_info("round-trips per iteration: %d", notify_round_trips);
+ pg_log_info("idle listeners added per iteration: %d", notify_idle_step);
+ printf("\n%14s %19s %19s\n", "idle_listeners", "round_trips_per_sec", "max_latency_usec");
+
+ /* Allocate array for idle connections */
+ idle_conns = (PGconn **) pg_malloc0(max_idle * sizeof(PGconn *));
+
+ /* Create two active connections for ping-pong */
+ conn1 = doConnect();
+ if (conn1 == NULL)
+ pg_fatal("failed to create connection 1");
+
+ conn2 = doConnect();
+ if (conn2 == NULL)
+ pg_fatal("failed to create connection 2");
+
+ /* Set up LISTEN on both connections */
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel1);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 1: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel2);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 2: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Main benchmark loop: measure round-trips then add idle connections */
+ while (num_idle < max_idle)
+ {
+ int i;
+ int64 total_latency = 0;
+ int64 max_latency = 0;
+
+ /* Perform round-trip measurements */
+ for (i = 0; i < notify_round_trips; i++)
+ {
+ pg_time_usec_t start_time,
+ end_time;
+ int64 latency;
+ PGnotify *notify;
+ int sock;
+ fd_set input_mask;
+ struct timeval tv;
+
+ /* Clear any pending notifications */
+ PQconsumeInput(conn1);
+ while ((notify = PQnotifies(conn1)) != NULL)
+ PQfreemem(notify);
+ PQconsumeInput(conn2);
+ while ((notify = PQnotifies(conn2)) != NULL)
+ PQfreemem(notify);
+
+ /* Start timer and send notification from conn1 */
+ start_time = pg_time_now();
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel2);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ /* Wait for notification on conn2 */
+ sock = PQsocket(conn2);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn2);
+ notify = PQnotifies(conn2);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* Send notification back from conn2 */
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel1);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Wait for notification on conn1 */
+ sock = PQsocket(conn1);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn1);
+ notify = PQnotifies(conn1);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* End timer */
+ end_time = pg_time_now();
+
+ /* Calculate individual round-trip latency */
+ latency = end_time - start_time;
+
+ /* Accumulate total latency and track maximum */
+ total_latency += latency;
+ if (latency > max_latency)
+ max_latency = latency;
+ }
+
+ /* Calculate and report round-trips per second and max latency */
+ fprintf(stdout, "%14d %19.1f %19" PRId64 "\n",
+ num_idle,
+ 1000000.0 * notify_round_trips / total_latency,
+ max_latency);
+ fflush(stdout);
+
+ /* Stop if we hit connection limit */
+ if (first_failure)
+ break;
+
+ /* Add idle listening connections */
+ for (i = 0; i < notify_idle_step && num_idle < max_idle; i++)
+ {
+ PGconn *idle_conn;
+ char idle_channel[256];
+
+ idle_conn = doConnect();
+ if (idle_conn == NULL)
+ {
+ if (!first_failure)
+ {
+ pg_log_info("reached max_connections at %d idle listeners", num_idle);
+ first_failure = true;
+ }
+ break;
+ }
+
+ /* Each idle connection listens on a unique channel */
+ snprintf(idle_channel, sizeof(idle_channel), "idle_%d", num_idle);
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", idle_channel);
+
+ res = PQexec(idle_conn, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ pg_log_warning("LISTEN failed on idle connection %d: %s",
+ num_idle, PQerrorMessage(idle_conn));
+ PQfinish(idle_conn);
+ PQclear(res);
+ first_failure = true;
+ break;
+ }
+ PQclear(res);
+
+ idle_conns[num_idle] = idle_conn;
+ num_idle++;
+ }
+
+ /* Stop if we couldn't add any connections */
+ if (first_failure && i == 0)
+ break;
+ }
+
+ /* Clean up */
+ pg_log_info("cleaning up connections");
+ PQfinish(conn1);
+ PQfinish(conn2);
+ for (int i = 0; i < num_idle; i++)
+ {
+ if (idle_conns[i])
+ PQfinish(idle_conns[i]);
+ }
+ pg_free(idle_conns);
+
+ pg_log_info("LISTEN/NOTIFY benchmark completed");
+}
+
int
main(int argc, char **argv)
{
@@ -6739,6 +6959,9 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"listen-notify-benchmark", no_argument, NULL, 18},
+ {"notify-round-trips", required_argument, NULL, 19},
+ {"notify-idle-step", required_argument, NULL, 20},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7315,22 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* listen-notify-benchmark */
+ listen_notify_mode = true;
+ benchmarking_option_set = true;
+ break;
+ case 19: /* notify-round-trips */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-round-trips", 1, INT_MAX,
+ &notify_round_trips))
+ exit(1);
+ break;
+ case 20: /* notify-idle-step */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-idle-step", 1, INT_MAX,
+ &notify_idle_step))
+ exit(1);
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7210,6 +7449,20 @@ main(int argc, char **argv)
pg_fatal("some of the specified options cannot be used in benchmarking mode");
}
+ /* Handle LISTEN/NOTIFY benchmark mode */
+ if (listen_notify_mode)
+ {
+ /* Establish a database connection for setup */
+ if ((con = doConnect()) == NULL)
+ pg_fatal("could not connect to database");
+
+ /* Run the LISTEN/NOTIFY benchmark */
+ runListenNotifyBenchmark();
+
+ PQfinish(con);
+ exit(0);
+ }
+
if (nxacts > 0 && duration > 0)
pg_fatal("specify either a number of transactions (-t) or a duration (-T), not both");
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v14.patch (7.8K, 2-0001-optimize_listen_notify-v14.patch)
From 183c8a106705a6391cd68f406019253d36680da4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/3] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 103 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 52 ++++++++++
2 files changed, 154 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..9c19843d2d7 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 5 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..942b09d5735 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,26 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +94,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v14.patch (29.8K, 3-0002-optimize_listen_notify-v14.patch)
From b39c5b71f6d6b219ab06c6b731e2317f480edf5d Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 7 Oct 2025 20:56:47 +0200
Subject: [PATCH 2/3] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 591 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 505 insertions(+), 90 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..bb5ebfab26d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,20 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -312,17 +427,11 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
-/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
- */
-static List *listenChannels = NIL; /* list of C strings */
-
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change the shared channelHash until we reach transaction
+ * commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -418,6 +527,9 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/* Count of channels we're currently listening on */
+static int numChannelsListeningOn = 0;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +569,8 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +635,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -683,7 +801,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the shared channelHash happens during transaction
* commit.
*/
static void
@@ -782,24 +900,60 @@ Async_UnlistenAll(void)
/*
* SQL function: return a set of the channel names this backend is actively
* listening to.
- *
- * Note: this coding relies on the fact that the listenChannels list cannot
- * change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* get channels from channelHash and store in function context */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ listenChannels = NIL;
+
+ if (channelHash != NULL)
+ {
+ dshash_seq_init(&status, channelHash, false);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ listenChannels = lappend(listenChannels, pstrdup(entry->key.channel));
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ }
+
+ funcctx->user_fctx = listenChannels;
+ MemoryContextSwitchTo(oldcontext);
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ listenChannels = (List *) funcctx->user_fctx;
if (funcctx->call_cntr < list_length(listenChannels))
{
@@ -957,7 +1111,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update channelHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1156,7 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/*
@@ -1130,55 +1284,131 @@ Exec_ListenPreCommit(void)
/*
* Exec_ListenCommit --- subroutine for AtCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Add the channel to the shared channelHash.
*/
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
- return;
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
/*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+ numChannelsListeningOn++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from channelHash.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ numChannelsListeningOn--;
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,33 +1423,82 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ numChannelsListeningOn = 0;
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
- foreach(p, listenChannels)
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channelHash, &key, false);
+ if (entry == NULL)
+ return false; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
return true;
+ }
}
+
+ dshash_release_lock(channelHash, entry);
return false;
}
@@ -1230,7 +1509,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(numChannelsListeningOn == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1565,12 +1844,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1866,9 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1596,37 +1882,109 @@ SignalBackends(void)
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2031,9 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * channelHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/* And clean up */
@@ -1865,6 +2223,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2186,7 +2545,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (numChannelsListeningOn == 0)
return;
if (Trace_notify)
@@ -2395,3 +2754,55 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
[application/octet-stream] 0003-optimize_listen_notify-v14.patch (5.0K, 4-0003-optimize_listen_notify-v14.patch)
From e8dcd9e1d035d8d711312a83daa791da6d8906a9 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 11 Oct 2025 07:28:57 +0200
Subject: [PATCH 3/3] Optimize LISTEN/NOTIFY by advancing idle backends
directly
Building on the previous channel-specific listener tracking
optimization, this patch further reduces context switching by detecting
idle listening backends that don't listen to any of the channels being
notified and advancing their queue positions directly without waking
them up.
When a backend commits notifications, it now saves both the queue head
position before and after writing. In SignalBackends(), backends that
are at the old queue head and weren't marked for wakeup (meaning they
don't listen to any of the notified channels) are advanced directly to
the new queue head. This eliminates unnecessary wakeups for these
backends, which would otherwise wake up, scan through all the
notifications, skip each one, and advance to the same position anyway.
The implementation carefully handles the race condition where other
backends may write notifications after the heavyweight lock is released
but before SignalBackends() is called. By saving queueHeadAfterWrite
immediately after writing (before releasing the lock), we ensure
backends are only advanced over the exact notifications we wrote, not
notifications from other concurrent backends.
---
src/backend/commands/async.c | 62 ++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index bb5ebfab26d..a4cc7395c08 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -500,6 +500,8 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ QueuePosition queueHeadBeforeWrite; /* QUEUE_HEAD before writing notifies */
+ QueuePosition queueHeadAfterWrite; /* QUEUE_HEAD after writing notifies */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -1048,6 +1050,7 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -1076,6 +1079,9 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(pendingNotifies->queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -1093,6 +1099,19 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ pendingNotifies->queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
@@ -1102,6 +1121,18 @@ PreCommit_Notify(void)
LWLockRelease(NotifyQueueLock);
}
+ /*
+ * Save the queue head after writing all our notifications. This is
+ * used by SignalBackends() to know where to advance idle backends to.
+ * We must save this now because other backends may write their own
+ * notifications after we release the heavyweight lock but before we
+ * call SignalBackends(), and we must not advance backends over those
+ * other notifications.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ pendingNotifies->queueHeadAfterWrite = QUEUE_HEAD;
+ LWLockRelease(NotifyQueueLock);
+
/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
}
}
@@ -1934,6 +1965,37 @@ SignalBackends(void)
dshash_release_lock(channelHash, entry);
}
+ /*
+ * Avoid needing to wake listening backends that are at the old queue head
+ * (before we wrote our notifications) that we know are not interested in
+ * our notifications, since otherwise they would have been marked for
+ * wakeup by now. Do this by advancing them directly to the new queue
+ * head.
+ */
+ if (pendingNotifies != NULL)
+ {
+ QueuePosition oldHead = pendingNotifies->queueHeadBeforeWrite;
+ QueuePosition newHead = pendingNotifies->queueHeadAfterWrite;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
+ {
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, oldHead) &&
+ QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ {
+ QUEUE_BACKEND_POS(i) = newHead;
+ }
+ }
+ }
+
queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(QUEUE_TAIL));
--
2.50.1
[text/plain] pgbench-listen-notify-benchmark-patch.txt (9.3K, 5-pgbench-listen-notify-benchmark-patch.txt)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..b462dcc8348 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -35,6 +35,7 @@
#include <ctype.h>
#include <float.h>
+#include <inttypes.h>
#include <limits.h>
#include <math.h>
#include <signal.h>
@@ -237,6 +238,11 @@ static const char *const PARTITION_METHOD[] = {"none", "range", "hash"};
/* random seed used to initialize base_random_sequence */
static int64 random_seed = -1;
+/* LISTEN/NOTIFY benchmark mode parameters */
+static bool listen_notify_mode = false; /* enable LISTEN/NOTIFY benchmark */
+static int notify_round_trips = 100; /* number of round-trips per iteration */
+static int notify_idle_step = 10; /* idle listeners to add per iteration */
+
/*
* end of configurable parameters
*********************************************************************/
@@ -930,6 +936,10 @@ usage(void)
" (same as \"-b simple-update\")\n"
" -S, --select-only perform SELECT-only transactions\n"
" (same as \"-b select-only\")\n"
+ " --listen-notify-benchmark\n"
+ " run LISTEN/NOTIFY round-trip benchmark\n"
+ " --notify-round-trips=NUM number of round-trips per iteration (default: 100)\n"
+ " --notify-idle-step=NUM idle listeners to add per iteration (default: 10)\n"
"\nBenchmarking options:\n"
" -c, --client=NUM number of concurrent database clients (default: 1)\n"
" -C, --connect establish new connection for each transaction\n"
@@ -6689,6 +6699,216 @@ set_random_seed(const char *seed)
return true;
}
+/*
+ * Run LISTEN/NOTIFY round-trip benchmark
+ *
+ * This benchmark measures the round-trip time between two processes that
+ * ping-pong NOTIFY messages while adding idle listening connections.
+ */
+static void
+runListenNotifyBenchmark(void)
+{
+ PGconn *conn1 = NULL;
+ PGconn *conn2 = NULL;
+ PGconn **idle_conns = NULL;
+ int num_idle = 0;
+ int max_idle = 10000; /* reasonable upper limit */
+ PGresult *res;
+ char channel1[] = "pgbench_channel_1";
+ char channel2[] = "pgbench_channel_2";
+ char notify_cmd[256];
+ bool first_failure = false;
+
+ pg_log_info("starting LISTEN/NOTIFY round-trip benchmark");
+ pg_log_info("round-trips per iteration: %d", notify_round_trips);
+ pg_log_info("idle listeners added per iteration: %d", notify_idle_step);
+ printf("\n%14s %19s %19s\n", "idle_listeners", "round_trips_per_sec", "max_latency_usec");
+
+ /* Allocate array for idle connections */
+ idle_conns = (PGconn **) pg_malloc0(max_idle * sizeof(PGconn *));
+
+ /* Create two active connections for ping-pong */
+ conn1 = doConnect();
+ if (conn1 == NULL)
+ pg_fatal("failed to create connection 1");
+
+ conn2 = doConnect();
+ if (conn2 == NULL)
+ pg_fatal("failed to create connection 2");
+
+ /* Set up LISTEN on both connections */
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel1);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 1: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel2);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 2: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Main benchmark loop: measure round-trips then add idle connections */
+ while (num_idle < max_idle)
+ {
+ int i;
+ int64 total_latency = 0;
+ int64 max_latency = 0;
+
+ /* Perform round-trip measurements */
+ for (i = 0; i < notify_round_trips; i++)
+ {
+ pg_time_usec_t start_time,
+ end_time;
+ int64 latency;
+ PGnotify *notify;
+ int sock;
+ fd_set input_mask;
+ struct timeval tv;
+
+ /* Clear any pending notifications */
+ PQconsumeInput(conn1);
+ while ((notify = PQnotifies(conn1)) != NULL)
+ PQfreemem(notify);
+ PQconsumeInput(conn2);
+ while ((notify = PQnotifies(conn2)) != NULL)
+ PQfreemem(notify);
+
+ /* Start timer and send notification from conn1 */
+ start_time = pg_time_now();
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel2);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ /* Wait for notification on conn2 */
+ sock = PQsocket(conn2);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn2);
+ notify = PQnotifies(conn2);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* Send notification back from conn2 */
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel1);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Wait for notification on conn1 */
+ sock = PQsocket(conn1);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn1);
+ notify = PQnotifies(conn1);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* End timer */
+ end_time = pg_time_now();
+
+ /* Calculate individual round-trip latency */
+ latency = end_time - start_time;
+
+ /* Accumulate total latency and track maximum */
+ total_latency += latency;
+ if (latency > max_latency)
+ max_latency = latency;
+ }
+
+ /* Calculate and report round-trips per second and max latency */
+ fprintf(stdout, "%14d %19.1f %19" PRId64 "\n",
+ num_idle,
+ 1000000.0 * notify_round_trips / total_latency,
+ max_latency);
+ fflush(stdout);
+
+ /* Stop if we hit connection limit */
+ if (first_failure)
+ break;
+
+ /* Add idle listening connections */
+ for (i = 0; i < notify_idle_step && num_idle < max_idle; i++)
+ {
+ PGconn *idle_conn;
+ char idle_channel[256];
+
+ idle_conn = doConnect();
+ if (idle_conn == NULL)
+ {
+ if (!first_failure)
+ {
+ pg_log_info("reached max_connections at %d idle listeners", num_idle);
+ first_failure = true;
+ }
+ break;
+ }
+
+ /* Each idle connection listens on a unique channel */
+ snprintf(idle_channel, sizeof(idle_channel), "idle_%d", num_idle);
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", idle_channel);
+
+ res = PQexec(idle_conn, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ pg_log_warning("LISTEN failed on idle connection %d: %s",
+ num_idle, PQerrorMessage(idle_conn));
+ PQfinish(idle_conn);
+ PQclear(res);
+ first_failure = true;
+ break;
+ }
+ PQclear(res);
+
+ idle_conns[num_idle] = idle_conn;
+ num_idle++;
+ }
+
+ /* Stop if we couldn't add any connections */
+ if (first_failure && i == 0)
+ break;
+ }
+
+ /* Clean up */
+ pg_log_info("cleaning up connections");
+ PQfinish(conn1);
+ PQfinish(conn2);
+ for (int i = 0; i < num_idle; i++)
+ {
+ if (idle_conns[i])
+ PQfinish(idle_conns[i]);
+ }
+ pg_free(idle_conns);
+
+ pg_log_info("LISTEN/NOTIFY benchmark completed");
+}
+
int
main(int argc, char **argv)
{
@@ -6739,6 +6959,9 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"listen-notify-benchmark", no_argument, NULL, 18},
+ {"notify-round-trips", required_argument, NULL, 19},
+ {"notify-idle-step", required_argument, NULL, 20},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7315,22 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* listen-notify-benchmark */
+ listen_notify_mode = true;
+ benchmarking_option_set = true;
+ break;
+ case 19: /* notify-round-trips */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-round-trips", 1, INT_MAX,
+ &notify_round_trips))
+ exit(1);
+ break;
+ case 20: /* notify-idle-step */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-idle-step", 1, INT_MAX,
+ &notify_idle_step))
+ exit(1);
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7210,6 +7449,20 @@ main(int argc, char **argv)
pg_fatal("some of the specified options cannot be used in benchmarking mode");
}
+ /* Handle LISTEN/NOTIFY benchmark mode */
+ if (listen_notify_mode)
+ {
+ /* Establish a database connection for setup */
+ if ((con = doConnect()) == NULL)
+ pg_fatal("could not connect to database");
+
+ /* Run the LISTEN/NOTIFY benchmark */
+ runListenNotifyBenchmark();
+
+ PQfinish(con);
+ exit(0);
+ }
+
if (nxacts > 0 && duration > 0)
pg_fatal("specify either a number of transactions (-t) or a duration (-T), not both");
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-11 07:43 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-11 07:43 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; pgsql-hackers
On Sat, Oct 11, 2025, at 08:43, Joel Jacobson wrote:
> In addition to the previously suggested optimization, there is another major
> one that seems doable and would be a great improvement for workloads
> with large traffic differences between channels, i.e. some low-traffic
> and some high-traffic.
>
> I'm not entirely sure this approach is correct though; I might have
> misunderstood the guarantees of the heavyweight lock. My assumption is
> that only one backend at a time can be running
> the code in PreCommit_Notify after having acquired the heavyweight lock.
> If this is not true, then it doesn't work. What worries me is the
> exclusive lock we also take inside the same function. I don't see the
> point of it since we're already holding the heavyweight lock, but maybe
> this is just to "allow deadlocks to be detected", like the comment says?
...
> * 0003-optimize_listen_notify-v14.patch:
>
> Optimize LISTEN/NOTIFY by advancing idle backends directly
>
> Building on the previous channel-specific listener tracking
> optimization, this patch further reduces context switching by detecting
> idle listening backends that don't listen to any of the channels being
> notified and advancing their queue positions directly without waking
> them up.
...
> 0003-optimize_listen_notify-v14.patch:
>
> idle_listeners round_trips_per_sec max_latency_usec
> 0 33236.8 1090
> 10 34681.0 1338
> 20 34530.4 1372
> 30 34061.6 1339
> 40 33084.5 913
> 50 33847.5 955
> 60 33675.8 1239
> 70 28857.4 20443
> 80 33324.9 786
> 90 33612.3 758
> 100 31259.2 7706
I noticed the strange data point at idle_listeners=70.
This made me think about the "wake tail only" trick,
and I realized it is now unnecessary with the new 0003 idea.
A new version is attached that removes that part from the 0003 patch.
This of course also improved the stability of max_latency_usec,
since in this specific benchmark all other listeners are always idle,
so they never need to be woken up:
idle_listeners round_trips_per_sec max_latency_usec
0 33631.8 546
10 34318.0 586
20 34813.0 596
30 35073.4 574
40 34646.1 569
50 33755.5 634
60 33991.6 561
70 34049.0 550
80 33886.0 541
90 33545.0 540
100 33163.1 660
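To make the 0003 idea concrete, here is a minimal standalone sketch of the
direct-advance step. It is not the patch code: QueuePos, Listener and
advance_idle_listeners() are hypothetical, simplified stand-ins for the
shared-memory state and the QUEUE_BACKEND_* macros in async.c, and the sketch
ignores locking entirely (in the patch the loop runs while NotifyQueueLock is
held exclusively).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified queue position (page + offset within page). */
typedef struct QueuePos
{
	int			page;
	int			offset;
} QueuePos;

/* Hypothetical per-listener state, loosely modeled on the backend array. */
typedef struct Listener
{
	QueuePos	pos;			/* how far this listener has read */
	bool		wakeup_pending;	/* already chosen to be signaled? */
	int			dboid;			/* database the listener is connected to */
} Listener;

static bool
pos_equal(QueuePos a, QueuePos b)
{
	return a.page == b.page && a.offset == b.offset;
}

/*
 * Advance every listener in our database that sits exactly at old_head and
 * was not marked for wakeup.  Such a listener has nothing to read before
 * old_head and is not interested in anything we wrote between old_head and
 * new_head, so moving its position is equivalent to waking it up and letting
 * it skip those entries itself.  Returns the number of listeners advanced.
 */
static int
advance_idle_listeners(Listener *listeners, int nlisteners, int my_dboid,
					   QueuePos old_head, QueuePos new_head)
{
	int			advanced = 0;

	assert(listeners != NULL || nlisteners == 0);

	for (int i = 0; i < nlisteners; i++)
	{
		if (listeners[i].wakeup_pending)
			continue;			/* interested listener, will be signaled */
		if (listeners[i].dboid != my_dboid)
			continue;			/* other database, leave alone */
		if (!pos_equal(listeners[i].pos, old_head))
			continue;			/* lagging, must still read older entries */

		listeners[i].pos = new_head;
		advanced++;
	}
	return advanced;
}
```

With four listeners (one marked for wakeup, one idle at the old head, one
lagging, one in another database), only the idle one at the old head is moved
to the new head; the other three keep their positions.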
/Joel
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..b462dcc8348 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -35,6 +35,7 @@
#include <ctype.h>
#include <float.h>
+#include <inttypes.h>
#include <limits.h>
#include <math.h>
#include <signal.h>
@@ -237,6 +238,11 @@ static const char *const PARTITION_METHOD[] = {"none", "range", "hash"};
/* random seed used to initialize base_random_sequence */
static int64 random_seed = -1;
+/* LISTEN/NOTIFY benchmark mode parameters */
+static bool listen_notify_mode = false; /* enable LISTEN/NOTIFY benchmark */
+static int notify_round_trips = 100; /* number of round-trips per iteration */
+static int notify_idle_step = 10; /* idle listeners to add per iteration */
+
/*
* end of configurable parameters
*********************************************************************/
@@ -930,6 +936,10 @@ usage(void)
" (same as \"-b simple-update\")\n"
" -S, --select-only perform SELECT-only transactions\n"
" (same as \"-b select-only\")\n"
+ " --listen-notify-benchmark\n"
+ " run LISTEN/NOTIFY round-trip benchmark\n"
+ " --notify-round-trips=NUM number of round-trips per iteration (default: 100)\n"
+ " --notify-idle-step=NUM idle listeners to add per iteration (default: 10)\n"
"\nBenchmarking options:\n"
" -c, --client=NUM number of concurrent database clients (default: 1)\n"
" -C, --connect establish new connection for each transaction\n"
@@ -6689,6 +6699,216 @@ set_random_seed(const char *seed)
return true;
}
+/*
+ * Run LISTEN/NOTIFY round-trip benchmark
+ *
+ * This benchmark measures the round-trip time between two processes that
+ * ping-pong NOTIFY messages while adding idle listening connections.
+ */
+static void
+runListenNotifyBenchmark(void)
+{
+ PGconn *conn1 = NULL;
+ PGconn *conn2 = NULL;
+ PGconn **idle_conns = NULL;
+ int num_idle = 0;
+ int max_idle = 10000; /* reasonable upper limit */
+ PGresult *res;
+ char channel1[] = "pgbench_channel_1";
+ char channel2[] = "pgbench_channel_2";
+ char notify_cmd[256];
+ bool first_failure = false;
+
+ pg_log_info("starting LISTEN/NOTIFY round-trip benchmark");
+ pg_log_info("round-trips per iteration: %d", notify_round_trips);
+ pg_log_info("idle listeners added per iteration: %d", notify_idle_step);
+ printf("\n%14s %19s %19s\n", "idle_listeners", "round_trips_per_sec", "max_latency_usec");
+
+ /* Allocate array for idle connections */
+ idle_conns = (PGconn **) pg_malloc0(max_idle * sizeof(PGconn *));
+
+ /* Create two active connections for ping-pong */
+ conn1 = doConnect();
+ if (conn1 == NULL)
+ pg_fatal("failed to create connection 1");
+
+ conn2 = doConnect();
+ if (conn2 == NULL)
+ pg_fatal("failed to create connection 2");
+
+ /* Set up LISTEN on both connections */
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel1);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 1: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel2);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 2: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Main benchmark loop: measure round-trips then add idle connections */
+ while (num_idle < max_idle)
+ {
+ int i;
+ int64 total_latency = 0;
+ int64 max_latency = 0;
+
+ /* Perform round-trip measurements */
+ for (i = 0; i < notify_round_trips; i++)
+ {
+ pg_time_usec_t start_time,
+ end_time;
+ int64 latency;
+ PGnotify *notify;
+ int sock;
+ fd_set input_mask;
+ struct timeval tv;
+
+ /* Clear any pending notifications */
+ PQconsumeInput(conn1);
+ while ((notify = PQnotifies(conn1)) != NULL)
+ PQfreemem(notify);
+ PQconsumeInput(conn2);
+ while ((notify = PQnotifies(conn2)) != NULL)
+ PQfreemem(notify);
+
+ /* Start timer and send notification from conn1 */
+ start_time = pg_time_now();
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel2);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ /* Wait for notification on conn2 */
+ sock = PQsocket(conn2);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn2);
+ notify = PQnotifies(conn2);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* Send notification back from conn2 */
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel1);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Wait for notification on conn1 */
+ sock = PQsocket(conn1);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn1);
+ notify = PQnotifies(conn1);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* End timer */
+ end_time = pg_time_now();
+
+ /* Calculate individual round-trip latency */
+ latency = end_time - start_time;
+
+ /* Accumulate total latency and track maximum */
+ total_latency += latency;
+ if (latency > max_latency)
+ max_latency = latency;
+ }
+
+ /* Calculate and report round-trips per second and max latency */
+ fprintf(stdout, "%14d %19.1f %19" PRId64 "\n",
+ num_idle,
+ 1000000.0 * notify_round_trips / total_latency,
+ max_latency);
+ fflush(stdout);
+
+ /* Stop if we hit connection limit */
+ if (first_failure)
+ break;
+
+ /* Add idle listening connections */
+ for (i = 0; i < notify_idle_step && num_idle < max_idle; i++)
+ {
+ PGconn *idle_conn;
+ char idle_channel[256];
+
+ idle_conn = doConnect();
+ if (idle_conn == NULL)
+ {
+ if (!first_failure)
+ {
+ pg_log_info("reached max_connections at %d idle listeners", num_idle);
+ first_failure = true;
+ }
+ break;
+ }
+
+ /* Each idle connection listens on a unique channel */
+ snprintf(idle_channel, sizeof(idle_channel), "idle_%d", num_idle);
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", idle_channel);
+
+ res = PQexec(idle_conn, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ pg_log_warning("LISTEN failed on idle connection %d: %s",
+ num_idle, PQerrorMessage(idle_conn));
+ PQfinish(idle_conn);
+ PQclear(res);
+ first_failure = true;
+ break;
+ }
+ PQclear(res);
+
+ idle_conns[num_idle] = idle_conn;
+ num_idle++;
+ }
+
+ /* Stop if we couldn't add any connections */
+ if (first_failure && i == 0)
+ break;
+ }
+
+ /* Clean up */
+ pg_log_info("cleaning up connections");
+ PQfinish(conn1);
+ PQfinish(conn2);
+ for (int i = 0; i < num_idle; i++)
+ {
+ if (idle_conns[i])
+ PQfinish(idle_conns[i]);
+ }
+ pg_free(idle_conns);
+
+ pg_log_info("LISTEN/NOTIFY benchmark completed");
+}
+
int
main(int argc, char **argv)
{
@@ -6739,6 +6959,9 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"listen-notify-benchmark", no_argument, NULL, 18},
+ {"notify-round-trips", required_argument, NULL, 19},
+ {"notify-idle-step", required_argument, NULL, 20},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7315,22 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* listen-notify-benchmark */
+ listen_notify_mode = true;
+ benchmarking_option_set = true;
+ break;
+ case 19: /* notify-round-trips */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-round-trips", 1, INT_MAX,
+ &notify_round_trips))
+ exit(1);
+ break;
+ case 20: /* notify-idle-step */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-idle-step", 1, INT_MAX,
+ &notify_idle_step))
+ exit(1);
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7210,6 +7449,20 @@ main(int argc, char **argv)
pg_fatal("some of the specified options cannot be used in benchmarking mode");
}
+ /* Handle LISTEN/NOTIFY benchmark mode */
+ if (listen_notify_mode)
+ {
+ /* Establish a database connection for setup */
+ if ((con = doConnect()) == NULL)
+ pg_fatal("could not connect to database");
+
+ /* Run the LISTEN/NOTIFY benchmark */
+ runListenNotifyBenchmark();
+
+ PQfinish(con);
+ exit(0);
+ }
+
if (nxacts > 0 && duration > 0)
pg_fatal("specify either a number of transactions (-t) or a duration (-T), not both");
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v15.patch (7.8K, 2-0001-optimize_listen_notify-v15.patch)
From 183c8a106705a6391cd68f406019253d36680da4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/3] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 103 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 52 ++++++++++
2 files changed, 154 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..9c19843d2d7 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 5 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..942b09d5735 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,26 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +94,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v15.patch (29.8K, 3-0002-optimize_listen_notify-v15.patch)
From b39c5b71f6d6b219ab06c6b731e2317f480edf5d Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 7 Oct 2025 20:56:47 +0200
Subject: [PATCH 2/3] Optimize LISTEN/NOTIFY with channel-specific listener
tracking
Currently, idle listening backends cause a dramatic slowdown due to
context switching when they are signaled and wake up. This is wasteful
when they are not listening to the channel being notified.
Just 10 extra idle listening connections cause a slowdown from 8700 TPS
to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra
it falls to 250 TPS.
This patch introduces targeted signaling for LISTEN/NOTIFY, improving
scalability in workloads with many idle listeners.
A dynamic shared hash table now tracks which backends listen on each
(database, channel) pair, which SignalBackends() uses to perform
targeted signaling. In addition, it staggers wakeups by signaling one
backend at the global tail to help it advance gradually, and forces any
excessively lagging backends to catch up. A per-backend wakeup_pending
flag avoids redundant signals.
---
src/backend/commands/async.c | 591 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 505 insertions(+), 90 deletions(-)
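As a rough illustration of the per-channel listener array that Exec_ListenCommit() and Exec_UnlistenCommit() maintain in this patch, here is a minimal standalone sketch. It is only an approximation under stated assumptions: it uses plain realloc/free where the patch allocates a new DSA array and copies into it, uses int in place of ProcNumber, and takes no locks. ListenerSet and the listener_set_* functions are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_LISTENERS 4		/* same starting size as the patch uses */

/* Hypothetical stand-in for the listener array of one channel entry. */
typedef struct ListenerSet
{
	int		   *procs;			/* listener ids (ProcNumber in the patch) */
	int			num;			/* listeners currently stored */
	int			allocated;		/* allocated capacity of procs */
} ListenerSet;

/* Returns 1 if added, 0 if already present, -1 on out-of-memory. */
static int
listener_set_add(ListenerSet *set, int proc)
{
	assert(set != NULL);

	for (int i = 0; i < set->num; i++)
		if (set->procs[i] == proc)
			return 0;			/* already registered */

	if (set->num >= set->allocated)
	{
		/* Start at INITIAL_LISTENERS, then double on each growth. */
		int			new_size = set->allocated ? set->allocated * 2
											   : INITIAL_LISTENERS;
		int		   *p = realloc(set->procs, sizeof(int) * (size_t) new_size);

		if (p == NULL)
			return -1;
		set->procs = p;
		set->allocated = new_size;
	}

	set->procs[set->num++] = proc;
	return 1;
}

/* Returns 1 if removed, 0 if proc was not a listener. */
static int
listener_set_remove(ListenerSet *set, int proc)
{
	assert(set != NULL);

	for (int i = 0; i < set->num; i++)
	{
		if (set->procs[i] == proc)
		{
			set->num--;
			/* Close the gap, preserving the order of the rest. */
			if (i < set->num)
				memmove(&set->procs[i], &set->procs[i + 1],
						sizeof(int) * (size_t) (set->num - i));
			if (set->num == 0)
			{
				/* Last listener gone: release the array. */
				free(set->procs);
				set->procs = NULL;
				set->allocated = 0;
			}
			return 1;
		}
	}
	return 0;
}
```

With this shape a notifier only has to walk the array for the channels it notified, so the signaling cost depends on the number of listeners for those channels and not on the total number of listening backends.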
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..bb5ebfab26d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,20 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +134,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +260,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +279,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +322,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +415,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -312,17 +427,11 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
-/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
- */
-static List *listenChannels = NIL; /* list of C strings */
-
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change the shared channelHash until we reach transaction
+ * commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -418,6 +527,9 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/* Count of channels we're currently listening on */
+static int numChannelsListeningOn = 0;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +569,8 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +635,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -683,7 +801,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the shared channelHash happens during transaction
* commit.
*/
static void
@@ -782,24 +900,60 @@ Async_UnlistenAll(void)
/*
* SQL function: return a set of the channel names this backend is actively
* listening to.
- *
- * Note: this coding relies on the fact that the listenChannels list cannot
- * change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* get channels from channelHash and store in function context */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ listenChannels = NIL;
+
+ if (channelHash != NULL)
+ {
+ dshash_seq_init(&status, channelHash, false);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ listenChannels = lappend(listenChannels, pstrdup(entry->key.channel));
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ }
+
+ funcctx->user_fctx = listenChannels;
+ MemoryContextSwitchTo(oldcontext);
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ listenChannels = (List *) funcctx->user_fctx;
if (funcctx->call_cntr < list_length(listenChannels))
{
@@ -957,7 +1111,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update channelHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1156,7 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/*
@@ -1130,55 +1284,131 @@ Exec_ListenPreCommit(void)
/*
* Exec_ListenCommit --- subroutine for AtCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Add the channel to the shared channelHash.
*/
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
- return;
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
/*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+ numChannelsListeningOn++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from channelHash.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ numChannelsListeningOn--;
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,33 +1423,82 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ numChannelsListeningOn = 0;
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
- foreach(p, listenChannels)
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channelHash, &key, false);
+ if (entry == NULL)
+ return false; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
return true;
+ }
}
+
+ dshash_release_lock(channelHash, entry);
return false;
}
@@ -1230,7 +1509,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(numChannelsListeningOn == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1565,12 +1844,16 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends registered as listeners for channels
+ * with pending notifications. However, when there is no traffic on some
+ * channels, listeners on such channels will fall further and further
+ * behind. Waken them if they are too far behind, so that they'll
+ * advance their queue position pointers, allowing the global tail to
+ * advance.
+ *
+ * To stagger wakeups of lagging backends, wake the backend furthest
+ * behind (at the tail), amortizing the context-switching cost across
+ * successive notifications instead of paying it all at once.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1866,9 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1596,37 +1882,109 @@ SignalBackends(void)
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ bool tail_woken = false;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ /* Signal one backend positioned at the global tail */
+ if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
+ QUEUE_POS_PAGE(pos)) == 0)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ tail_woken = true;
+ continue;
+ }
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2031,9 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * channelHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/* And clean up */
@@ -1865,6 +2223,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2186,7 +2545,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (numChannelsListeningOn == 0)
return;
if (Trace_notify)
@@ -2395,3 +2754,55 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
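
The listener-array handling in Exec_ListenCommit and Exec_UnlistenCommit
above can be sketched outside the backend roughly as follows. This is a
minimal sketch under stated assumptions, not the patch's code: it uses
malloc/free in place of dsa_allocate/dsa_free, omits the dshash entry and
its locking, and the names ListenerArray, listener_add, listener_remove,
and the initial capacity of 4 are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's ProcNumber. */
typedef int ProcNumberSketch;

#define INITIAL_CAPACITY 4		/* illustrative value, not taken from the patch */

typedef struct ListenerArray
{
	ProcNumberSketch *procs;	/* heap array; the patch uses a DSA allocation */
	int			count;			/* entries in use */
	int			capacity;		/* entries allocated */
} ListenerArray;

/* Add proc unless present. Returns 1 if added, 0 if duplicate, -1 on OOM. */
static int
listener_add(ListenerArray *a, ProcNumberSketch proc)
{
	for (int i = 0; i < a->count; i++)
		if (a->procs[i] == proc)
			return 0;			/* already registered */

	if (a->count >= a->capacity)
	{
		/* Grow by doubling: allocate, copy, free the old array. */
		int			new_cap = (a->capacity == 0) ? INITIAL_CAPACITY : a->capacity * 2;
		ProcNumberSketch *grown = malloc(sizeof(ProcNumberSketch) * new_cap);

		if (grown == NULL)
			return -1;
		if (a->count > 0)
			memcpy(grown, a->procs, sizeof(ProcNumberSketch) * a->count);
		free(a->procs);
		a->procs = grown;
		a->capacity = new_cap;
	}
	a->procs[a->count++] = proc;
	return 1;
}

/* Remove proc, keeping the remaining order. Returns 1 if removed, else 0. */
static int
listener_remove(ListenerArray *a, ProcNumberSketch proc)
{
	for (int i = 0; i < a->count; i++)
	{
		if (a->procs[i] != proc)
			continue;
		a->count--;
		if (i < a->count)
			memmove(&a->procs[i], &a->procs[i + 1],
					sizeof(ProcNumberSketch) * (a->count - i));
		if (a->count == 0)
		{
			/* Last listener: release the array (the patch also drops the entry). */
			free(a->procs);
			a->procs = NULL;
			a->capacity = 0;
		}
		return 1;
	}
	return 0;
}
```

Doubling keeps the amortized cost of adding a listener constant, while
removal is linear in the number of listeners on that one channel.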
[application/octet-stream] 0003-optimize_listen_notify-v15.patch (5.9K, 4-0003-optimize_listen_notify-v15.patch)
From 3fe37ec554905d69f71a05e9dec26d5b3ac7fd23 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 11 Oct 2025 07:28:57 +0200
Subject: [PATCH 3/3] Optimize LISTEN/NOTIFY by advancing idle backends
directly
Building on the previous channel-specific listener tracking
optimization, this patch further reduces context switching by detecting
idle listening backends that don't listen to any of the channels being
notified and advancing their queue positions directly without waking
them up.
When a backend commits notifications, it now saves both the queue head
position before and after writing. In SignalBackends(), backends that
are at the old queue head and weren't marked for wakeup (meaning they
don't listen to any of the notified channels) are advanced directly to
the new queue head. This eliminates unnecessary wakeups for these
backends, which would otherwise wake up, scan through all the
notifications, skip each one, and advance to the same position anyway.
The implementation carefully handles the race condition where other
backends may write notifications after the heavyweight lock is released
but before SignalBackends() is called. By saving queueHeadAfterWrite
immediately after writing (before releasing the lock), we ensure
backends are only advanced over the exact notifications we wrote, not
notifications from other concurrent backends.
---
src/backend/commands/async.c | 79 ++++++++++++++++++++++++++++--------
1 file changed, 62 insertions(+), 17 deletions(-)
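
The direct-advancement rule described in the commit message above can be
illustrated with a small standalone sketch. It is only an approximation
under stated assumptions: QueuePos, Listener, pos_equal, and
advance_idle_listeners are hypothetical names rather than identifiers
from async.c, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified queue position: page number plus byte offset. */
typedef struct QueuePos
{
	long		page;
	int			offset;
} QueuePos;

/* Hypothetical per-listener state, loosely modelled on the queue backend slots. */
typedef struct Listener
{
	int			dboid;			/* database the listener is connected to */
	QueuePos	pos;			/* how far it has read in the queue */
	bool		wakeup_pending; /* already chosen for a signal */
} Listener;

static bool
pos_equal(QueuePos a, QueuePos b)
{
	return a.page == b.page && a.offset == b.offset;
}

/*
 * Advance listeners that sit exactly at old_head, are in our database, and
 * were not marked for wakeup. Such listeners are not interested in anything
 * written between old_head and new_head, so they can skip it unseen.
 * Returns the number of listeners advanced.
 */
static int
advance_idle_listeners(Listener *ls, int n, int my_dboid,
					   QueuePos old_head, QueuePos new_head)
{
	int			advanced = 0;

	for (int i = 0; i < n; i++)
	{
		if (ls[i].wakeup_pending)
			continue;			/* will be signaled and read normally */
		if (ls[i].dboid != my_dboid)
			continue;			/* other databases are left alone */
		if (!pos_equal(ls[i].pos, old_head))
			continue;			/* still has older entries to read */
		ls[i].pos = new_head;
		advanced++;
	}
	return advanced;
}
```

A listener that is behind old_head is deliberately left untouched in this
sketch, because entries it has not yet read might still be addressed to
it.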
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index bb5ebfab26d..5570f73dd13 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -500,6 +500,8 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ QueuePosition queueHeadBeforeWrite; /* QUEUE_HEAD before writing notifies */
+ QueuePosition queueHeadAfterWrite; /* QUEUE_HEAD after writing notifies */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -1048,6 +1050,7 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -1076,6 +1079,9 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(pendingNotifies->queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -1093,6 +1099,19 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ pendingNotifies->queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
@@ -1102,6 +1121,18 @@ PreCommit_Notify(void)
LWLockRelease(NotifyQueueLock);
}
+ /*
+ * Save the queue head after writing all our notifications. This is
+ * used by SignalBackends() to know where to advance idle backends to.
+ * We must save this now because other backends may write their own
+ * notifications after we release the heavyweight lock but before we
+ * call SignalBackends(), and we must not advance backends over those
+ * other notifications.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ pendingNotifies->queueHeadAfterWrite = QUEUE_HEAD;
+ LWLockRelease(NotifyQueueLock);
+
/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
}
}
@@ -1934,14 +1965,43 @@ SignalBackends(void)
dshash_release_lock(channelHash, entry);
}
+ /*
+ * Avoid needing to wake listening backends that are at the old queue head
+ * (before we wrote our notifications) that we know are not interested in
+ * our notifications, since otherwise they would have been marked for
+ * wakeup by now. Do this by advancing them directly to the new queue
+ * head.
+ */
+ if (pendingNotifies != NULL)
+ {
+ QueuePosition oldHead = pendingNotifies->queueHeadBeforeWrite;
+ QueuePosition newHead = pendingNotifies->queueHeadAfterWrite;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
+ {
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, oldHead) &&
+ QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ {
+ QUEUE_BACKEND_POS(i) = newHead;
+ }
+ }
+ }
+
queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(QUEUE_TAIL));
/* Check for lagging backends when the queue spans multiple pages */
if (queue_length > 0)
{
- bool tail_woken = false;
-
for (ProcNumber i = QUEUE_FIRST_LISTENER;
i != INVALID_PROC_NUMBER;
i = QUEUE_NEXT_LISTENER(i))
@@ -1955,21 +2015,6 @@ SignalBackends(void)
pos = QUEUE_BACKEND_POS(i);
- /* Signal one backend positioned at the global tail */
- if (!tail_woken && asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_TAIL),
- QUEUE_POS_PAGE(pos)) == 0)
- {
- pid = QUEUE_BACKEND_PID(i);
- Assert(pid != InvalidPid);
-
- QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
- pids[count] = pid;
- procnos[count] = i;
- count++;
- tail_woken = true;
- continue;
- }
-
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
--
2.50.1
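
With the tail-wakeup branch removed by this patch, the remaining fallback
in SignalBackends() is the lag check. A minimal sketch of that test,
assuming a plain 64-bit page difference and a cleanup delay of 4 pages
(both are assumptions for illustration; page_diff, needs_lag_wakeup, and
CLEANUP_DELAY_PAGES are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLEANUP_DELAY_PAGES 4	/* assumed threshold, for illustration only */

/* Simplified page distance; ignores any wraparound handling. */
static int64_t
page_diff(int64_t p, int64_t q)
{
	return p - q;
}

/*
 * Decide whether a listener should be woken purely because it lags.
 * Listeners already marked for wakeup are skipped so that each one is
 * signaled at most once.
 */
static bool
needs_lag_wakeup(int64_t head_page, int64_t backend_page, bool wakeup_pending)
{
	if (wakeup_pending)
		return false;
	return page_diff(head_page, backend_page) >= CLEANUP_DELAY_PAGES;
}
```

Waking only sufficiently lagging listeners lets the global tail advance
without signaling every idle backend on each NOTIFY.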
[text/plain] pgbench-listen-notify-benchmark-patch.txt (9.3K, 5-pgbench-listen-notify-benchmark-patch.txt)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..b462dcc8348 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -35,6 +35,7 @@
#include <ctype.h>
#include <float.h>
+#include <inttypes.h>
#include <limits.h>
#include <math.h>
#include <signal.h>
@@ -237,6 +238,11 @@ static const char *const PARTITION_METHOD[] = {"none", "range", "hash"};
/* random seed used to initialize base_random_sequence */
static int64 random_seed = -1;
+/* LISTEN/NOTIFY benchmark mode parameters */
+static bool listen_notify_mode = false; /* enable LISTEN/NOTIFY benchmark */
+static int notify_round_trips = 100; /* number of round-trips per iteration */
+static int notify_idle_step = 10; /* idle listeners to add per iteration */
+
/*
* end of configurable parameters
*********************************************************************/
@@ -930,6 +936,10 @@ usage(void)
" (same as \"-b simple-update\")\n"
" -S, --select-only perform SELECT-only transactions\n"
" (same as \"-b select-only\")\n"
+ " --listen-notify-benchmark\n"
+ " run LISTEN/NOTIFY round-trip benchmark\n"
+ " --notify-round-trips=NUM number of round-trips per iteration (default: 100)\n"
+ " --notify-idle-step=NUM idle listeners to add per iteration (default: 10)\n"
"\nBenchmarking options:\n"
" -c, --client=NUM number of concurrent database clients (default: 1)\n"
" -C, --connect establish new connection for each transaction\n"
@@ -6689,6 +6699,216 @@ set_random_seed(const char *seed)
return true;
}
+/*
+ * Run LISTEN/NOTIFY round-trip benchmark
+ *
+ * This benchmark measures the round-trip time between two processes that
+ * ping-pong NOTIFY messages while adding idle listening connections.
+ */
+static void
+runListenNotifyBenchmark(void)
+{
+ PGconn *conn1 = NULL;
+ PGconn *conn2 = NULL;
+ PGconn **idle_conns = NULL;
+ int num_idle = 0;
+ int max_idle = 10000; /* reasonable upper limit */
+ PGresult *res;
+ char channel1[] = "pgbench_channel_1";
+ char channel2[] = "pgbench_channel_2";
+ char notify_cmd[256];
+ bool first_failure = false;
+
+ pg_log_info("starting LISTEN/NOTIFY round-trip benchmark");
+ pg_log_info("round-trips per iteration: %d", notify_round_trips);
+ pg_log_info("idle listeners added per iteration: %d", notify_idle_step);
+ printf("\n%14s %19s %19s\n", "idle_listeners", "round_trips_per_sec", "max_latency_usec");
+
+ /* Allocate array for idle connections */
+ idle_conns = (PGconn **) pg_malloc0(max_idle * sizeof(PGconn *));
+
+ /* Create two active connections for ping-pong */
+ conn1 = doConnect();
+ if (conn1 == NULL)
+ pg_fatal("failed to create connection 1");
+
+ conn2 = doConnect();
+ if (conn2 == NULL)
+ pg_fatal("failed to create connection 2");
+
+ /* Set up LISTEN on both connections */
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel1);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 1: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel2);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 2: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Main benchmark loop: measure round-trips then add idle connections */
+ while (num_idle < max_idle)
+ {
+ int i;
+ int64 total_latency = 0;
+ int64 max_latency = 0;
+
+ /* Perform round-trip measurements */
+ for (i = 0; i < notify_round_trips; i++)
+ {
+ pg_time_usec_t start_time,
+ end_time;
+ int64 latency;
+ PGnotify *notify;
+ int sock;
+ fd_set input_mask;
+ struct timeval tv;
+
+ /* Clear any pending notifications */
+ PQconsumeInput(conn1);
+ while ((notify = PQnotifies(conn1)) != NULL)
+ PQfreemem(notify);
+ PQconsumeInput(conn2);
+ while ((notify = PQnotifies(conn2)) != NULL)
+ PQfreemem(notify);
+
+ /* Start timer and send notification from conn1 */
+ start_time = pg_time_now();
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel2);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ /* Wait for notification on conn2 */
+ sock = PQsocket(conn2);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn2);
+ notify = PQnotifies(conn2);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* Send notification back from conn2 */
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel1);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Wait for notification on conn1 */
+ sock = PQsocket(conn1);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn1);
+ notify = PQnotifies(conn1);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* End timer */
+ end_time = pg_time_now();
+
+ /* Calculate individual round-trip latency */
+ latency = end_time - start_time;
+
+ /* Accumulate total latency and track maximum */
+ total_latency += latency;
+ if (latency > max_latency)
+ max_latency = latency;
+ }
+
+ /* Calculate and report round-trips per second and max latency */
+ fprintf(stdout, "%14d %19.1f %19" PRId64 "\n",
+ num_idle,
+ 1000000.0 * notify_round_trips / total_latency,
+ max_latency);
+ fflush(stdout);
+
+ /* Stop if we hit connection limit */
+ if (first_failure)
+ break;
+
+ /* Add idle listening connections */
+ for (i = 0; i < notify_idle_step && num_idle < max_idle; i++)
+ {
+ PGconn *idle_conn;
+ char idle_channel[256];
+
+ idle_conn = doConnect();
+ if (idle_conn == NULL)
+ {
+ if (!first_failure)
+ {
+ pg_log_info("reached max_connections at %d idle listeners", num_idle);
+ first_failure = true;
+ }
+ break;
+ }
+
+ /* Each idle connection listens on a unique channel */
+ snprintf(idle_channel, sizeof(idle_channel), "idle_%d", num_idle);
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", idle_channel);
+
+ res = PQexec(idle_conn, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ pg_log_warning("LISTEN failed on idle connection %d: %s",
+ num_idle, PQerrorMessage(idle_conn));
+ PQfinish(idle_conn);
+ PQclear(res);
+ first_failure = true;
+ break;
+ }
+ PQclear(res);
+
+ idle_conns[num_idle] = idle_conn;
+ num_idle++;
+ }
+
+ /* Stop if we couldn't add any connections */
+ if (first_failure && i == 0)
+ break;
+ }
+
+ /* Clean up */
+ pg_log_info("cleaning up connections");
+ PQfinish(conn1);
+ PQfinish(conn2);
+ for (int i = 0; i < num_idle; i++)
+ {
+ if (idle_conns[i])
+ PQfinish(idle_conns[i]);
+ }
+ pg_free(idle_conns);
+
+ pg_log_info("LISTEN/NOTIFY benchmark completed");
+}
+
int
main(int argc, char **argv)
{
@@ -6739,6 +6959,9 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"listen-notify-benchmark", no_argument, NULL, 18},
+ {"notify-round-trips", required_argument, NULL, 19},
+ {"notify-idle-step", required_argument, NULL, 20},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7315,22 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* listen-notify-benchmark */
+ listen_notify_mode = true;
+ benchmarking_option_set = true;
+ break;
+ case 19: /* notify-round-trips */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-round-trips", 1, INT_MAX,
+ &notify_round_trips))
+ exit(1);
+ break;
+ case 20: /* notify-idle-step */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-idle-step", 1, INT_MAX,
+ &notify_idle_step))
+ exit(1);
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7210,6 +7449,20 @@ main(int argc, char **argv)
pg_fatal("some of the specified options cannot be used in benchmarking mode");
}
+ /* Handle LISTEN/NOTIFY benchmark mode */
+ if (listen_notify_mode)
+ {
+ /* Establish a database connection for setup */
+ if ((con = doConnect()) == NULL)
+ pg_fatal("could not connect to database");
+
+ /* Run the LISTEN/NOTIFY benchmark */
+ runListenNotifyBenchmark();
+
+ PQfinish(con);
+ exit(0);
+ }
+
if (nxacts > 0 && duration > 0)
pg_fatal("specify either a number of transactions (-t) or a duration (-T), not both");
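
The wait loops in runListenNotifyBenchmark above call select() with a
10-second timeout, treat any negative return as fatal, and simply retry
when the timeout expires. A variant that distinguishes the three outcomes
and retries on EINTR could look like the sketch below. It is a hedged
illustration using only POSIX calls; wait_readable is a hypothetical
helper, not part of pgbench or libpq.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait until fd is readable or timeout_sec seconds pass.
 * Returns 1 if readable, 0 on timeout, -1 on a real error.
 */
static int
wait_readable(int fd, int timeout_sec)
{
	for (;;)
	{
		fd_set		input_mask;
		struct timeval tv;
		int			rc;

		/* select() may modify both arguments, so rebuild them each time. */
		FD_ZERO(&input_mask);
		FD_SET(fd, &input_mask);
		tv.tv_sec = timeout_sec;
		tv.tv_usec = 0;

		rc = select(fd + 1, &input_mask, NULL, NULL, &tv);
		if (rc < 0 && errno == EINTR)
			continue;			/* interrupted by a signal: try again */
		if (rc < 0)
			return -1;
		return (rc > 0) ? 1 : 0;
	}
}
```

In the benchmark, a caller could use the 0 return to report a missing
notification instead of waiting indefinitely; whether that is desirable
for a benchmark tool is a judgment call.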
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-14 16:40 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Joel Jacobson @ 2025-10-14 16:40 UTC
To: pgsql-hackers
On Sat, Oct 11, 2025, at 09:43, Joel Jacobson wrote:
> On Sat, Oct 11, 2025, at 08:43, Joel Jacobson wrote:
>> In addition to previously suggested optimization, there is another major
...
>> I'm not entirely sure this approach is correct though
Having investigated this, the "direct advancement" approach seems
correct to me.
(I understand that PreCommit_Notify still needs to take NotifyQueueLock
exclusively, since it modifies QUEUE_HEAD: other operations that don't
acquire the heavyweight lock take NotifyQueueLock in shared or exclusive
mode to read or modify QUEUE_HEAD.)
Given all the experiments since my earlier message, here is a fresh,
self-contained write-up:
This series has two patches:
* 0001-optimize_listen_notify-v16.patch:
Improve test coverage of async.c. Adds isolation specs covering
previously untested paths (subxact LISTEN reparenting/merge/abort,
simple NOTIFY reparenting, notification_match dedup, and an array-growth
case used by the follow-on patch).
* 0002-optimize_listen_notify-v16.patch:
Optimize LISTEN/NOTIFY by maintaining a shared channel map and using
direct advancement to avoid useless wakeups.
Problem
-------
Today SignalBackends wakes all listeners in the same database, with no
knowledge of which backends listen on which channels. When backends
listen on different channels, each NOTIFY causes unnecessary wakeups and
context switches, which can become the bottleneck in workloads with many
listeners on distinct channels.
Overview of the solution (patch 0002)
-------------------------------------
* Introduce a lazily-created DSA+dshash map (dboid, channel) ->
[ProcNumber] (channelHash). AtCommit_Notify maintains it for
LISTEN/UNLISTEN, and SignalBackends consults it to signal only
listeners on the channels notified within the transaction.
* Add a per-backend wakeupPending flag to suppress duplicate signals.
* Direct advancement: while queuing, PreCommit_Notify records the queue
head before and after our writes. Writers are globally serialized, so
the interval [oldHead, newHead) contains only our entries.
SignalBackends advances any backend still at oldHead directly to
newHead, avoiding a pointless wakeup.
* Keep the queue healthy by signaling backends that have fallen too far
behind (lag >= QUEUE_CLEANUP_DELAY) so the global tail can advance.
* pg_listening_channels and IsListeningOn now read from channelHash.
* Add LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash.
No user-visible semantic changes are intended; this is an internal
performance improvement.
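
The first two points of the overview (channel map plus wakeupPending) can
be sketched in isolation as follows. This is a toy model under stated
assumptions: it uses a linear array lookup instead of DSA+dshash, keys on
the channel name only (the real key also includes dboid), and
ChannelSketch, find_channel, and collect_targets are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LISTENERS_PER_CHANNEL 8 /* toy limit; the patch grows arrays */
#define MAX_BACKENDS 16			/* toy bound on backend numbers */

typedef struct ChannelSketch
{
	const char *name;
	int			listeners[MAX_LISTENERS_PER_CHANNEL];	/* backend numbers */
	int			nlisteners;
} ChannelSketch;

/* Linear lookup standing in for the shared hash table. */
static const ChannelSketch *
find_channel(const ChannelSketch *map, int nmap, const char *name)
{
	for (int i = 0; i < nmap; i++)
		if (strcmp(map[i].name, name) == 0)
			return &map[i];
	return NULL;
}

/*
 * Collect the backends to signal for the notified channels. A backend with
 * wakeup_pending already set is skipped, so it is signaled at most once
 * even if it listens on several of the notified channels. Both
 * wakeup_pending and targets must have room for MAX_BACKENDS entries.
 * Returns the number of targets written.
 */
static int
collect_targets(const ChannelSketch *map, int nmap,
				const char *const *notified, int nnotified,
				bool *wakeup_pending, int *targets)
{
	int			count = 0;

	for (int c = 0; c < nnotified; c++)
	{
		const ChannelSketch *ch = find_channel(map, nmap, notified[c]);

		if (ch == NULL)
			continue;			/* nobody listens on this channel */
		for (int j = 0; j < ch->nlisteners; j++)
		{
			int			proc = ch->listeners[j];

			if (wakeup_pending[proc])
				continue;
			wakeup_pending[proc] = true;
			targets[count++] = proc;
		}
	}
	return count;
}
```

In the patch the flag is cleared again when the signaled backend starts
reading the queue; the sketch leaves that step to the caller.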
Benchmark
---------
The numbers below come from a patched pgbench (it adds
--listen-notify-benchmark; the patch is attached as .txt to avoid
confusing cfbot). Each iteration performs 10 000 round trips and then
adds 100 idle listeners.
master (HEAD):
% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=100
idle_listeners round_trips_per_sec    max_latency_usec
             0             32123.7                 893
           100              1952.5                1465
           200               991.4                3438
           300               663.5                2454
           400               494.6                2950
           500               398.6                3394
           600               332.8                4272
           700               287.1                4692
           800               252.6                5208
           900               225.4                5614
          1000               202.5                6212
0002-optimize_listen_notify-v16.patch:
% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=100
idle_listeners round_trips_per_sec    max_latency_usec
             0             31832.6                1067
           100             32341.0                1035
           200             31562.5                1054
           300             30040.1                1057
           400             29287.1                1023
           500             28191.9                1201
           600             28166.5                1019
           700             26994.3                1094
           800             26501.0                1043
           900             25974.2                1005
          1000             25720.6                1008
Benchmarked on a MacBook Pro with an Apple M3 Max.
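
To put the two tables side by side: at 1000 idle listeners the patched
build sustains roughly 127 times the round-trip rate of master, and its
own rate drops by about a fifth between 0 and 1000 idle listeners. The
sketch below reproduces that arithmetic and the rate formula used by the
benchmark. The helper names round_trips_per_sec and ratio are
hypothetical; the figures come from the tables above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same expression as in runListenNotifyBenchmark: round trips divided by
 * the summed latency, converted from microseconds to seconds.
 */
static double
round_trips_per_sec(int round_trips, int64_t total_latency_usec)
{
	return 1000000.0 * round_trips / (double) total_latency_usec;
}

/* Plain quotient, used to compare two measured rates. */
static double
ratio(double a, double b)
{
	return a / b;
}
```

For example, ratio(25720.6, 202.5) compares patched against master at
1000 idle listeners, and ratio(31832.6, 25720.6) compares the patched
build at 0 and at 1000 idle listeners.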
Files
-----
* 0001-optimize_listen_notify-v16.patch - tests only.
* 0002-optimize_listen_notify-v16.patch - implementation.
* pgbench-listen-notify-benchmark-patch.txt - adds --listen-notify-benchmark.
Feedback and review much welcomed.
/Joel
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..3f47c50847d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -35,6 +35,7 @@
#include <ctype.h>
#include <float.h>
+#include <inttypes.h>
#include <limits.h>
#include <math.h>
#include <signal.h>
@@ -237,6 +238,11 @@ static const char *const PARTITION_METHOD[] = {"none", "range", "hash"};
/* random seed used to initialize base_random_sequence */
static int64 random_seed = -1;
+/* LISTEN/NOTIFY benchmark mode parameters */
+static bool listen_notify_mode = false; /* enable LISTEN/NOTIFY benchmark */
+static int notify_round_trips = 100; /* number of round-trips per iteration */
+static int notify_idle_step = 10; /* idle listeners to add per iteration */
+
/*
* end of configurable parameters
*********************************************************************/
@@ -930,6 +936,10 @@ usage(void)
" (same as \"-b simple-update\")\n"
" -S, --select-only perform SELECT-only transactions\n"
" (same as \"-b select-only\")\n"
+ " --listen-notify-benchmark\n"
+ " run LISTEN/NOTIFY round-trip benchmark\n"
+ " --notify-round-trips=NUM number of round-trips per iteration (default: 100)\n"
+ " --notify-idle-step=NUM idle listeners to add per iteration (default: 10)\n"
"\nBenchmarking options:\n"
" -c, --client=NUM number of concurrent database clients (default: 1)\n"
" -C, --connect establish new connection for each transaction\n"
@@ -6689,6 +6699,216 @@ set_random_seed(const char *seed)
return true;
}
+/*
+ * Run LISTEN/NOTIFY round-trip benchmark
+ *
+ * This benchmark measures the round-trip time between two processes that
+ * ping-pong NOTIFY messages while adding idle listening connections.
+ */
+static void
+runListenNotifyBenchmark(void)
+{
+ PGconn *conn1 = NULL;
+ PGconn *conn2 = NULL;
+ PGconn **idle_conns = NULL;
+ int num_idle = 0;
+ int max_idle = 100000; /* reasonable upper limit */
+ PGresult *res;
+ char channel1[] = "pgbench_channel_1";
+ char channel2[] = "pgbench_channel_2";
+ char notify_cmd[256];
+ bool first_failure = false;
+
+ pg_log_info("starting LISTEN/NOTIFY round-trip benchmark");
+ pg_log_info("round-trips per iteration: %d", notify_round_trips);
+ pg_log_info("idle listeners added per iteration: %d", notify_idle_step);
+ printf("\n%14s %19s %19s\n", "idle_listeners", "round_trips_per_sec", "max_latency_usec");
+
+ /* Allocate array for idle connections */
+ idle_conns = (PGconn **) pg_malloc0(max_idle * sizeof(PGconn *));
+
+ /* Create two active connections for ping-pong */
+ conn1 = doConnect();
+ if (conn1 == NULL)
+ pg_fatal("failed to create connection 1");
+
+ conn2 = doConnect();
+ if (conn2 == NULL)
+ pg_fatal("failed to create connection 2");
+
+ /* Set up LISTEN on both connections */
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel1);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 1: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel2);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 2: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Main benchmark loop: measure round-trips then add idle connections */
+ while (num_idle < max_idle)
+ {
+ int i;
+ int64 total_latency = 0;
+ int64 max_latency = 0;
+
+ /* Perform round-trip measurements */
+ for (i = 0; i < notify_round_trips; i++)
+ {
+ pg_time_usec_t start_time,
+ end_time;
+ int64 latency;
+ PGnotify *notify;
+ int sock;
+ fd_set input_mask;
+ struct timeval tv;
+
+ /* Clear any pending notifications */
+ PQconsumeInput(conn1);
+ while ((notify = PQnotifies(conn1)) != NULL)
+ PQfreemem(notify);
+ PQconsumeInput(conn2);
+ while ((notify = PQnotifies(conn2)) != NULL)
+ PQfreemem(notify);
+
+ /* Start timer and send notification from conn1 */
+ start_time = pg_time_now();
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel2);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ /* Wait for notification on conn2 */
+ sock = PQsocket(conn2);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn2);
+ notify = PQnotifies(conn2);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* Send notification back from conn2 */
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel1);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Wait for notification on conn1 */
+ sock = PQsocket(conn1);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn1);
+ notify = PQnotifies(conn1);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* End timer */
+ end_time = pg_time_now();
+
+ /* Calculate individual round-trip latency */
+ latency = end_time - start_time;
+
+ /* Accumulate total latency and track maximum */
+ total_latency += latency;
+ if (latency > max_latency)
+ max_latency = latency;
+ }
+
+ /* Calculate and report round-trips per second and max latency */
+ fprintf(stdout, "%14d %19.1f %19" PRId64 "\n",
+ num_idle,
+ 1000000.0 * notify_round_trips / total_latency,
+ max_latency);
+ fflush(stdout);
+
+ /* Stop if we hit connection limit */
+ if (first_failure)
+ break;
+
+ /* Add idle listening connections */
+ for (i = 0; i < notify_idle_step && num_idle < max_idle; i++)
+ {
+ PGconn *idle_conn;
+ char idle_channel[256];
+
+ idle_conn = doConnect();
+ if (idle_conn == NULL)
+ {
+ if (!first_failure)
+ {
+ pg_log_info("reached max_connections at %d idle listeners", num_idle);
+ first_failure = true;
+ }
+ break;
+ }
+
+ /* Each idle connection listens on a unique channel */
+ snprintf(idle_channel, sizeof(idle_channel), "idle_%d", num_idle);
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", idle_channel);
+
+ res = PQexec(idle_conn, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ pg_log_warning("LISTEN failed on idle connection %d: %s",
+ num_idle, PQerrorMessage(idle_conn));
+ PQfinish(idle_conn);
+ PQclear(res);
+ first_failure = true;
+ break;
+ }
+ PQclear(res);
+
+ idle_conns[num_idle] = idle_conn;
+ num_idle++;
+ }
+
+ /* Stop if we couldn't add any connections */
+ if (first_failure && i == 0)
+ break;
+ }
+
+ /* Clean up */
+ pg_log_info("cleaning up connections");
+ PQfinish(conn1);
+ PQfinish(conn2);
+ for (int i = 0; i < num_idle; i++)
+ {
+ if (idle_conns[i])
+ PQfinish(idle_conns[i]);
+ }
+ pg_free(idle_conns);
+
+ pg_log_info("LISTEN/NOTIFY benchmark completed");
+}
+
int
main(int argc, char **argv)
{
@@ -6739,6 +6959,9 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"listen-notify-benchmark", no_argument, NULL, 18},
+ {"notify-round-trips", required_argument, NULL, 19},
+ {"notify-idle-step", required_argument, NULL, 20},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7315,22 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* listen-notify-benchmark */
+ listen_notify_mode = true;
+ benchmarking_option_set = true;
+ break;
+ case 19: /* notify-round-trips */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-round-trips", 1, INT_MAX,
+ &notify_round_trips))
+ exit(1);
+ break;
+ case 20: /* notify-idle-step */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-idle-step", 1, INT_MAX,
+ &notify_idle_step))
+ exit(1);
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7210,6 +7449,20 @@ main(int argc, char **argv)
pg_fatal("some of the specified options cannot be used in benchmarking mode");
}
+ /* Handle LISTEN/NOTIFY benchmark mode */
+ if (listen_notify_mode)
+ {
+ /* Establish a database connection for setup */
+ if ((con = doConnect()) == NULL)
+ pg_fatal("could not connect to database");
+
+ /* Run the LISTEN/NOTIFY benchmark */
+ runListenNotifyBenchmark();
+
+ PQfinish(con);
+ exit(0);
+ }
+
if (nxacts > 0 && duration > 0)
pg_fatal("specify either a number of transactions (-t) or a duration (-T), not both");
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v16.patch (7.8K, 2-0001-optimize_listen_notify-v16.patch)
From 600fef1d835a512a34cb8118e0681832cbae5120 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 103 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 52 ++++++++++
2 files changed, 154 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..9c19843d2d7 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 5 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..942b09d5735 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,26 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +94,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v16.patch (35.4K, 3-0002-optimize_listen_notify-v16.patch)
From 1ef39e6300a87ec0abb0fd729ae75538d8ecb45e Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
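As an illustration only (not code from the patch), the targeted
signaling described above can be modeled with a toy in-memory table.
The real implementation uses a dshash in DSA memory and sends
PROCSIG_NOTIFY_INTERRUPT; every name below (ToyChannelEntry,
toy_signal_channel, wakeup_pending, the array limits) is a hypothetical
stand-in chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BACKENDS 8
#define MAX_LISTENERS 4

/* Toy stand-in for one channel map entry: (dboid, channel) -> listeners. */
typedef struct ToyChannelEntry
{
    unsigned    dboid;
    const char *channel;
    int         listeners[MAX_LISTENERS];  /* backend slot numbers */
    int         numListeners;
} ToyChannelEntry;

/* Per-backend flag mirroring the idea behind wakeupPending. */
static bool wakeup_pending[MAX_BACKENDS];

/*
 * Flag every listener of (dboid, channel) and return how many new
 * signals would be sent.  Backends that are already flagged are
 * skipped, so a second NOTIFY before they wake up costs no signal.
 */
static int
toy_signal_channel(const ToyChannelEntry *entries, int nentries,
                   unsigned dboid, const char *channel)
{
    int         sent = 0;

    for (int e = 0; e < nentries; e++)
    {
        if (entries[e].dboid != dboid ||
            strcmp(entries[e].channel, channel) != 0)
            continue;

        for (int i = 0; i < entries[e].numListeners; i++)
        {
            int         slot = entries[e].listeners[i];

            assert(slot >= 0 && slot < MAX_BACKENDS);
            if (!wakeup_pending[slot])
            {
                /* A real sender would signal the backend here. */
                wakeup_pending[slot] = true;
                sent++;
            }
        }
    }
    return sent;
}
```

In this model a NOTIFY on a channel with two listeners produces two
signals the first time and none the second time, and listeners on other
channels or in other databases are never touched.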
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
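The decision described above can be sketched roughly as follows. This
is a simplified model, not the patch's code: ToyQueuePos, ToyAction and
toy_direct_advance are hypothetical names, locking is omitted, and
backends that are neither listening on a notified channel nor sitting
exactly at oldHead are simply left alone here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy queue position; the real QueuePosition lives in async.c. */
typedef struct ToyQueuePos
{
    long        page;
    int         offset;
} ToyQueuePos;

typedef enum ToyAction
{
    TOY_SIGNAL,                 /* listens on a notified channel: wake it */
    TOY_ADVANCED,               /* moved from old head to new head, no wakeup */
    TOY_LEFT_ALONE              /* neither case applies in this sketch */
} ToyAction;

static bool
toy_pos_equal(ToyQueuePos a, ToyQueuePos b)
{
    return a.page == b.page && a.offset == b.offset;
}

/*
 * Decide what to do with one listening backend after our commit wrote
 * the region [old_head, new_head) while holding the writer lock.
 */
static ToyAction
toy_direct_advance(ToyQueuePos *pos, bool listens_on_notified_channel,
                   ToyQueuePos old_head, ToyQueuePos new_head)
{
    assert(pos != NULL);

    /* Interested backends must read the entries themselves. */
    if (listens_on_notified_channel)
        return TOY_SIGNAL;

    /*
     * An uninterested backend that had read everything up to old_head
     * can skip our entries entirely, since nobody else wrote in between.
     */
    if (toy_pos_equal(*pos, old_head))
    {
        *pos = new_head;
        return TOY_ADVANCED;
    }

    return TOY_LEFT_ALONE;
}
```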
Queue health
------------
If a backend has fallen too far behind (lag >= QUEUE_CLEANUP_DELAY
pages), it is signaled to catch up so the global queue tail can advance.
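A minimal sketch of that lag test, assuming plain non-wrapping page
numbers (the real queue compares pages with wraparound handling).
TOY_CLEANUP_DELAY and toy_needs_catchup_signal are illustrative names,
and the value 4 is picked only for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CLEANUP_DELAY 4     /* illustrative threshold, in pages */

/*
 * Return true when a backend is at least TOY_CLEANUP_DELAY pages behind
 * the queue head and should be woken so the tail can move forward.
 */
static bool
toy_needs_catchup_signal(int64_t head_page, int64_t backend_page)
{
    assert(head_page >= backend_page);  /* toy model: no wraparound */
    return (head_page - backend_page) >= TOY_CLEANUP_DELAY;
}
```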
Other notes
-----------
* Replaces the per-backend listenChannels list with the shared
channelHash. A simple numChannelsListeningOn counter determines
whether the backend remains registered in the global listener list.
* pg_listening_channels and IsListeningOn now read from channelHash.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 645 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 559 insertions(+), 90 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..08ac19fa3cb 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,27 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +141,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +151,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +179,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +267,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +286,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -288,11 +329,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +422,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -312,17 +434,11 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
-/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
- */
-static List *listenChannels = NIL; /* list of C strings */
-
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change the shared channelHash until we reach transaction
+ * commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +507,8 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ QueuePosition queueHeadBeforeWrite; /* QUEUE_HEAD before writing notifies */
+ QueuePosition queueHeadAfterWrite; /* QUEUE_HEAD after writing notifies */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -418,6 +536,9 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/* Count of channels we're currently listening on */
+static int numChannelsListeningOn = 0;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +578,8 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static List *GetPendingNotifyChannels(void);
/*
* Compute the difference between two queue page numbers.
@@ -521,12 +644,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -683,7 +810,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the shared channelHash happens during transaction
* commit.
*/
static void
@@ -782,24 +909,60 @@ Async_UnlistenAll(void)
/*
* SQL function: return a set of the channel names this backend is actively
* listening to.
- *
- * Note: this coding relies on the fact that the listenChannels list cannot
- * change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* get channels from channelHash and store in function context */
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+ listenChannels = NIL;
+
+ if (channelHash != NULL)
+ {
+ dshash_seq_init(&status, channelHash, false);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ listenChannels = lappend(listenChannels, pstrdup(entry->key.channel));
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ }
+
+ funcctx->user_fctx = listenChannels;
+ MemoryContextSwitchTo(oldcontext);
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ listenChannels = (List *) funcctx->user_fctx;
if (funcctx->call_cntr < list_length(listenChannels))
{
@@ -894,6 +1057,7 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1086,9 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(pendingNotifies->queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,6 +1106,19 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ pendingNotifies->queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
@@ -948,6 +1128,18 @@ PreCommit_Notify(void)
LWLockRelease(NotifyQueueLock);
}
+ /*
+ * Save the queue head after writing all our notifications. This is
+ * used by SignalBackends() to know where to advance idle backends to.
+ * We must save this now because other backends may write their own
+ * notifications after we release the heavyweight lock but before we
+ * call SignalBackends(), and we must not advance backends over those
+ * other notifications.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ pendingNotifies->queueHeadAfterWrite = QUEUE_HEAD;
+ LWLockRelease(NotifyQueueLock);
+
/* Note that we don't clear pendingNotifies; AtCommit_Notify will. */
}
}
@@ -957,7 +1149,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update channelHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1194,7 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/*
@@ -1130,55 +1322,131 @@ Exec_ListenPreCommit(void)
/*
* Exec_ListenCommit --- subroutine for AtCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Add the channel to the shared channelHash.
*/
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
- return;
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
/*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+ numChannelsListeningOn++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from channelHash.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ numChannelsListeningOn--;
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,33 +1461,84 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+ numChannelsListeningOn = 0;
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
+ * Perhaps it is worth further optimization, eg convert the listeners array
+ * to a sorted array so we can binary-search it.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
- foreach(p, listenChannels)
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ entry = dshash_find(channelHash, &key, false);
+ if (entry == NULL)
+ return false; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
return true;
+ }
}
+
+ dshash_release_lock(channelHash, entry);
return false;
}
@@ -1230,7 +1549,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(numChannelsListeningOn == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1565,12 +1884,19 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are still positioned at the queue head from before our
+ * commit can be safely advanced directly to the new head, since the
+ * queue region we wrote is known to contain only our own notifications.
+ * This avoids unnecessary wakeups when there is nothing of interest to
+ * them.
+ *
+ * In addition, if a backend has fallen too far behind in the queue, we
+ * signal it so that it will advance its position and allow the global
+ * tail pointer to move forward.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1909,9 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ List *channels;
+ ListCell *lc;
+ int64 queue_length;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1596,37 +1925,120 @@ SignalBackends(void)
procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
count = 0;
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up or wrong database */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+ if (QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Avoid needing to wake listening backends that are at the old queue head
+ * (before we wrote our notifications) that we know are not interested in
+ * our notifications, since otherwise they would have been marked for
+ * wakeup by now. Do this by advancing them directly to the new queue
+ * head.
+ */
+ if (pendingNotifies != NULL)
+ {
+ QueuePosition oldHead = pendingNotifies->queueHeadBeforeWrite;
+ QueuePosition newHead = pendingNotifies->queueHeadAfterWrite;
+
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
+ {
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, oldHead))
+ QUEUE_BACKEND_POS(i) = newHead;
}
- else
+ }
+
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ /* Need to signal if a backend has fallen too far behind */
+ if (lag >= QUEUE_CLEANUP_DELAY)
+ {
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2085,9 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * channelHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener && numChannelsListeningOn == 0)
asyncQueueUnregister();
/* And clean up */
@@ -1865,6 +2277,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2186,7 +2599,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (numChannelsListeningOn == 0)
return;
if (Trace_notify)
@@ -2395,3 +2808,55 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
+
+/*
+ * GetPendingNotifyChannels
+ * Get list of unique channel names from pending notifications.
+ */
+static List *
+GetPendingNotifyChannels(void)
+{
+ List *channels = NIL;
+ ListCell *p;
+ ListCell *q;
+ bool found;
+
+ if (!pendingNotifies)
+ return NIL;
+
+ foreach(p, pendingNotifies->events)
+ {
+ Notification *n = (Notification *) lfirst(p);
+ char *channel = n->data;
+
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ channels = lappend(channels, channel);
+ }
+
+ return channels;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
[text/plain] pgbench-listen-notify-benchmark-patch.txt (9.3K, 4-pgbench-listen-notify-benchmark-patch.txt)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..3f47c50847d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -35,6 +35,7 @@
#include <ctype.h>
#include <float.h>
+#include <inttypes.h>
#include <limits.h>
#include <math.h>
#include <signal.h>
@@ -237,6 +238,11 @@ static const char *const PARTITION_METHOD[] = {"none", "range", "hash"};
/* random seed used to initialize base_random_sequence */
static int64 random_seed = -1;
+/* LISTEN/NOTIFY benchmark mode parameters */
+static bool listen_notify_mode = false; /* enable LISTEN/NOTIFY benchmark */
+static int notify_round_trips = 100; /* number of round-trips per iteration */
+static int notify_idle_step = 10; /* idle listeners to add per iteration */
+
/*
* end of configurable parameters
*********************************************************************/
@@ -930,6 +936,10 @@ usage(void)
" (same as \"-b simple-update\")\n"
" -S, --select-only perform SELECT-only transactions\n"
" (same as \"-b select-only\")\n"
+ " --listen-notify-benchmark\n"
+ " run LISTEN/NOTIFY round-trip benchmark\n"
+ " --notify-round-trips=NUM number of round-trips per iteration (default: 100)\n"
+ " --notify-idle-step=NUM idle listeners to add per iteration (default: 10)\n"
"\nBenchmarking options:\n"
" -c, --client=NUM number of concurrent database clients (default: 1)\n"
" -C, --connect establish new connection for each transaction\n"
@@ -6689,6 +6699,216 @@ set_random_seed(const char *seed)
return true;
}
+/*
+ * Run LISTEN/NOTIFY round-trip benchmark
+ *
+ * This benchmark measures the round-trip time between two processes that
+ * ping-pong NOTIFY messages while adding idle listening connections.
+ */
+static void
+runListenNotifyBenchmark(void)
+{
+ PGconn *conn1 = NULL;
+ PGconn *conn2 = NULL;
+ PGconn **idle_conns = NULL;
+ int num_idle = 0;
+ int max_idle = 100000; /* reasonable upper limit */
+ PGresult *res;
+ char channel1[] = "pgbench_channel_1";
+ char channel2[] = "pgbench_channel_2";
+ char notify_cmd[256];
+ bool first_failure = false;
+
+ pg_log_info("starting LISTEN/NOTIFY round-trip benchmark");
+ pg_log_info("round-trips per iteration: %d", notify_round_trips);
+ pg_log_info("idle listeners added per iteration: %d", notify_idle_step);
+ printf("\n%14s %19s %19s\n", "idle_listeners", "round_trips_per_sec", "max_latency_usec");
+
+ /* Allocate array for idle connections */
+ idle_conns = (PGconn **) pg_malloc0(max_idle * sizeof(PGconn *));
+
+ /* Create two active connections for ping-pong */
+ conn1 = doConnect();
+ if (conn1 == NULL)
+ pg_fatal("failed to create connection 1");
+
+ conn2 = doConnect();
+ if (conn2 == NULL)
+ pg_fatal("failed to create connection 2");
+
+ /* Set up LISTEN on both connections */
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel1);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 1: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", channel2);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("LISTEN failed on connection 2: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Main benchmark loop: measure round-trips then add idle connections */
+ while (num_idle < max_idle)
+ {
+ int i;
+ int64 total_latency = 0;
+ int64 max_latency = 0;
+
+ /* Perform round-trip measurements */
+ for (i = 0; i < notify_round_trips; i++)
+ {
+ pg_time_usec_t start_time,
+ end_time;
+ int64 latency;
+ PGnotify *notify;
+ int sock;
+ fd_set input_mask;
+ struct timeval tv;
+
+ /* Clear any pending notifications */
+ PQconsumeInput(conn1);
+ while ((notify = PQnotifies(conn1)) != NULL)
+ PQfreemem(notify);
+ PQconsumeInput(conn2);
+ while ((notify = PQnotifies(conn2)) != NULL)
+ PQfreemem(notify);
+
+ /* Start timer and send notification from conn1 */
+ start_time = pg_time_now();
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel2);
+ res = PQexec(conn1, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn1));
+ PQclear(res);
+
+ /* Wait for notification on conn2 */
+ sock = PQsocket(conn2);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn2);
+ notify = PQnotifies(conn2);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* Send notification back from conn2 */
+ snprintf(notify_cmd, sizeof(notify_cmd), "NOTIFY %s", channel1);
+ res = PQexec(conn2, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("NOTIFY failed: %s", PQerrorMessage(conn2));
+ PQclear(res);
+
+ /* Wait for notification on conn1 */
+ sock = PQsocket(conn1);
+ notify = NULL;
+ while (notify == NULL)
+ {
+ PQconsumeInput(conn1);
+ notify = PQnotifies(conn1);
+ if (notify == NULL)
+ {
+ /* Wait for data on socket */
+ FD_ZERO(&input_mask);
+ FD_SET(sock, &input_mask);
+ tv.tv_sec = 10; /* 10 second timeout */
+ tv.tv_usec = 0;
+ if (select(sock + 1, &input_mask, NULL, NULL, &tv) < 0)
+ pg_fatal("select() failed: %m");
+ }
+ }
+ PQfreemem(notify);
+
+ /* End timer */
+ end_time = pg_time_now();
+
+ /* Calculate individual round-trip latency */
+ latency = end_time - start_time;
+
+ /* Accumulate total latency and track maximum */
+ total_latency += latency;
+ if (latency > max_latency)
+ max_latency = latency;
+ }
+
+ /* Calculate and report round-trips per second and max latency */
+ fprintf(stdout, "%14d %19.1f %19" PRId64 "\n",
+ num_idle,
+ 1000000.0 * notify_round_trips / total_latency,
+ max_latency);
+ fflush(stdout);
+
+ /* Stop if we hit connection limit */
+ if (first_failure)
+ break;
+
+ /* Add idle listening connections */
+ for (i = 0; i < notify_idle_step && num_idle < max_idle; i++)
+ {
+ PGconn *idle_conn;
+ char idle_channel[256];
+
+ idle_conn = doConnect();
+ if (idle_conn == NULL)
+ {
+ if (!first_failure)
+ {
+ pg_log_info("reached max_connections at %d idle listeners", num_idle);
+ first_failure = true;
+ }
+ break;
+ }
+
+ /* Each idle connection listens on a unique channel */
+ snprintf(idle_channel, sizeof(idle_channel), "idle_%d", num_idle);
+ snprintf(notify_cmd, sizeof(notify_cmd), "LISTEN %s", idle_channel);
+
+ res = PQexec(idle_conn, notify_cmd);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ pg_log_warning("LISTEN failed on idle connection %d: %s",
+ num_idle, PQerrorMessage(idle_conn));
+ PQfinish(idle_conn);
+ PQclear(res);
+ first_failure = true;
+ break;
+ }
+ PQclear(res);
+
+ idle_conns[num_idle] = idle_conn;
+ num_idle++;
+ }
+
+ /* Stop if we couldn't add any connections */
+ if (first_failure && i == 0)
+ break;
+ }
+
+ /* Clean up */
+ pg_log_info("cleaning up connections");
+ PQfinish(conn1);
+ PQfinish(conn2);
+ for (int i = 0; i < num_idle; i++)
+ {
+ if (idle_conns[i])
+ PQfinish(idle_conns[i]);
+ }
+ pg_free(idle_conns);
+
+ pg_log_info("LISTEN/NOTIFY benchmark completed");
+}
+
int
main(int argc, char **argv)
{
@@ -6739,6 +6959,9 @@ main(int argc, char **argv)
{"verbose-errors", no_argument, NULL, 15},
{"exit-on-abort", no_argument, NULL, 16},
{"debug", no_argument, NULL, 17},
+ {"listen-notify-benchmark", no_argument, NULL, 18},
+ {"notify-round-trips", required_argument, NULL, 19},
+ {"notify-idle-step", required_argument, NULL, 20},
{NULL, 0, NULL, 0}
};
@@ -7092,6 +7315,22 @@ main(int argc, char **argv)
case 17: /* debug */
pg_logging_increase_verbosity();
break;
+ case 18: /* listen-notify-benchmark */
+ listen_notify_mode = true;
+ benchmarking_option_set = true;
+ break;
+ case 19: /* notify-round-trips */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-round-trips", 1, INT_MAX,
+ &notify_round_trips))
+ exit(1);
+ break;
+ case 20: /* notify-idle-step */
+ benchmarking_option_set = true;
+ if (!option_parse_int(optarg, "--notify-idle-step", 1, INT_MAX,
+ &notify_idle_step))
+ exit(1);
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -7210,6 +7449,20 @@ main(int argc, char **argv)
pg_fatal("some of the specified options cannot be used in benchmarking mode");
}
+ /* Handle LISTEN/NOTIFY benchmark mode */
+ if (listen_notify_mode)
+ {
+ /* Establish a database connection for setup */
+ if ((con = doConnect()) == NULL)
+ pg_fatal("could not connect to database");
+
+ /* Run the LISTEN/NOTIFY benchmark */
+ runListenNotifyBenchmark();
+
+ PQfinish(con);
+ exit(0);
+ }
+
if (nxacts > 0 && duration > 0)
pg_fatal("specify either a number of transactions (-t) or a duration (-T), not both");
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-14 21:19 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Tom Lane @ 2025-10-14 21:19 UTC
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> Having investigated this, the "direct advancement" approach seems
> correct to me.
> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
> is of course needed because there are other operations that don't
> acquire the heavyweight-lock, that take shared/exclusive lock on
> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
> QUEUE_HEAD.)
Right. What the heavyweight lock buys for us in this context is that
we can be sure no other would-be notifier can insert any messages
in between ours, even though we may take and release NotifyQueueLock
several times to allow readers to sneak in. That in turn means that
it's safe to advance readers over that whole set of messages if we
know we didn't wake them up for any of those messages.
There is a false-positive possibility if a reader was previously
signaled but hasn't yet awoken: we will think that maybe we signaled
it and hence not advance its pointer. This is an error in the safe
direction however, and it will advance its pointer when it does
wake up.
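
To make the direct-advancement rule concrete, here is a toy model of it.
This is only a sketch, not the patch's code: ToyListener and
toy_signal_backends are made-up names, queue positions are plain integers
rather than QueuePosition values, and the "interested" flag stands in for
the channel hash lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one listening backend (hypothetical, for illustration). */
typedef struct
{
	int			pos;			/* next queue entry this listener will read */
	bool		wakeup_pending; /* signaled earlier, not yet processed */
	bool		interested;		/* listens on a channel we just notified */
} ToyListener;

/*
 * old_head/new_head delimit the region that holds only our own messages.
 * Returns the number of listeners that must be signaled.
 */
static int
toy_signal_backends(ToyListener *l, int n, int old_head, int new_head)
{
	int			nsignal = 0;

	assert(new_head >= old_head);

	for (int i = 0; i < n; i++)
	{
		/*
		 * Already signaled by someone: we cannot tell whether it still has
		 * to read older messages, so leave its position alone.  This is the
		 * safe-direction false positive.
		 */
		if (l[i].wakeup_pending)
			continue;

		if (l[i].interested)
		{
			l[i].wakeup_pending = true;
			nsignal++;
		}
		else if (l[i].pos == old_head)
			l[i].pos = new_head;	/* skip our messages without a wakeup */
	}
	return nsignal;
}
```

A listener that was sitting at old_head and is not interested ends up at
new_head without ever being woken; one with wakeup_pending already set keeps
its old position and catches up by itself when it wakes.
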
A potential complaint is that we are doubling down on the need for
that heavyweight lock, despite the upthread discussion about maybe
getting rid of it for better scalability. However, this patch
only requires holding a lock across all the insertions, not holding
it through commit which I think is the true scalability blockage.
If we did want to get rid of that lock, we'd only need to stop
releasing NotifyQueueLock at insertion page boundary crossings,
which I suspect isn't really that useful anyway. (In connection
with that though, I think you ought to capture both the "before" and
"after" pointers within that lock interval, not expend another lock
acquisition later.)
It would be good if the patch's comments made these points ...
also, the comments above struct AsyncQueueControl need to be
updated, because changing some other backend's queue pos is
not legal under any of the stated rules.
> Given all the experiments since my earlier message, here is a fresh,
> self-contained write-up:
I'm getting itchy about removing the local listenChannels list,
because what you've done is to replace it with a shared data
structure that can't be accessed without a good deal of locking
overhead. That seems like it could easily be a net loss.
Also, I really do not like this implementation of
GetPendingNotifyChannels, as it looks like O(N^2) effort.
The potentially large length of the list it builds is scary too,
considering the comments that SignalBackends had better not fail.
If we have to do it that way it'd be better to collect the list
during PreCommit_Notify.
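
For illustration only, one way to avoid the quadratic comparison is to sort
the channel names and then compact duplicates in a single pass, which is
O(N log N). This is a standalone sketch with plain C strings, not backend
code: unique_channels and cmp_channel are hypothetical names, and a real
patch might instead use a hash table or reuse the existing pending-notify
hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C string pointers */
static int
cmp_channel(const void *a, const void *b)
{
	return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/*
 * Sort the array in place and move one copy of each distinct channel name
 * to the front.  Returns the number of distinct names.
 */
static size_t
unique_channels(const char **channels, size_t n)
{
	size_t		nuniq = 0;

	assert(channels != NULL || n == 0);

	if (n == 0)
		return 0;

	qsort(channels, n, sizeof(channels[0]), cmp_channel);

	for (size_t i = 0; i < n; i++)
	{
		/* after sorting, duplicates are adjacent */
		if (nuniq == 0 || strcmp(channels[nuniq - 1], channels[i]) != 0)
			channels[nuniq++] = channels[i];
	}
	return nuniq;
}
```

The array is only reordered, never grown, so the pass itself allocates
nothing, which matters if it has to run in a place that had better not fail.
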
The "Avoid needing to wake listening backends" loop should probably
be combined with the loop after it; I don't quite see the point of
iterating over all the listening backends twice. Also, why is the
second loop only paying attention to backends in the same DB?
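
A single pass could look roughly like the following. Again this is a toy
sketch under stated assumptions: ToyBackend, TOY_CLEANUP_DELAY and
toy_advance_or_signal are invented names, positions are reduced to page
numbers, and no database filter is applied at all.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CLEANUP_DELAY 4		/* pages of lag that force a wakeup */

/* Toy model of one listening backend (hypothetical, for illustration). */
typedef struct
{
	int			page;			/* queue page this backend has read up to */
	bool		wakeup_pending; /* already chosen for signaling */
} ToyBackend;

/*
 * One pass over all listeners: advance those still at the old head, and
 * signal those that lag too far behind the new head.  Returns the number
 * of backends newly marked for signaling.
 */
static int
toy_advance_or_signal(ToyBackend *b, int n, int old_head, int new_head)
{
	int			nsignal = 0;

	assert(new_head >= old_head);

	for (int i = 0; i < n; i++)
	{
		if (b[i].wakeup_pending)
			continue;			/* handled already, leave it alone */

		if (b[i].page == old_head)
		{
			b[i].page = new_head;	/* direct advancement, no wakeup */
			continue;
		}

		if (new_head - b[i].page >= TOY_CLEANUP_DELAY)
		{
			b[i].wakeup_pending = true; /* too far behind, wake it */
			nsignal++;
		}
	}
	return nsignal;
}
```
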
I don't love adding queueHeadBeforeWrite and queueHeadAfterWrite into
the pendingNotifies data structure, as they have no visible connection
to that. In particular, we will have multiple NotificationList
structs when there are nested transactions, and it's certainly
meaningless to have such fields in more than one place. Probably
just making them independent static variables is the best way.
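
As a rough sketch of that shape, combined with the earlier point about
capturing both pointers inside the lock interval: the toy below keeps the
two positions in file-level statics and records the "after" value before
the last lock release, so no extra acquisition is needed. All names here
(ToyQueue, toy_lock, toyHeadBeforeWrite and so on) are hypothetical, and the
lock is only a counter, not an LWLock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy queue: the "lock" merely counts acquisitions. */
typedef struct
{
	int			head;				/* next free slot in the queue */
	int			lock_acquisitions;	/* how often the lock was taken */
	bool		locked;
} ToyQueue;

/* File-level state, independent of any per-transaction notification list. */
static int	toyHeadBeforeWrite;
static int	toyHeadAfterWrite;

static void
toy_lock(ToyQueue *q)
{
	assert(!q->locked);
	q->locked = true;
	q->lock_acquisitions++;
}

static void
toy_unlock(ToyQueue *q)
{
	assert(q->locked);
	q->locked = false;
}

/* Write nmsgs entries, releasing the lock at each page boundary. */
static void
toy_write_notifications(ToyQueue *q, int nmsgs, int page_size)
{
	bool		first = true;

	while (nmsgs > 0)
	{
		int			room;
		int			n;

		toy_lock(q);
		if (first)
		{
			toyHeadBeforeWrite = q->head;	/* captured under the lock */
			first = false;
		}
		room = page_size - (q->head % page_size);
		n = (nmsgs < room) ? nmsgs : room;
		q->head += n;
		nmsgs -= n;
		if (nmsgs == 0)
			toyHeadAfterWrite = q->head;	/* same lock interval */
		toy_unlock(q);
	}
}
```
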
The overall layout of what the patch injects where needs another
look. I don't like inserting code before typedefs and static
variables within a module: that's not our normal layout style.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 03:19 Chao Li <[email protected]>
parent: Tom Lane <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-15 03:19 UTC
To: Tom Lane <[email protected]>; Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
> On Oct 15, 2025, at 05:19, Tom Lane <[email protected]> wrote:
>
> "Joel Jacobson" <[email protected]> writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
>
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
>
> Right. What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in. That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.
>
> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer. This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.
>
> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability. However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway. (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)
>
> It would be good if the patch's comments made these points ...
> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.
>
I used to think “direct advancement” was a good idea. After reading Tom’s explanation and rereading v16 carefully, I now also think it adds complexity and could be fragile.
I have put together an example of a possible race condition; please check whether it is valid.
Because queueHeadBeforeWrite and queueHeadAfterWrite are recorded in PreCommit_Notify() but checked in AtCommit_Notify(), there is a window in between during which other backends can act.
Say there is a listener A whose queue position is 1.
The current QueueHead is also 1.
Now two notifiers, B and C, are committing:
* B enters PreCommit_Notify() and gets NotifyQueueLock first; it records headBeforeWrite = 1, writes up to 3, and records headAfterWrite = 3.
* Now QueueHead is 3.
* C enters PreCommit_Notify(); it records headBeforeWrite = 3, writes up to 5, and records headAfterWrite = 5.
* Now QueueHead is 5.
* C runs AtCommit_Notify(); since A’s position is 1, which does not equal C’s headBeforeWrite, C won’t advance A’s position.
* B runs AtCommit_Notify(); A’s position equals B’s headBeforeWrite, so B advances A’s position to 3.
* At this point QueueHead is 5 and A’s position is 3, so “direct advancement” will never work for A until A next wakes up.
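To make the interleaving above easier to check, here is a small standalone model of it. This is only a sketch: it is not code from the patch, it ignores all locking, and every name in it (NotifierState, sim_precommit, sim_try_advance, sim_run) is made up for illustration. Queue positions are plain integers, as in the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-notifier state: the head before and after its write. */
typedef struct NotifierState
{
	int			head_before_write;
	int			head_after_write;
} NotifierState;

/* Models the pre-commit step: append n_entries and remember both heads. */
static NotifierState
sim_precommit(int *queue_head, int n_entries)
{
	NotifierState s;

	assert(n_entries > 0);
	s.head_before_write = *queue_head;
	*queue_head += n_entries;
	s.head_after_write = *queue_head;
	return s;
}

/*
 * Models the at-commit step: advance the listener only if it is still
 * positioned exactly at this notifier's pre-write head.
 */
static bool
sim_try_advance(const NotifierState *s, int *listener_pos)
{
	if (*listener_pos != s->head_before_write)
		return false;
	*listener_pos = s->head_after_write;
	return true;
}

/*
 * Runs the scenario with B writing before C, and returns listener A's
 * final position.  c_commits_first selects which notifier runs its
 * at-commit step first.
 */
static int
sim_run(bool c_commits_first)
{
	int			queue_head = 1;
	int			listener_a = 1;
	NotifierState b = sim_precommit(&queue_head, 2);	/* head 1 -> 3 */
	NotifierState c = sim_precommit(&queue_head, 2);	/* head 3 -> 5 */

	if (c_commits_first)
	{
		sim_try_advance(&c, &listener_a);
		sim_try_advance(&b, &listener_a);
	}
	else
	{
		sim_try_advance(&b, &listener_a);
		sim_try_advance(&c, &listener_a);
	}
	return listener_a;
}
```

In this model, sim_run(true) leaves A at 3 while the head is 5, whereas sim_run(false) brings A all the way to 5. Whether the first ordering can really occur in the server is exactly the question raised above.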
I am just brainstorming here, but maybe we can use a simpler strategy: wake a backend only when its queue lag exceeds a threshold. That would be simpler and more reliable, and it would also reduce the total number of wakeups.
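A rough sketch of what such a lag check could look like, purely for discussion. It assumes page numbers are 64-bit and do not wrap around, and the names (lag_exceeds_threshold, mark_lagging_listeners, threshold_pages) are hypothetical, not taken from async.c or the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the listener is at least threshold_pages behind the head. */
static bool
lag_exceeds_threshold(int64_t head_page, int64_t listener_page,
					  int64_t threshold_pages)
{
	/* Assumption: no wraparound, so a listener is never ahead of the head. */
	assert(head_page >= listener_page);
	return (head_page - listener_page) >= threshold_pages;
}

/*
 * Fills wake[i] for each listener and returns how many would be woken.
 * listener_pages and wake must both have n elements.
 */
static size_t
mark_lagging_listeners(int64_t head_page, const int64_t *listener_pages,
					   bool *wake, size_t n, int64_t threshold_pages)
{
	size_t		n_wake = 0;

	for (size_t i = 0; i < n; i++)
	{
		wake[i] = lag_exceeds_threshold(head_page, listener_pages[i],
										threshold_pages);
		if (wake[i])
			n_wake++;
	}
	return n_wake;
}
```

With the head on page 10, a threshold of 4, and listeners on pages 10, 7 and 3, only the last listener would be woken.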
>
>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
>
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead. That seems like it could easily be a net loss.
>
> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.
>
I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.
In PreCommit_Notify(), we can maintain a local hash table that records the channel names of pending notifications. dynahash also supports hash tables in local memory.
Then SignalBackends() no longer needs GetPendingNotifyChannels(); it can just iterate over all keys of the local channel-name hash.
The local static numChannelsListeningOn is not needed either. We can get the count from the local hash.
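To illustrate the idea of collecting distinct channel names while notifications are queued, here is a self-contained sketch. It deliberately does not use the server's hash table API; it is a tiny fixed-size open-addressing set, and all names and sizes (ChannelSet, CHANNEL_NAME_MAX, CHANNEL_SET_SLOTS) are assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CHANNEL_NAME_MAX	64	/* hypothetical stand-in for NAMEDATALEN */
#define CHANNEL_SET_SLOTS	128 /* hypothetical capacity; power of two */

typedef struct ChannelSet
{
	char		names[CHANNEL_SET_SLOTS][CHANNEL_NAME_MAX];
	bool		used[CHANNEL_SET_SLOTS];
	int			count;			/* number of distinct channels stored */
} ChannelSet;

static void
channel_set_init(ChannelSet *set)
{
	memset(set, 0, sizeof(*set));
}

/* FNV-1a hash over a NUL-terminated string. */
static uint32_t
channel_name_hash(const char *name)
{
	uint32_t	h = 2166136261u;

	for (; *name != '\0'; name++)
	{
		h ^= (unsigned char) *name;
		h *= 16777619u;
	}
	return h;
}

/* Returns true if name was newly added, false if it was already present. */
static bool
channel_set_add(ChannelSet *set, const char *name)
{
	uint32_t	slot;

	assert(strlen(name) < CHANNEL_NAME_MAX);

	/* Linear probing; terminates because one slot is always kept free. */
	slot = channel_name_hash(name) & (CHANNEL_SET_SLOTS - 1);
	while (set->used[slot])
	{
		if (strcmp(set->names[slot], name) == 0)
			return false;
		slot = (slot + 1) & (CHANNEL_SET_SLOTS - 1);
	}

	assert(set->count < CHANNEL_SET_SLOTS - 1);
	strcpy(set->names[slot], name);
	set->used[slot] = true;
	set->count++;
	return true;
}
```

Calling channel_set_add() once per queued notification leaves one entry per distinct channel, so the signaling step can walk the used slots instead of rescanning the whole notification list, and set->count gives the number of channels directly.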
WRT v16, I have a few new comments:
1 - 0002
```
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
```
This comment refers to both “channelHash” and “the shared channel hash table”. They are the same thing, but using two names can easily mislead readers.
2 - 0002
```
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
```
listenChannels is only used within the “if”, so its definition can be moved into the “if”.
3 - 0002
```
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
```
I wonder why this check is needed. If queue_length is 0, can we return immediately from SignalBackends()?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 03:22 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-15 03:22 UTC
To: Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Tue, Oct 14, 2025, at 23:19, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
>
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
>
> Right. What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in. That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.
Right.
> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer. This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.
I've added a comment on this in SignalBackends.
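For readers following along, the per-listener decision discussed here can be modeled roughly as below. This is a simplified reading of the discussion, not the code in the patch: it ignores locking and the lagging-backend check, and the names (ListenerSlot, ListenerAction, decide_listener_action) are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ListenerAction
{
	ACTION_NONE,				/* leave the listener alone */
	ACTION_SIGNAL,				/* send a wakeup signal */
	ACTION_ADVANCE				/* move its position without waking it */
} ListenerAction;

/* Hypothetical per-listener state. */
typedef struct ListenerSlot
{
	int			pos;			/* queue position read so far */
	bool		wakeup_pending; /* signaled but not yet processed */
} ListenerSlot;

static ListenerAction
decide_listener_action(ListenerSlot *l, bool listens_on_notified_channel,
					   int head_before, int head_after)
{
	assert(head_before <= head_after);

	if (listens_on_notified_channel)
	{
		/* A wakeup is already on its way; do not send a duplicate. */
		if (l->wakeup_pending)
			return ACTION_NONE;
		l->wakeup_pending = true;
		return ACTION_SIGNAL;
	}

	/*
	 * Not interested in our messages.  Advance only if no wakeup is
	 * pending and the listener sits exactly at our pre-write head.  A
	 * pending wakeup blocks advancement: that is the false positive in
	 * the safe direction, since the listener will advance itself once it
	 * wakes up.
	 */
	if (!l->wakeup_pending && l->pos == head_before)
	{
		l->pos = head_after;
		return ACTION_ADVANCE;
	}
	return ACTION_NONE;
}
```

An idle, uninterested listener at the old head is advanced silently; the same listener with a wakeup already pending is left untouched and catches up on its own.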
> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability. However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
>
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway.
Right. So if the upthread discussion leads to getting rid of the
heavyweight lock, we would just need to hold the exclusive lock across
all insertions. Good to know the two efforts are not conflicting.
> (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)
Fixed.
> It would be good if the patch's comments made these points ...
I've added a comment inside PreCommit_Notify explaining that, if the
heavyweight lock were removed, it would suffice to hold the exclusive
lock across all insertions for the purpose of setting the "before" and
"after" pointers.
> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.
Fixed.
>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
>
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead. That seems like it could easily be a net loss.
I agree, I also prefer the local listenChannels list.
I've changed it back.
> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.
I agree. I've removed GetPendingNotifyChannels and added a local list,
named pendingNotifyChannels instead, collected during PreCommit_Notify.
> The "Avoid needing to wake listening backends" loop should probably
> be combined with the loop after it; I don't quite see the point of
> iterating over all the listening backends twice.
I agree. Fixed.
> Also, why is the
> second loop only paying attention to backends in the same DB?
Fixed. (We're already sure it's the same DB, since that's part of the
hash key. I've removed the redundant check.)
> I don't love adding queueHeadBeforeWrite and queueHeadAfterWrite into
> the pendingNotifies data structure, as they have no visible connection
> to that. In particular, we will have multiple NotificationList
> structs when there's nested transactions, and it's certainly
> meaningless to have such fields in more than one place. Probably
> just making them independent static variables is the best way.
Fixed.
> The overall layout of what the patch injects where needs another
> look. I don't like inserting code before typedefs and static
> variables within a module: that's not our normal layout style.
Fixed.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v17.patch (7.8K, 2-0001-optimize_listen_notify-v17.patch)
download | inline diff:
From 600fef1d835a512a34cb8118e0681832cbae5120 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 103 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 52 ++++++++++
2 files changed, 154 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..9c19843d2d7 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 5 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..942b09d5735 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,26 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +94,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v17.patch (30.9K, 3-0002-optimize_listen_notify-v17.patch)
download | inline diff:
From 21e28730b9c2a27d3a2ae97ee31b12e9931f1aab Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
Queue health
------------
If a backend has fallen too far behind (lag >= QUEUE_CLEANUP_DELAY
pages), it is signaled to catch up so the global queue tail can advance.
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 528 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 487 insertions(+), 45 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..cecd4958dea 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,27 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +141,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +151,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +179,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +267,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +286,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -260,9 +301,9 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends, change the head pointer, and advance other
+ * backends' queue positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +329,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +347,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -418,6 +465,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +518,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Compute the difference between two queue page numbers.
@@ -478,6 +542,80 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +659,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -894,6 +1036,7 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1065,35 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * Build list of unique channels for SignalBackends().
+ */
+ pendingNotifyChannels = NIL;
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,12 +1111,33 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+
+ /*
+ * Capture the queue head after each batch of entries. On the
+ * last iteration, this gives us the final queue head position for
+ * SignalBackends() to use when advancing idle backends.
+ */
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -1135,6 +1328,10 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
MemoryContext oldcontext;
/* Do nothing if we are already listening on this channel */
@@ -1152,21 +1349,84 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from the list of channels we are listening on.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
ListCell *q;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ /* Remove from our local cache */
foreach(q, listenChannels)
{
char *lchan = (char *) lfirst(q);
@@ -1179,6 +1439,46 @@ Exec_UnlistenCommit(const char *channel)
}
}
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
+ }
+ }
+
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,11 +1493,51 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /* Clear our local cache */
list_free_deep(listenChannels);
listenChannels = NIL;
+
+ /* Now clear from shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
@@ -1565,12 +1905,19 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are still positioned at the queue head from before our
+ * commit can be safely advanced directly to the new head, since the
+ * queue region we wrote is known to contain only our own notifications.
+ * This avoids unnecessary wakeups when there is nothing of interest to
+ * them.
+ *
+ * In addition, if a backend has fallen too far behind in the queue, we
+ * signal it so that it will advance its position and allow the global
+ * tail pointer to move forward.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1930,7 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1945,111 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
+
+ if (channelHash != NULL)
+ {
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement and lagging backend detection.
+ *
+ * Direct advancement: avoid waking backends still positioned at the old
+ * queue head that aren't interested in our notifications.
+ *
+ * The heavyweight lock on database 0 (held in PreCommit_Notify) ensures
+ * no other backend can insert notifications in the region we just wrote.
+ * Even though we may take and release NotifyQueueLock multiple times
+ * while writing, the heavyweight lock guarantees this region contains
+ * only our messages. Therefore, any backend still positioned at the
+ * queue head from before our write can be safely advanced to the current
+ * queue head without waking it.
+ *
+ * False-positive possibility: if a backend was previously signaled but
+ * hasn't yet awoken, we'll skip advancing it (because wakeupPending is
+ * true). This is safe - the backend will advance its pointer when it
+ * does wake up. The alternative (advancing it anyway) would risk
+ * advancing over notifications from whoever signaled it.
+ *
+ * Lagging backends: we also check if any backend has fallen too far
+ * behind and signal it to catch up, allowing the global tail to advance.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- int32 pid = QUEUE_BACKEND_PID(i);
QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
- Assert(pid != InvalidPid);
pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+
+ /* Direct advancement for idle backends at the old head */
+ if (pendingNotifies != NULL &&
+ QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ pos = queueHeadAfterWrite;
}
- else
+
+ /* Signal backends that have fallen too far behind */
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ if (lag >= QUEUE_CLEANUP_DELAY)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1865,6 +2288,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2373,7 +2797,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2809,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2820,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 11:19 Arseniy Mukhin <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Arseniy Mukhin @ 2025-10-15 11:19 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers
Hi,
Thank you for working on it! Benchmarking looks great. There are several points:
I tried the patch and it seems listeners sometimes don't receive
notifications. To reproduce it you can try to listen to the channel in
one psql session and send notifications from another psql session. But
all tests are fine, so I tried to write a TAP test to reproduce it. It
passes on master and fails with the patch, so looks like it's real.
Please find the repro in attachments. I added the TAP test to amcheck
module just for simplicity.
I think "Direct advancement" is a good idea. But the way it's
implemented now has a concurrency bug. Listeners store their current
position in the local variable 'pos' while reading in
asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. This
means a notifier can directly advance the listener's position while
the listener still has the old value in its local variable. At the
same time, we use listener positions to find the limit up to which we
can truncate the queue in asyncQueueAdvanceTail().
asyncQueueAdvanceTail() doesn't know that listeners have a local copy
of their positions and can truncate the queue beyond it, which means
listeners can try to read notifications from a truncated segment. I
managed to reproduce it locally. Please let me know if more details
are needed.
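The interleaving described above can be replayed in a small single-process model. This is only an illustrative sketch, not code from the patch: SimQueue, sim_advance_tail, sim_replay_race and the other names are hypothetical, positions are plain integers rather than QueuePosition values, and locking is left out because the steps run in a fixed order.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NLISTENERS 2

/* Hypothetical, simplified stand-in for the shared queue state. */
typedef struct SimQueue
{
    int head;                   /* next free slot */
    int tail;                   /* oldest slot that still exists */
    int pos[SIM_NLISTENERS];    /* shared per-listener read positions */
} SimQueue;

/* Advance the tail to the minimum of all shared listener positions. */
static void
sim_advance_tail(SimQueue *q)
{
    int min = q->head;

    for (int i = 0; i < SIM_NLISTENERS; i++)
        if (q->pos[i] < min)
            min = q->pos[i];
    q->tail = min;
    assert(q->tail <= q->head);
}

/* A slot can only be read while it lies in [tail, head). */
static bool
sim_slot_readable(const SimQueue *q, int slot)
{
    return slot >= q->tail && slot < q->head;
}

/*
 * Replay the problematic interleaving.  Returns true if the listener's
 * private copy of its position ends up pointing at a truncated slot.
 */
static bool
sim_replay_race(void)
{
    SimQueue q = {.head = 1, .tail = 0, .pos = {1, 1}};
    int head_before;
    int head_after;
    int local_pos;

    /* Notifier writes two entries: head moves from 1 to 3. */
    head_before = q.head;
    q.head += 2;
    head_after = q.head;

    /* Listener 0 starts reading: it copies its position, then drops the lock. */
    local_pos = q.pos[0];

    /* Notifier directly advances every listener still at the old head. */
    for (int i = 0; i < SIM_NLISTENERS; i++)
        if (q.pos[i] == head_before)
            q.pos[i] = head_after;

    /* The tail is advanced using only the shared positions. */
    sim_advance_tail(&q);

    /* Listener 0 now tries to read from its stale private position. */
    return !sim_slot_readable(&q, local_pos);
}
```

In this model sim_replay_race() reports that the slot at the listener's private position is no longer readable, which corresponds to the "could not open file" failure on a truncated pg_notify segment.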
BTW, the error message is a bit confusing:
2025-10-15 13:32:15.570 MSK [261845] ERROR: could not access status of transaction 0
2025-10-15 13:32:15.570 MSK [261845] DETAIL: Could not open file "pg_notify/000000000000001": No such file or directory.
It looks like all SLRU I/O errors use an error message about transaction
status. It's not really a problem, since the log shows the directory
path.
Best regards,
Arseniy Mukhin
Attachments:
[application/octet-stream] listen-notify-test.patch.nocfbot (1.6K, 2-listen-notify-test.patch.nocfbot)
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 14:16 Tom Lane <[email protected]>
parent: Arseniy Mukhin <[email protected]>
0 siblings, 3 replies; 120+ messages in thread
From: Tom Lane @ 2025-10-15 14:16 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
Arseniy Mukhin <[email protected]> writes:
> I think "Direct advancement" is a good idea. But the way it's
> implemented now has a concurrency bug. Listeners store their current
> position in the local variable 'pos' while reading in
> asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. This
> means a notifier can directly advance the listener's position while
> the listener still has the old value in its local variable. At the
> same time, we use listener positions to find the limit up to which we
> can truncate the queue in asyncQueueAdvanceTail().
Good catch!
I think we can perhaps salvage the idea if we invent a separate
"advisory" queue position field, which tells its backend "hey,
you could skip as far as here if you want", but is not used for
purposes of SLRU truncation. Alternatively, split the queue pos
into "this is where to read next" and "this is as much as I'm
definitively done with", where the second field gets advanced at
the end of asyncQueueReadAllNotifications. Not sure which
view would be less confusing (in the end I guess they're nearly
the same thing, differently explained).
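The "advisory" variant can be pictured roughly as follows. This is a hedged sketch under stated assumptions, not anything from the patch: SimListener, advisoryPos and the sim_* helpers are hypothetical names, positions are plain integers, and the rule that a hint may be extended when the previous hint equals the old head is an assumption made here for illustration.

```c
#include <assert.h>

/* Hypothetical per-listener state with a separate advisory field. */
typedef struct SimListener
{
    int pos;            /* owned by the listener; limits truncation */
    int advisoryPos;    /* hint from notifiers; never limits truncation */
} SimListener;

/* Notifier side: hint that everything before new_head can be skipped. */
static void
sim_suggest_skip(SimListener *l, int old_head, int new_head)
{
    assert(new_head >= old_head);

    /* Assumption: hints chain when the previous hint ended at old_head. */
    if (l->pos == old_head || l->advisoryPos == old_head)
        l->advisoryPos = new_head;
}

/* Listener side: where to start reading on the next wakeup. */
static int
sim_next_read_pos(const SimListener *l)
{
    return (l->advisoryPos > l->pos) ? l->advisoryPos : l->pos;
}

/* Truncation side: only the listener-owned position is consulted. */
static int
sim_truncation_limit(const SimListener *listeners, int n, int head)
{
    int limit = head;

    for (int i = 0; i < n; i++)
        if (listeners[i].pos < limit)
            limit = listeners[i].pos;
    return limit;
}
```

The point of the split is that a notifier only ever writes advisoryPos, so the value used for truncation cannot move underneath a listener that is in the middle of reading.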
A different line of thought could be to get rid of
asyncQueueReadAllNotifications's optimization of moving the
queue pos only once, per
* (We could alternatively retake NotifyQueueLock and move the position
* before handling each individual message, but that seems like too much
* lock traffic.)
Since we only need shared lock to advance our own queue pos,
maybe that wouldn't be too awful. Not sure.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 15:36 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-15 15:36 UTC (permalink / raw)
To: Chao Li <[email protected]>; Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Wed, Oct 15, 2025, at 05:19, Chao Li wrote:
> * B enters PreCommit_Notify(), it gets the NotifyQueueLock first, it
> records headBeforeWrite = 1 and writes to 3, and records headAfterWrite
> = 3.
> * Now QueueHead is 3.
> * C enters PreCommit_Notify(), it records headBeforeWrite = 3 and
> writes to 5, and records headAfterWrite = 5.
No, when C enters PreCommit_Notify, it will be waiting on the
heavyweight lock, currently held by B, which B will hold
until it commits. It will then see headBeforeWrite = 3.
> * Now QueueHead is 5
> * C starts to run AtCommit_Notify(), as A’s head is 1, doesn’t equal
> to C’s headBeforeWrite, C won’t advance A’s head.
> * A starts to run AtCommit_Notify(), A’s head equals to B’s
> beforeHeadWrite, B will advance A’s head to 3.
No, as explained above, B cannot be running here,
it must have committed already (or aborted) since C
was waiting on the heavyweight lock held by B.
The example therefore seems invalid to me.
> I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.
>
> In PreCommit_Notify(), we can maintain a local hash table to record
> pending notifications’ channel names. dahash also supports hash table in
> local memory.
I'm confused, I assume you mean "dynahash" since there is no "dahash"
in the sources? I see dynahash has local-to-a-backend support,
but I don't see why we would need a hash table for this,
we just iterate over it once in SignalBackends,
I think the local list is fine.
The latest version gets rid of GetPendingNotifyChannels()
and replaces it with the local list pendingNotifyChannels.
> And the local static numChannelsListeningOn is also not needed. We can
> get the count from the local hash.
No, you're mixing up the data structures.
The local hash you suggested was for pending notify channels,
but numChannelsListeningOn was needed when we didn't have
listenChannels. Now that I've reverted back to listenChannels,
I also replaced `(numChannelsListeningOn == 0)`
with `(listenChannels == NIL)`.
> WRT to v6, I got a few new comments:
...
> In this comment, you refer to “channelHash” and “the shared channel
> hash table”, they are the same thing, but easy to make readers to
> misunderstand.
Right, will try to improve this in the next version.
> pg_listening_channels(PG_FUNCTION_ARGS)
> {
> FuncCallContext *funcctx;
> + List *listenChannels;
...
> listenChannels is only used within the “if”, so its definition can be
> moved into the “if”.
Comment not applicable since local variable listenChannels has now been
removed from pg_listening_channels, now using the original static
listenChannels instead.
> + /* Check for lagging backends when the queue spans multiple pages */
> + if (queue_length > 0)
...
> I wonder why this check is needed. If queue_length is 0, can we return
> immediately from SignalBackends()?
This check has been removed in the latest version.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 19:53 Arseniy Mukhin <[email protected]>
parent: Tom Lane <[email protected]>
2 siblings, 1 reply; 120+ messages in thread
From: Arseniy Mukhin @ 2025-10-15 19:53 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
On Wed, Oct 15, 2025 at 5:16 PM Tom Lane <[email protected]> wrote:
>
> Arseniy Mukhin <[email protected]> writes:
> > I think "Direct advancement" is a good idea. But the way it's
> > implemented now has a concurrency bug. Listeners store their current
> > position in the local variable 'pos' while reading in
> > asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. This
> > means a notifier can directly advance the listener's position while
> > the listener still has the old value in its local variable. At the
> > same time, we use listener positions to find the limit up to which we
> > can truncate the queue in asyncQueueAdvanceTail().
>
> Good catch!
>
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation. Alternatively, split the queue pos
> into "this is where to read next" and "this is as much as I'm
> definitively done with", where the second field gets advanced at
> the end of asyncQueueReadAllNotifications. Not sure which
> view would be less confusing (in the end I guess they're nearly
> the same thing, differently explained).
>
> A different line of thought could be to get rid of
> asyncQueueReadAllNotifications's optimization of moving the
> queue pos only once, per
>
> * (We could alternatively retake NotifyQueueLock and move the position
> * before handling each individual message, but that seems like too much
> * lock traffic.)
>
> Since we only need shared lock to advance our own queue pos,
> maybe that wouldn't be too awful. Not sure.
>
> regards, tom lane
Advisory queue position field sounds good IMHO. Listeners are still
solely responsible for advancing their positions so they still need to
wake up to do it, but they will only do so if there are relevant
notifications, or if they are too far behind. In any case they will be
able to jump over all irrelevant stuff.
Best regards,
Arseniy Mukhin
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 20:39 Joel Jacobson <[email protected]>
parent: Arseniy Mukhin <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-15 20:39 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; Tom Lane <[email protected]>; +Cc: pgsql-hackers
On Wed, Oct 15, 2025, at 13:19, Arseniy Mukhin wrote:
> I tried the patch and it seems listeners sometimes don't receive
> notifications. To reproduce it you can try to listen to the channel in
> one psql session and send notifications from another psql session. But
> all tests are fine, so I tried to write a TAP test to reproduce it. It
> passes on master and fails with the patch, so looks like it's real.
> Please find the repro in attachments. I added the TAP test to amcheck
> module just for simplicity.
Indeed a good catch! Thanks for the TAP test. I've migrated it to
async-notify.spec, included in 0001-optimize_listen_notify-v18.patch:
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
To fix this, we now do an initChannelHash call at the beginning of
SignalBackends. The problem was that if no LISTEN had been done in
the session which did a NOTIFY, that backend would never have attached
to the channel hash, so it found no listeners to signal. Added a point
about this in the 0002-optimize_listen_notify-v18.patch header:
* SignalBackends attaches to the channel hash at the start, ensuring
that backends performing NOTIFY without having done LISTEN can still
find listeners in the shared hash table.
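The failure mode and the fix can be modelled in a few lines. This is a simplified sketch with hypothetical names (SimSharedTable, SimBackendState, sim_attach, sim_listen, sim_notify); it uses a one-entry table in ordinary memory instead of the real dshash table, purely to show why a notifier that never attached sees no listeners.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the table that lives in shared memory. */
typedef struct SimSharedTable
{
    const char *channel;    /* the single registered channel, or NULL */
    int listener;           /* listener id registered for that channel */
} SimSharedTable;

static SimSharedTable sim_shared = {NULL, -1};

/* Per-backend handle; stays NULL until this backend attaches. */
typedef struct SimBackendState
{
    SimSharedTable *table;
} SimBackendState;

/* Lazily attach this backend to the shared table. */
static void
sim_attach(SimBackendState *me)
{
    if (me->table == NULL)
        me->table = &sim_shared;
}

/* LISTEN always attaches before registering. */
static void
sim_listen(SimBackendState *me, const char *channel, int id)
{
    sim_attach(me);
    assert(me->table != NULL);
    me->table->channel = channel;
    me->table->listener = id;
}

/*
 * NOTIFY: returns the listener id to signal, or -1 if none was found.
 * attach_first = false models the reported bug, true models the fix.
 */
static int
sim_notify(SimBackendState *me, const char *channel, bool attach_first)
{
    if (attach_first)
        sim_attach(me);

    if (me->table == NULL)
        return -1;              /* looks like "no listeners" */

    if (me->table->channel != NULL &&
        strcmp(me->table->channel, channel) == 0)
        return me->table->listener;

    return -1;
}
```

With a listener registered on a channel, a notifier that has never attached gets -1 from sim_notify(..., false), while sim_notify(..., true) finds the registered listener.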
On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation. Alternatively, split the queue pos
> into "this is where to read next" and "this is as much as I'm
> definitively done with", where the second field gets advanced at
> the end of asyncQueueReadAllNotifications. Not sure which
> view would be less confusing (in the end I guess they're nearly
> the same thing, differently explained).
>
> A different line of thought could be to get rid of
> asyncQueueReadAllNotifications's optimization of moving the
> queue pos only once, per
>
> * (We could alternatively retake NotifyQueueLock and move the position
> * before handling each individual message, but that seems like too much
> * lock traffic.)
>
> Since we only need shared lock to advance our own queue pos,
> maybe that wouldn't be too awful. Not sure.
These all sound like promising ideas.
I went ahead and tried the "split the queue pos" idea, implemented
in 0002-optimize_listen_notify-v18.patch:
Position tracking for truncation safety
----------------------------------------
To prevent race conditions during queue truncation when using direct
advancement, backend positions are now tracked using two fields:
* pos: The next position to read from. This can be advanced by other
backends via direct advancement to skip over uninteresting
notifications.
* donePos: What the backend has definitively processed and no longer
needs. This is used for determining safe truncation points.
Without this separation, a backend could be advanced by another backend
while it's reading notifications, then write back its stale local
position that points to an already-truncated page. By using donePos for
truncation decisions and taking the maximum of local and shared pos when
updating, we ensure that truncation waits for backends to finish
reading, while still allowing position advancement for optimization.
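A minimal model of the two-field scheme described above might look like this. It is a sketch under assumptions, not the code in 0002-optimize_listen_notify-v18.patch: SimBackend and the sim_* functions are hypothetical, positions are plain integers, and setting donePos to the merged position at the end of a read is an assumption of this sketch. How donePos is maintained for a backend that is idle when it gets advanced is not covered here.

```c
#include <assert.h>

/* Hypothetical per-backend state with the pos/donePos split. */
typedef struct SimBackend
{
    int pos;        /* next position to read; notifiers may advance it */
    int donePos;    /* definitively processed up to here; limits truncation */
} SimBackend;

/* Notifier side: skip a backend past a region it has no interest in. */
static void
sim_direct_advance(SimBackend *b, int old_head, int new_head)
{
    assert(new_head >= old_head);
    if (b->pos == old_head)
        b->pos = new_head;
}

/* Listener side: called once reading has finished at local_pos. */
static void
sim_finish_read(SimBackend *b, int local_pos)
{
    /* Take the maximum so a stale local value never moves pos backwards. */
    if (local_pos > b->pos)
        b->pos = local_pos;
    b->donePos = b->pos;
}

/* Truncation side: only donePos is consulted. */
static int
sim_truncation_limit(const SimBackend *backends, int n, int head)
{
    int limit = head;

    for (int i = 0; i < n; i++)
        if (backends[i].donePos < limit)
            limit = backends[i].donePos;
    return limit;
}
```

While a backend is still reading, its donePos holds the truncation limit back even though pos has already been advanced; once sim_finish_read() runs, both fields agree again.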
On Wed, Oct 15, 2025, at 21:53, Arseniy Mukhin wrote:
> Advisory queue position field sounds good IMHO. Listeners are still
> solely responsible for advancing their positions so they still need to
> wake up to do it, but they will only do so if there are relevant
> notifications, or if they are too far behind. In any case they will be
> able to jump over all irrelevant stuff.
I read your message too late; otherwise I would have tried that
approach first. I will try to implement that one too, and perhaps
also the third one, and then we can evaluate them to see which
one we prefer.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v18.patch (9.3K, 2-0001-optimize_listen_notify-v18.patch)
From f37095250521d0a29d812997b7b79d938ed9c894 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v18.patch (34.3K, 3-0002-optimize_listen_notify-v18.patch)
From 620f620ac671e9d9ef2694903e108103d4e82c8e Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
Position tracking for truncation safety
----------------------------------------
To prevent race conditions during queue truncation when using direct
advancement, backend positions are now tracked using two fields:
* pos: The next position to read from. This can be advanced by other
backends via direct advancement to skip over uninteresting
notifications.
* donePos: What the backend has definitively processed and no longer
needs. This is used for determining safe truncation points.
Without this separation, a backend could be advanced by another backend
while it's reading notifications, then write back its stale local
position that points to an already-truncated page. By using donePos for
truncation decisions and taking the maximum of local and shared pos when
updating, we ensure that truncation waits for backends to finish
reading, while still allowing position advancement for optimization.
Queue health
------------
If a backend has fallen too far behind (lag >= QUEUE_CLEANUP_DELAY
pages), it is signaled to catch up so the global queue tail can advance.
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* SignalBackends attaches to the channel hash at the start, ensuring
that backends performing NOTIFY without having done LISTEN can still
find listeners in the shared hash table.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 565 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 520 insertions(+), 49 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..678c8174cb2 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,27 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +141,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +151,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +179,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +267,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -245,7 +285,10 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to
+ * here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -260,9 +303,9 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends, change the head pointer, and advance other
+ * backends' queue positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +331,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +349,8 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -418,6 +468,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +521,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Compute the difference between two queue page numbers.
@@ -478,6 +545,80 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +662,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -894,6 +1040,7 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1069,35 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * Build list of unique channels for SignalBackends().
+ */
+ pendingNotifyChannels = NIL;
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,12 +1115,33 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+
+ /*
+ * Capture the queue head after each batch of entries. On the
+ * last iteration, this gives us the final queue head position for
+ * SignalBackends() to use when advancing idle backends.
+ */
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -1097,6 +1294,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -1135,6 +1333,10 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
MemoryContext oldcontext;
/* Do nothing if we are already listening on this channel */
@@ -1152,21 +1354,84 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from the list of channels we are listening on.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
ListCell *q;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ /* Remove from our local cache */
foreach(q, listenChannels)
{
char *lchan = (char *) lfirst(q);
@@ -1179,6 +1444,46 @@ Exec_UnlistenCommit(const char *channel)
}
}
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
+ }
+ }
+
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,11 +1498,51 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /* Clear our local cache */
list_free_deep(listenChannels);
listenChannels = NIL;
+
+ /* Now clear from shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
@@ -1565,12 +1910,19 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are still positioned at the queue head from before our
+ * commit can be safely advanced directly to the new head, since the
+ * queue region we wrote is known to contain only our own notifications.
+ * This avoids unnecessary wakeups when there is nothing of interest to
+ * them.
+ *
+ * In addition, if a backend has fallen too far behind in the queue, we
+ * signal it so that it will advance its position and allow the global
+ * tail pointer to move forward.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1935,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1956,111 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
+
+ if (channelHash != NULL)
+ {
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement and lagging backend detection.
+ *
+ * Direct advancement: avoid waking backends still positioned at the old
+ * queue head that aren't interested in our notifications.
+ *
+ * The heavyweight lock on database 0 (held in PreCommit_Notify) ensures
+ * no other backend can insert notifications in the region we just wrote.
+ * Even though we may take and release NotifyQueueLock multiple times
+ * while writing, the heavyweight lock guarantees this region contains
+ * only our messages. Therefore, any backend still positioned at the
+ * queue head from before our write can be safely advanced to the current
+ * queue head without waking it.
+ *
+ * False-positive possibility: if a backend was previously signaled but
+ * hasn't yet awoken, we'll skip advancing it (because wakeupPending is
+ * true). This is safe - the backend will advance its pointer when it
+ * does wake up. The alternative (advancing it anyway) would risk
+ * advancing over notifications from whoever signaled it.
+ *
+ * Lagging backends: we also check if any backend has fallen too far
+ * behind and signal it to catch up, allowing the global tail to advance.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- int32 pid = QUEUE_BACKEND_PID(i);
QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
- Assert(pid != InvalidPid);
pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+
+ /* Direct advancement for idle backends at the old head */
+ if (pendingNotifies != NULL &&
+ QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ pos = queueHeadAfterWrite;
}
- else
+
+ /* Signal backends that have fallen too far behind */
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ if (lag >= QUEUE_CLEANUP_DELAY)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1865,6 +2299,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -1985,9 +2420,20 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos
+ * variable), as this is used for truncation safety. For the read
+ * position (pos), we use the maximum of our local position and the
+ * current shared position, in case another backend used direct
+ * advancement to skip us ahead while we were reading. This prevents
+ * us from going backwards and potentially pointing to a truncated
+ * page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2137,7 +2583,14 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
@@ -2373,7 +2826,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2838,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2849,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 21:10 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
2 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-15 21:10 UTC (permalink / raw)
To: Tom Lane <[email protected]>; Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation.
I want to experiment with this idea too.
I assume the separate "advisory" queue position field
would actually need to be two struct fields, since a queue position
consists of a page and an offset, right?
typedef struct QueuePosition
{
int64 page; /* SLRU page number */
int offset; /* byte offset within page */
+ int64 advisoryPage; /* suggested skip-ahead page */
+ int advisoryOffset; /* suggested skip-ahead offset */
} QueuePosition;
Or would we rather want a single "advisory" field that would also
be of type QueuePosition?
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-15 21:15 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Tom Lane @ 2025-10-15 21:15 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Arseniy Mukhin <[email protected]>; pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> I assume the separate "advisory" queue position field
> would actually need to be two struct fields, since a queue position
> consists of a page and an offset, right?
No, I'd think you'd have both
QueuePosition pos; /* backend has read queue up to here */
QueuePosition advisory_pos; /* backend could skip queue to here */
in QueueBackendStatus. The other seems way too confusing.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-16 02:54 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-16 02:54 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers
> On Oct 15, 2025, at 23:36, Joel Jacobson <[email protected]> wrote:
>
>> I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.
>>
>> In PreCommit_Notify(), we can maintain a local hash table to record
>> pending notifications’ channel names. dahash also supports hash table in
>> local memory.
>
> I'm confused, I assume you mean "dynahash" since there is no "dahash"
> in the sources? I see dynahash has local-to-a-backend support,
> but I don't see why we would need a hash table for this,
> we just iterate over it once in SignalBackends,
> I think the local list is fine.
>
> The latest version gets rid of GetPendingNotifyChannels()
> and replaces it with the local list pendingNotifyChannels.
Sorry for the typo. Yes, I meant “dynahash”, which you are already using.
In v18, I see you are building “pendingNotifyChannels” in PreCommit_Notify() as a “List”:
```
+ /*
+ * Build list of unique channels for SignalBackends().
+ */
+ pendingNotifyChannels = NIL;
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
```
My suggestion of using dynahash was for the same purpose. list_member_ptr() walks the list nodes until it finds the target, so this code is still O(n^2).
Using a hash table would make it faster. I used to work on the Concourse project [1], a system that relies heavily on the LISTEN/NOTIFY mechanism. There can be thousands of channels at runtime, and in that case a hash lookup would be much faster than a linear search.
[1] https://github.com/concourse/concourse
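To illustrate the difference being discussed, here is a minimal standalone sketch of hash-based de-duplication of channel names. It is not taken from the patch and does not use dynahash: ChannelSet, channel_set_init(), channel_set_add(), hash_channel(), and SET_CAPACITY are hypothetical names invented for this example, and the fixed-size open-addressing table is only a stand-in for whatever hash table the backend would actually use. Names are compared by content with strcmp().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed table size for this sketch; must be a power of two. */
#define SET_CAPACITY 1024

typedef struct ChannelSet
{
	const char *slots[SET_CAPACITY];	/* NULL marks an empty slot */
	size_t		count;					/* number of distinct names stored */
} ChannelSet;

/* FNV-1a string hash (an arbitrary choice for this sketch). */
static uint32_t
hash_channel(const char *s)
{
	uint32_t	h = 2166136261u;

	while (*s)
	{
		h ^= (unsigned char) *s++;
		h *= 16777619u;
	}
	return h;
}

static void
channel_set_init(ChannelSet *set)
{
	for (size_t i = 0; i < SET_CAPACITY; i++)
		set->slots[i] = NULL;
	set->count = 0;
}

/*
 * Insert a channel name if it is not already present.
 * Returns 1 if the name was added, 0 if it was a duplicate.
 * Expected cost is O(1) per call, versus O(n) for scanning a list.
 */
static int
channel_set_add(ChannelSet *set, const char *channel)
{
	size_t		i = hash_channel(channel) & (SET_CAPACITY - 1);

	/* Sketch only: no resizing, so keep at least one slot empty. */
	assert(set->count < SET_CAPACITY - 1);

	while (set->slots[i] != NULL)
	{
		if (strcmp(set->slots[i], channel) == 0)
			return 0;			/* already recorded */
		i = (i + 1) & (SET_CAPACITY - 1);	/* linear probing */
	}
	set->slots[i] = channel;
	set->count++;
	return 1;
}
```

With this shape, collecting the unique channels for n pending notifications costs about n hash probes rather than up to n list scans, which is the O(n) versus O(n^2) distinction raised above. Whether that matters in practice depends on how many distinct channels a single transaction notifies.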
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-16 09:39 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
2 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-16 09:39 UTC (permalink / raw)
To: Tom Lane <[email protected]>; Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> Arseniy Mukhin <[email protected]> writes:
>> I think "Direct advancement" is a good idea. But the way it's
>> implemented now has a concurrency bug. A listener stores its current
>> position in the local variable 'pos' while reading in
>> asyncQueueReadAllNotifications() and doesn't hold NotifyQueueLock. This
>> means that a notifier can directly advance the listener's position
>> while the listener still has an old value in the local variable. At the
>> same time, we use listener positions to find the limit up to which we
>> can truncate the queue in asyncQueueAdvanceTail().
>
> Good catch!
I've implemented the three ideas presented below, attached as .txt files
that are diffs on top of v19, which has these changes since v17:
0002-optimize_listen_notify-v19.patch:
* Improve wording of top comment per request from Chao Li.
* Add initChannelHash call to top of SignalBackends,
to fix bug reported by Arseniy Mukhin.
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation.
Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
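
As a rough standalone model of the advisory-position idea (not code from the attachment): the notifier only writes a hint, and the listener alone moves its own pos. BackendStatus, advise_skip(), begin_read(), pos_equal(), and pos_precedes() are hypothetical names for this sketch. It ignores page wraparound and omits locking, whereas the real code reads and writes these fields while holding NotifyQueueLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct QueuePosition
{
	int64_t		page;			/* SLRU page number */
	int			offset;			/* byte offset within page */
} QueuePosition;

/* Simplified per-backend state for this sketch. */
typedef struct BackendStatus
{
	QueuePosition pos;			/* written only by the listener itself */
	QueuePosition advisoryPos;	/* hint written by notifiers */
} BackendStatus;

static bool
pos_equal(QueuePosition a, QueuePosition b)
{
	return a.page == b.page && a.offset == b.offset;
}

/* Plain ordering; the real queue must also handle page wraparound. */
static bool
pos_precedes(QueuePosition a, QueuePosition b)
{
	return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/*
 * Notifier side: if the listener sits exactly at the head we saw before
 * writing, suggest that it may skip to the head after our write.
 * Note that pos itself is left untouched.
 */
static void
advise_skip(BackendStatus *b, QueuePosition headBefore, QueuePosition headAfter)
{
	if (pos_equal(b->pos, headBefore))
		b->advisoryPos = headAfter;
}

/*
 * Listener side: at the start of a read pass, adopt the hint if it is
 * ahead of our position. Returns the position to start reading from.
 */
static QueuePosition
begin_read(BackendStatus *b)
{
	if (pos_precedes(b->pos, b->advisoryPos))
		b->pos = b->advisoryPos;
	return b->pos;
}
```

Because only the listener ever changes pos in this model, a truncation limit computed from pos cannot jump past a page the listener is still reading.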
> Alternatively, split the queue pos
> into "this is where to read next" and "this is as much as I'm
> definitively done with", where the second field gets advanced at
> the end of asyncQueueReadAllNotifications. Not sure which
> view would be less confusing (in the end I guess they're nearly
> the same thing, differently explained).
Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt
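
A comparable standalone model of the pos/donePos split (again not code from the attachment): other backends may push pos forward, while donePos only ever reflects what the listener has actually finished reading, and the tail is computed from donePos. BackendStatus, finish_read(), compute_tail(), and the pos_* helpers are hypothetical names for this sketch. Wraparound and locking are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct QueuePosition
{
	int64_t		page;			/* SLRU page number */
	int			offset;			/* byte offset within page */
} QueuePosition;

/* Simplified per-backend state for this sketch. */
typedef struct BackendStatus
{
	QueuePosition pos;			/* next position to read from */
	QueuePosition donePos;		/* definitively processed up to here */
} BackendStatus;

/* Plain ordering; the real queue must also handle page wraparound. */
static bool
pos_precedes(QueuePosition a, QueuePosition b)
{
	return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

static QueuePosition
pos_max(QueuePosition a, QueuePosition b)
{
	return pos_precedes(a, b) ? b : a;
}

static QueuePosition
pos_min(QueuePosition a, QueuePosition b)
{
	return pos_precedes(b, a) ? b : a;
}

/*
 * End of a read pass: record what we really read in donePos, and never
 * move pos backwards if someone advanced it while we were reading.
 */
static void
finish_read(BackendStatus *b, QueuePosition localPos)
{
	b->donePos = localPos;
	b->pos = pos_max(localPos, b->pos);
}

/* Tail computation: the minimum donePos over all listeners, capped at head. */
static QueuePosition
compute_tail(const BackendStatus *backends, int n, QueuePosition head)
{
	QueuePosition min = head;

	for (int i = 0; i < n; i++)
		min = pos_min(min, backends[i].donePos);
	return min;
}
```

In this model a listener whose pos was advanced to page 5 while it was still reading on page 2 keeps the tail at page 2 until its next pass completes.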
> A different line of thought could be to get rid of
> asyncQueueReadAllNotifications's optimization of moving the
> queue pos only once, per
>
> * (We could alternatively retake NotifyQueueLock and move the position
> * before handling each individual message, but that seems like too much
> * lock traffic.)
>
> Since we only need shared lock to advance our own queue pos,
> maybe that wouldn't be too awful. Not sure.
Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
/Joel
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 90a530cfc61..44442e927ff 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* backend could skip queue to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -668,6 +675,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -2009,9 +2017,14 @@ SignalBackends(void)
* Even though we may take and release NotifyQueueLock multiple times
* while writing, the heavyweight lock guarantees this region contains
* only our messages. Therefore, any backend still positioned at the
- * queue head from before our write can be safely advanced to the current
+ * queue head from before our write can be advised to skip to the current
* queue head without waking it.
*
+ * We use the advisoryPos field rather than directly modifying pos,
+ * because the listening backend might be concurrently reading
+ * notifications using its local copy of pos. The backend controls its
+ * own pos field and will check advisoryPos when it's safe to do so.
+ *
* False-positive possibility: if a backend was previously signaled but
* hasn't yet awoken, we'll skip advancing it (because wakeupPending is
* true). This is safe - the backend will advance its pointer when it
@@ -2038,7 +2051,7 @@ SignalBackends(void)
if (pendingNotifies != NULL &&
QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
pos = queueHeadAfterWrite;
}
@@ -2297,6 +2310,26 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
+
+ /*
+ * Check if another backend has set an advisory position for us.
+ * If so, and if we haven't yet read past that point, we can safely
+ * adopt the advisory position and skip the intervening notifications.
+ * This is safe because the advisory position is only set when we're
+ * positioned at a known point and the skipped region contains only
+ * notifications we're not interested in.
+ */
+ {
+ QueuePosition advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (!QUEUE_POS_EQUAL(advisoryPos, pos) &&
+ QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
+ }
+
LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 90a530cfc61..e201deb5e54 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -70,14 +70,14 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the local listen state (listenChannels) and
- * shared channel hash table (channelHash). Then we signal any backends
- * that may be interested in our messages (including our own backend,
- * if listening). This is done by SignalBackends(), which consults the
- * shared channel hash table to identify listeners for the channels that
- * have pending notifications in the current database. Each selected
- * backend is marked as having a wakeup pending to avoid duplicate signals,
- * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ * make any actual updates to the effective listen state (channelHash).
+ * Then we signal any backends that may be interested in our messages
+ * (including our own backend, if listening). This is done by
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
*
* When writing notifications, PreCommit_Notify() records the queue head
* position both before and after the write. Because all writers serialize
@@ -2282,6 +2282,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool reachedStop;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2350,77 +2351,83 @@ asyncQueueReadAllNotifications(void)
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If
* that happens it is critical that we not try to send the same message
- * over and over again. Therefore, we place a PG_TRY block here that will
- * forcibly advance our queue position before we lose control to an error.
- * (We could alternatively retake NotifyQueueLock and move the position
- * before handling each individual message, but that seems like too much
- * lock traffic.)
+ * over and over again. Therefore, we must advance our queue position
+ * regularly as we process messages.
+ *
+ * We must also be careful about concurrency: SignalBackends() can
+ * directly advance our position while we're reading. To prevent
+ * overwriting such an advancement with a stale value, we update our
+ * position in shared memory after processing messages from each page,
+ * while holding NotifyQueueLock. Shared lock is sufficient since we're
+ * only updating our own position.
*/
- PG_TRY();
+ do
{
- bool reachedStop;
+ int64 curpage = QUEUE_POS_PAGE(pos);
+ int curoffset = QUEUE_POS_OFFSET(pos);
+ int slotno;
+ int copysize;
- do
+ /*
+ * We copy the data from SLRU into a local buffer, so as to avoid
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
+ */
+ slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+ InvalidTransactionId);
+ if (curpage == QUEUE_POS_PAGE(head))
{
- int64 curpage = QUEUE_POS_PAGE(pos);
- int curoffset = QUEUE_POS_OFFSET(pos);
- int slotno;
- int copysize;
+ /* we only want to read as far as head */
+ copysize = QUEUE_POS_OFFSET(head) - curoffset;
+ if (copysize < 0)
+ copysize = 0; /* just for safety */
+ }
+ else
+ {
+ /* fetch all the rest of the page */
+ copysize = QUEUE_PAGESIZE - curoffset;
+ }
+ memcpy(page_buffer.buf + curoffset,
+ NotifyCtl->shared->page_buffer[slotno] + curoffset,
+ copysize);
+ /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
- /*
- * We copy the data from SLRU into a local buffer, so as to avoid
- * holding the SLRU lock while we are examining the entries and
- * possibly transmitting them to our frontend. Copy only the part
- * of the page we will actually inspect.
- */
- slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
- if (curpage == QUEUE_POS_PAGE(head))
- {
- /* we only want to read as far as head */
- copysize = QUEUE_POS_OFFSET(head) - curoffset;
- if (copysize < 0)
- copysize = 0; /* just for safety */
- }
- else
- {
- /* fetch all the rest of the page */
- copysize = QUEUE_PAGESIZE - curoffset;
- }
- memcpy(page_buffer.buf + curoffset,
- NotifyCtl->shared->page_buffer[slotno] + curoffset,
- copysize);
- /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+ /*
+ * Process messages up to the stop position, end of page, or an
+ * uncommitted message.
+ *
+ * Our stop position is what we found to be the head's position
+ * when we entered this function. It might have changed already.
+ * But if it has, we will receive (or have already received and
+ * queued) another signal and come here again.
+ *
+ * We are not holding NotifyQueueLock here! The queue can only
+ * extend beyond the head pointer (see above). We update our
+ * backend's position after processing messages from each page to
+ * ensure we don't reprocess messages if we fail partway through,
+ * and to avoid overwriting any direct advancement that
+ * SignalBackends() might perform concurrently.
+ */
+ reachedStop = asyncQueueProcessPageEntries(&pos, head,
+ page_buffer.buf,
+ snapshot);
- /*
- * Process messages up to the stop position, end of page, or an
- * uncommitted message.
- *
- * Our stop position is what we found to be the head's position
- * when we entered this function. It might have changed already.
- * But if it has, we will receive (or have already received and
- * queued) another signal and come here again.
- *
- * We are not holding NotifyQueueLock here! The queue can only
- * extend beyond the head pointer (see above) and we leave our
- * backend's pointer where it is so nobody will truncate or
- * rewrite pages under us. Especially we don't want to hold a lock
- * while sending the notifications to the frontend.
- */
- reachedStop = asyncQueueProcessPageEntries(&pos, head,
- page_buffer.buf,
- snapshot);
- } while (!reachedStop);
- }
- PG_FINALLY();
- {
- /* Update shared state */
+ /*
+ * Update our position in shared memory. The 'pos' variable now
+ * holds our new position (advanced past all messages we just
+ * processed). This ensures that if we fail while processing
+ * messages from the next page, we won't reprocess the ones we
+ * just handled. It also prevents us from overwriting any direct
+ * advancement that another backend might have done while we were
+ * processing messages.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
- }
- PG_END_TRY();
+
+ } while (!reachedStop);
/* Done with snapshot */
UnregisterSnapshot(snapshot);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 90a530cfc61..751400b8315 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -285,7 +285,8 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +348,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -668,6 +670,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1290,6 +1293,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2415,9 +2419,19 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos variable),
+ * as this is used for truncation safety. For the read position (pos),
+ * we use the maximum of our local position and the current shared
+ * position, in case another backend used direct advancement to skip us
+ * ahead while we were reading. This prevents us from going backwards
+ * and potentially pointing to a truncated page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2567,7 +2581,13 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v19.patch (9.3K, 2-0001-optimize_listen_notify-v19.patch)
From f37095250521d0a29d812997b7b79d938ed9c894 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v19.patch (31.3K, 3-0002-optimize_listen_notify-v19.patch)
From 8d77fa4296f530b0381cf2e612774f0feaf8b506 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
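
As an illustration only, the targeted signaling described above can be
sketched in plain C with process-local memory. This is not the patch's
code: the real implementation uses a dshash table in DSA memory, and
every name below (ChannelSketch, add_listener, signal_channel, the fixed
capacities) is a hypothetical stand-in chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHANNELS 16			/* hypothetical fixed capacity */
#define MAX_BACKENDS 64			/* hypothetical backend limit */
#define INITIAL_LISTENERS 4		/* initial size of each listener array */

typedef struct ChannelSketch
{
	unsigned	dboid;			/* database the channel belongs to */
	char		channel[64];	/* channel name */
	int		   *listeners;		/* backend numbers, grown by doubling */
	int			numListeners;
	int			allocated;
} ChannelSketch;

static ChannelSketch channels[MAX_CHANNELS];
static int	numChannels = 0;
static bool wakeupPending[MAX_BACKENDS];

/* Linear lookup by (dboid, channel); the patch uses a hash table instead. */
static ChannelSketch *
find_channel(unsigned dboid, const char *name)
{
	for (int i = 0; i < numChannels; i++)
		if (channels[i].dboid == dboid &&
			strcmp(channels[i].channel, name) == 0)
			return &channels[i];
	return NULL;
}

/* Register a backend on a channel; returns false if out of space. */
static bool
add_listener(unsigned dboid, const char *name, int backend)
{
	ChannelSketch *c = find_channel(dboid, name);

	assert(backend >= 0 && backend < MAX_BACKENDS);
	if (c == NULL)
	{
		if (numChannels == MAX_CHANNELS)
			return false;
		c = &channels[numChannels];
		c->listeners = malloc(sizeof(int) * INITIAL_LISTENERS);
		if (c->listeners == NULL)
			return false;
		numChannels++;
		c->dboid = dboid;
		strncpy(c->channel, name, sizeof(c->channel) - 1);
		c->channel[sizeof(c->channel) - 1] = '\0';
		c->numListeners = 0;
		c->allocated = INITIAL_LISTENERS;
	}
	for (int i = 0; i < c->numListeners; i++)
		if (c->listeners[i] == backend)
			return true;		/* already registered */
	if (c->numListeners == c->allocated)
	{
		int		   *grown = realloc(c->listeners,
									sizeof(int) * c->allocated * 2);

		if (grown == NULL)
			return false;
		c->listeners = grown;
		c->allocated *= 2;
	}
	c->listeners[c->numListeners++] = backend;
	return true;
}

/* Mark listeners of one channel for wakeup; returns how many were new. */
static int
signal_channel(unsigned dboid, const char *name)
{
	ChannelSketch *c = find_channel(dboid, name);
	int			signaled = 0;

	if (c == NULL)
		return 0;
	for (int i = 0; i < c->numListeners; i++)
	{
		int			b = c->listeners[i];

		if (wakeupPending[b])
			continue;			/* a wakeup is already in flight */
		wakeupPending[b] = true;
		signaled++;
	}
	return signaled;
}
```

In this sketch a second signal_channel() call for the same channel
returns 0 until the pending flags are cleared, which mirrors the purpose
of the wakeupPending flag.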
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
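
A minimal sketch of the direct-advancement decision, assuming a
simplified model in which each backend has a queue position and a flag
saying whether it listens on one of the notified channels. The types and
the function advance_or_signal() are hypothetical and exist only for
this example. The actual patch works on shared memory under
NotifyQueueLock.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct PosSketch
{
	long		page;
	int			offset;
} PosSketch;

typedef struct BackendSketch
{
	PosSketch	pos;				/* how far this backend has read */
	bool		listensOnNotified;	/* listens on a notified channel? */
	bool		signaled;			/* set when we decide to wake it */
} BackendSketch;

static bool
pos_equal(PosSketch a, PosSketch b)
{
	return a.page == b.page && a.offset == b.offset;
}

/*
 * Wake interested backends; move uninterested backends that sit exactly
 * at oldHead straight to newHead.  Returns the number advanced directly.
 */
static int
advance_or_signal(BackendSketch *backends, int n,
				  PosSketch oldHead, PosSketch newHead)
{
	int			advanced = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
	{
		BackendSketch *b = &backends[i];

		if (b->listensOnNotified)
			b->signaled = true;		/* must read the new entries itself */
		else if (pos_equal(b->pos, oldHead))
		{
			/* [oldHead, newHead) holds only entries it does not want */
			b->pos = newHead;
			advanced++;
		}
		/* else: still behind oldHead, so it is left alone in this sketch */
	}
	return advanced;
}
```

A backend that is behind oldHead is not advanced, because entries it has
not yet read may still be of interest to it.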
Queue health
------------
If a backend has fallen too far behind (lag >= QUEUE_CLEANUP_DELAY
pages), it is signaled to catch up so the global queue tail can advance.
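
The lag test can be pictured as follows. This is a simplified sketch:
CLEANUP_DELAY_SKETCH is a hypothetical stand-in for QUEUE_CLEANUP_DELAY,
needs_catchup_signal() is an invented name, and page-number wraparound
is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define CLEANUP_DELAY_SKETCH 4	/* hypothetical stand-in value */

/* True if a backend is far enough behind the head to be woken up. */
static bool
needs_catchup_signal(long headPage, long backendPage)
{
	assert(backendPage <= headPage);	/* sketch ignores wraparound */
	return (headPage - backendPage) >= CLEANUP_DELAY_SKETCH;
}
```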
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 538 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 495 insertions(+), 47 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..90a530cfc61 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,27 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannels) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +141,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +151,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +179,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +267,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +286,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -260,9 +301,9 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends, change the head pointer, and advance other
+ * backends' queue positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +329,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +347,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -418,6 +465,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +518,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Compute the difference between two queue page numbers.
@@ -478,6 +542,80 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +659,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -894,6 +1036,7 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1065,35 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * Build list of unique channels for SignalBackends().
+ */
+ pendingNotifyChannels = NIL;
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,12 +1111,33 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+
+ /*
+ * Capture the queue head after each batch of entries. On the
+ * last iteration, this gives us the final queue head position for
+ * SignalBackends() to use when advancing idle backends.
+ */
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -1135,6 +1328,10 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
MemoryContext oldcontext;
/* Do nothing if we are already listening on this channel */
@@ -1152,21 +1349,84 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from the list of channels we are listening on.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
ListCell *q;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ /* Remove from our local cache */
foreach(q, listenChannels)
{
char *lchan = (char *) lfirst(q);
@@ -1179,6 +1439,46 @@ Exec_UnlistenCommit(const char *channel)
}
}
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
+ }
+ }
+
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,11 +1493,51 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /* Clear our local cache */
list_free_deep(listenChannels);
listenChannels = NIL;
+
+ /* Now clear from shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
@@ -1565,12 +1905,19 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are still positioned at the queue head from before our
+ * commit can be safely advanced directly to the new head, since the
+ * queue region we wrote is known to contain only our own notifications.
+ * This avoids unnecessary wakeups when there is nothing of interest to
+ * them.
+ *
+ * In addition, if a backend has fallen too far behind in the queue, we
+ * signal it so that it will advance its position and allow the global
+ * tail pointer to move forward.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1930,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1951,111 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
+
+ if (channelHash != NULL)
+ {
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement and lagging backend detection.
+ *
+ * Direct advancement: avoid waking backends still positioned at the old
+ * queue head that aren't interested in our notifications.
+ *
+ * The heavyweight lock on database 0 (held in PreCommit_Notify) ensures
+ * no other backend can insert notifications in the region we just wrote.
+ * Even though we may take and release NotifyQueueLock multiple times
+ * while writing, the heavyweight lock guarantees this region contains
+ * only our messages. Therefore, any backend still positioned at the
+ * queue head from before our write can be safely advanced to the current
+ * queue head without waking it.
+ *
+ * False-positive possibility: if a backend was previously signaled but
+ * hasn't yet awoken, we'll skip advancing it (because wakeupPending is
+ * true). This is safe - the backend will advance its pointer when it
+ * does wake up. The alternative (advancing it anyway) would risk
+ * advancing over notifications from whoever signaled it.
+ *
+ * Lagging backends: we also check if any backend has fallen too far
+ * behind and signal it to catch up, allowing the global tail to advance.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- int32 pid = QUEUE_BACKEND_PID(i);
QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
- Assert(pid != InvalidPid);
pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+
+ /* Direct advancement for idle backends at the old head */
+ if (pendingNotifies != NULL &&
+ QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ pos = queueHeadAfterWrite;
}
- else
+
+ /* Signal backends that have fallen too far behind */
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ if (lag >= QUEUE_CLEANUP_DELAY)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1865,6 +2294,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2373,7 +2803,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2815,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2826,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
[text/plain] 0002-optimize_listen_notify-v19-alt1.txt (3.9K, 4-0002-optimize_listen_notify-v19-alt1.txt)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 90a530cfc61..44442e927ff 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* backend could skip queue to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -668,6 +675,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -2009,9 +2017,14 @@ SignalBackends(void)
* Even though we may take and release NotifyQueueLock multiple times
* while writing, the heavyweight lock guarantees this region contains
* only our messages. Therefore, any backend still positioned at the
- * queue head from before our write can be safely advanced to the current
+ * queue head from before our write can be advised to skip to the current
* queue head without waking it.
*
+ * We use the advisoryPos field rather than directly modifying pos,
+ * because the listening backend might be concurrently reading
+ * notifications using its local copy of pos. The backend controls its
+ * own pos field and will check advisoryPos when it's safe to do so.
+ *
* False-positive possibility: if a backend was previously signaled but
* hasn't yet awoken, we'll skip advancing it (because wakeupPending is
* true). This is safe - the backend will advance its pointer when it
@@ -2038,7 +2051,7 @@ SignalBackends(void)
if (pendingNotifies != NULL &&
QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
pos = queueHeadAfterWrite;
}
@@ -2297,6 +2310,26 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
+
+ /*
+ * Check if another backend has set an advisory position for us.
+ * If so, and if we haven't yet read past that point, we can safely
+ * adopt the advisory position and skip the intervening notifications.
+ * This is safe because the advisory position is only set when we're
+ * positioned at a known point and the skipped region contains only
+ * notifications we're not interested in.
+ */
+ {
+ QueuePosition advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (!QUEUE_POS_EQUAL(advisoryPos, pos) &&
+ QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
+ }
+
LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
[text/plain] 0002-optimize_listen_notify-v19-alt3.txt (7.4K, 5-0002-optimize_listen_notify-v19-alt3.txt)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 90a530cfc61..e201deb5e54 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -70,14 +70,14 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the local listen state (listenChannels) and
- * shared channel hash table (channelHash). Then we signal any backends
- * that may be interested in our messages (including our own backend,
- * if listening). This is done by SignalBackends(), which consults the
- * shared channel hash table to identify listeners for the channels that
- * have pending notifications in the current database. Each selected
- * backend is marked as having a wakeup pending to avoid duplicate signals,
- * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ * make any actual updates to the effective listen state (channelHash).
+ * Then we signal any backends that may be interested in our messages
+ * (including our own backend, if listening). This is done by
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
*
* When writing notifications, PreCommit_Notify() records the queue head
* position both before and after the write. Because all writers serialize
@@ -2282,6 +2282,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool reachedStop;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2350,77 +2351,83 @@ asyncQueueReadAllNotifications(void)
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If
* that happens it is critical that we not try to send the same message
- * over and over again. Therefore, we place a PG_TRY block here that will
- * forcibly advance our queue position before we lose control to an error.
- * (We could alternatively retake NotifyQueueLock and move the position
- * before handling each individual message, but that seems like too much
- * lock traffic.)
+ * over and over again. Therefore, we must advance our queue position
+ * regularly as we process messages.
+ *
+ * We must also be careful about concurrency: SignalBackends() can
+ * directly advance our position while we're reading. To prevent
+ * overwriting such an advancement with a stale value, we update our
+ * position in shared memory after processing messages from each page,
+ * while holding NotifyQueueLock. Shared lock is sufficient since we're
+ * only updating our own position.
*/
- PG_TRY();
+ do
{
- bool reachedStop;
+ int64 curpage = QUEUE_POS_PAGE(pos);
+ int curoffset = QUEUE_POS_OFFSET(pos);
+ int slotno;
+ int copysize;
- do
+ /*
+ * We copy the data from SLRU into a local buffer, so as to avoid
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
+ */
+ slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+ InvalidTransactionId);
+ if (curpage == QUEUE_POS_PAGE(head))
{
- int64 curpage = QUEUE_POS_PAGE(pos);
- int curoffset = QUEUE_POS_OFFSET(pos);
- int slotno;
- int copysize;
+ /* we only want to read as far as head */
+ copysize = QUEUE_POS_OFFSET(head) - curoffset;
+ if (copysize < 0)
+ copysize = 0; /* just for safety */
+ }
+ else
+ {
+ /* fetch all the rest of the page */
+ copysize = QUEUE_PAGESIZE - curoffset;
+ }
+ memcpy(page_buffer.buf + curoffset,
+ NotifyCtl->shared->page_buffer[slotno] + curoffset,
+ copysize);
+ /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
- /*
- * We copy the data from SLRU into a local buffer, so as to avoid
- * holding the SLRU lock while we are examining the entries and
- * possibly transmitting them to our frontend. Copy only the part
- * of the page we will actually inspect.
- */
- slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
- if (curpage == QUEUE_POS_PAGE(head))
- {
- /* we only want to read as far as head */
- copysize = QUEUE_POS_OFFSET(head) - curoffset;
- if (copysize < 0)
- copysize = 0; /* just for safety */
- }
- else
- {
- /* fetch all the rest of the page */
- copysize = QUEUE_PAGESIZE - curoffset;
- }
- memcpy(page_buffer.buf + curoffset,
- NotifyCtl->shared->page_buffer[slotno] + curoffset,
- copysize);
- /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+ /*
+ * Process messages up to the stop position, end of page, or an
+ * uncommitted message.
+ *
+ * Our stop position is what we found to be the head's position
+ * when we entered this function. It might have changed already.
+ * But if it has, we will receive (or have already received and
+ * queued) another signal and come here again.
+ *
+ * We are not holding NotifyQueueLock here! The queue can only
+ * extend beyond the head pointer (see above). We update our
+ * backend's position after processing messages from each page to
+ * ensure we don't reprocess messages if we fail partway through,
+ * and to avoid overwriting any direct advancement that
+ * SignalBackends() might perform concurrently.
+ */
+ reachedStop = asyncQueueProcessPageEntries(&pos, head,
+ page_buffer.buf,
+ snapshot);
- /*
- * Process messages up to the stop position, end of page, or an
- * uncommitted message.
- *
- * Our stop position is what we found to be the head's position
- * when we entered this function. It might have changed already.
- * But if it has, we will receive (or have already received and
- * queued) another signal and come here again.
- *
- * We are not holding NotifyQueueLock here! The queue can only
- * extend beyond the head pointer (see above) and we leave our
- * backend's pointer where it is so nobody will truncate or
- * rewrite pages under us. Especially we don't want to hold a lock
- * while sending the notifications to the frontend.
- */
- reachedStop = asyncQueueProcessPageEntries(&pos, head,
- page_buffer.buf,
- snapshot);
- } while (!reachedStop);
- }
- PG_FINALLY();
- {
- /* Update shared state */
+ /*
+ * Update our position in shared memory. The 'pos' variable now
+ * holds our new position (advanced past all messages we just
+ * processed). This ensures that if we fail while processing
+ * messages from the next page, we won't reprocess the ones we
+ * just handled. It also prevents us from overwriting any direct
+ * advancement that another backend might have done while we were
+ * processing messages.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
- }
- PG_END_TRY();
+
+ } while (!reachedStop);
/* Done with snapshot */
UnregisterSnapshot(snapshot);
[text/plain] 0002-optimize_listen_notify-v19-alt2.txt (3.2K, 6-0002-optimize_listen_notify-v19-alt2.txt)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 90a530cfc61..751400b8315 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -285,7 +285,8 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +348,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -668,6 +670,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1290,6 +1293,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2415,9 +2419,19 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos variable),
+ * as this is used for truncation safety. For the read position (pos),
+ * we use the maximum of our local position and the current shared
+ * position, in case another backend used direct advancement to skip us
+ * ahead while we were reading. This prevents us from going backwards
+ * and potentially pointing to a truncated page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2567,7 +2581,13 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-16 18:16 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-16 18:16 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers
On Thu, Oct 16, 2025, at 04:54, Chao Li wrote:
>> On Oct 15, 2025, at 23:36, Joel Jacobson <[email protected]> wrote:
>> The latest version gets rid of GetPendingNotifyChannels()
>> and replaces it with the local list pendingNotifyChannels.
>
> Sorry for the typo. Yes, I meant the "dynahash" that you have already
> been using.
...
> My suggestion of using dynahash was for the same purpose. Because
> list_member_ptr() iterates through all list nodes until it finds the
> target, this code is still O(n^2).
>
> Using a hash will make it faster. I used to work on the Concourse project
> [1]. That system uses the LISTEN/NOTIFY mechanism heavily, with
> thousands of channels at runtime. In that case, a hash lookup
> would be much faster than a linear search.
>
> [1] https://github.com/concourse/concourse
Building pendingNotifyChannels is O(N^2) yes, but how large N is
realistic here?
Note that pendingNotifyChannels is only the unique channels for the
notifications in the *current transaction*. At Concourse, did you really
do thousands of NOTIFY, with unique channel names, within the same
transaction?
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-16 20:06 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-16 20:06 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers
On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
> On Thu, Oct 16, 2025, at 04:54, Chao Li wrote:
>>> On Oct 15, 2025, at 23:36, Joel Jacobson <[email protected]> wrote:
>>> The latest version gets rid of GetPendingNotifyChannels()
>>> and replaces it with the local list pendingNotifyChannels.
>>
>> Sorry for the typo. Yes, I meant the "dynahash" that you have already
>> been using.
> ...
>> My suggestion of using dynahash was for the same purpose. Because
>> list_member_ptr() iterates through all list nodes until it finds the
>> target, this code is still O(n^2).
>>
>> Using a hash will make it faster. I used to work on the Concourse project
>> [1]. That system uses the LISTEN/NOTIFY mechanism heavily, with
>> thousands of channels at runtime. In that case, a hash lookup
>> would be much faster than a linear search.
>>
>> [1] https://github.com/concourse/concourse
>
> Building pendingNotifyChannels is O(N^2) yes, but how large N is
> realistic here?
>
> Note that pendingNotifyChannels is only the unique channels for the
> notifications in the *current transaction*. At Concourse, did you really
> do thousands of NOTIFY, with unique channel names, within the same
> transaction?
I tested doing
LISTEN ch1;
LISTEN ch2;
...
LISTEN ch100000;
in one backend, and then
\timing on
BEGIN;
NOTIFY ch1;
NOTIFY ch2;
...
NOTIFY ch100000;
COMMIT;
in another backend.
Timing for the final COMMIT of the 100k NOTIFY:
2.127 ms (master)
1428.441 ms (0002-optimize_listen_notify-v19.patch)
I agree this looks like a real problem, since I guess it's not
completely unthinkable that someone might have some kind of trigger
on a table that fires off NOTIFY for each row, possibly causing
hundreds of thousands of notifies in the same db txn.
I tried changing pendingNotifyChannels from a list to dynahash,
which improved the timing, down to 15.169 ms.
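As a rough illustration of where the time goes, here is a standalone sketch in plain C. It is not the patch code: ChannelSet, channel_set_add and channel_list_add are made-up names, and the small fixed-size table only stands in for dynahash. De-duplicating through a list scans everything collected so far on each insert, so N distinct channels cost on the order of N^2 comparisons, whereas a hash set keeps each insert roughly constant time.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical fixed-size open-addressing string set; a stand-in for dynahash. */
#define SET_SLOTS 1024          /* must be a power of two */

typedef struct ChannelSet
{
    const char *slots[SET_SLOTS];
    int         nused;
} ChannelSet;

/* FNV-1a string hash, used here only to pick a starting slot. */
static unsigned int
hash_name(const char *s)
{
    unsigned int h = 2166136261u;

    for (; *s; s++)
    {
        h ^= (unsigned char) *s;
        h *= 16777619u;
    }
    return h;
}

/* Hash-style de-duplication: returns true if name was newly added. */
static bool
channel_set_add(ChannelSet *set, const char *name)
{
    unsigned int i = hash_name(name) & (SET_SLOTS - 1);

    /* Keep at least one empty slot so the probe loop always terminates. */
    assert(set->nused < SET_SLOTS - 1);

    while (set->slots[i] != NULL)
    {
        if (strcmp(set->slots[i], name) == 0)
            return false;       /* already present */
        i = (i + 1) & (SET_SLOTS - 1);
    }
    set->slots[i] = name;
    set->nused++;
    return true;
}

/* List-style de-duplication: every insert scans all earlier entries. */
static bool
channel_list_add(const char **list, int *n, const char *name)
{
    for (int i = 0; i < *n; i++)
    {
        if (strcmp(list[i], name) == 0)
            return false;       /* already present */
    }
    list[(*n)++] = name;
    return true;
}
```

Both functions give the same answers; only the cost per insert differs, which is what shows up once a single transaction touches 100k distinct channels.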
Once we have decided which of the three alternatives to go forward with,
I will add the dynahash code for pendingNotifyChannels.
Nice catch, thanks.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-16 20:16 Tom Lane <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Tom Lane @ 2025-10-16 20:16 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
>> Building pendingNotifyChannels is O(N^2) yes, but how large N is
>> realistic here?
> I agree this looks like a real problem, since I guess it's not
> completely unthinkable that someone might have some kind of trigger
> on a table that fires off NOTIFY for each row, possibly causing
> hundreds of thousands of notifies in the same db txn.
We already de-duplicate identical NOTIFY operations for exactly that
reason (cf. AsyncExistsPendingNotify). However, non-identical NOTIFYs
obviously can't be merged.
I wonder whether we could adapt that de-duplication logic so that
it produces a list of unique channel names in addition to a list
of unique NOTIFY events. One way could be a list/hashtable of
channels used, and for each one a list/hashtable of unique payloads,
rather than the existing single-level list/hashtable.
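A minimal sketch of that two-level shape, in standalone C rather than async.c terms, might look like the following. The names (PendingNotifies, ChannelPending, pending_add) and the fixed-size arrays are assumptions made for illustration, and the lookups are linear only to keep the example short; the real thing would presumably use a list or hash table at each level, as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHANNELS 16         /* illustrative limits only */
#define MAX_PAYLOADS 16

/* Inner level: the unique payloads queued for one channel. */
typedef struct ChannelPending
{
    const char *channel;
    const char *payloads[MAX_PAYLOADS];
    int         npayloads;
} ChannelPending;

/* Outer level: one entry per distinct channel used in the transaction. */
typedef struct PendingNotifies
{
    ChannelPending channels[MAX_CHANNELS];
    int         nchannels;
} PendingNotifies;

/* Returns true if (channel, payload) is new, false if it is a duplicate. */
static bool
pending_add(PendingNotifies *p, const char *channel, const char *payload)
{
    ChannelPending *cp = NULL;

    /* Find the channel entry, creating it on first use. */
    for (int i = 0; i < p->nchannels; i++)
    {
        if (strcmp(p->channels[i].channel, channel) == 0)
        {
            cp = &p->channels[i];
            break;
        }
    }
    if (cp == NULL)
    {
        assert(p->nchannels < MAX_CHANNELS);
        cp = &p->channels[p->nchannels++];
        cp->channel = channel;
        cp->npayloads = 0;
    }

    /* De-duplicate the payload within this channel only. */
    for (int j = 0; j < cp->npayloads; j++)
    {
        if (strcmp(cp->payloads[j], payload) == 0)
            return false;
    }
    assert(cp->npayloads < MAX_PAYLOADS);
    cp->payloads[cp->npayloads++] = payload;
    return true;
}
```

With this layout the outer level is already the list of unique channel names, so no separate pass over the notifications is needed to build it.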
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-18 16:41 Arseniy Mukhin <[email protected]>
parent: Tom Lane <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Arseniy Mukhin @ 2025-10-18 16:41 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Joel Jacobson <[email protected]>; Chao Li <[email protected]>; pgsql-hackers
On Thu, Oct 16, 2025 at 12:39 PM Joel Jacobson <[email protected]> wrote:
>
> On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> > Arseniy Mukhin <[email protected]> writes:
> >> I think "Direct advancement" is a good idea. But the way it's
> >> implemented now has a concurrency bug. Listeners store their current
> >> position in the local variable 'pos' while reading in
> >> asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. That
> >> means a notifier can directly advance the listener's position
> >> while the listener still has the old value in its local variable. At the
> >> same time, we use listener positions to determine how far we can truncate
> >> the queue in asyncQueueAdvanceTail().
> >
> > Good catch!
>
> I've implemented the three ideas presented below, attached as .txt files
> that are diffs on top of v19, which has these changes since v17:
>
Thank you for the new version and all implementations!
> 0002-optimize_listen_notify-v19.patch:
> * Improve wording of top comment per request from Chao Li.
> * Add initChannelHash call to top of SignalBackends,
> to fix bug reported by Arseniy Mukhin.
>
> > I think we can perhaps salvage the idea if we invent a separate
> > "advisory" queue position field, which tells its backend "hey,
> > you could skip as far as here if you want", but is not used for
> > purposes of SLRU truncation.
>
> Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
pos = QUEUE_BACKEND_POS(i);
/* Direct advancement for idle backends at the old head */
if (pendingNotifies != NULL &&
QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
If we have several notifying backends, it looks like only the first
one will be able to do direct advancement here. The next notifying backend
will fail the QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) check, because we
don't wake up the listener, so its pos is still what it was for the first
notifying backend. It seems that, to accumulate direct advancement from
several notifying backends, we need to compare queueHeadBeforeWrite
with advisoryPos here. We also need to advance advisoryPos to the
listener's position after reading, if advisoryPos has fallen behind.
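To make the suggestion concrete, here is a small standalone model in C. It is a sketch under simplifying assumptions, not async.c code: queue positions are plain 64-bit integers instead of (page, offset) pairs, locking is left out entirely, and ListenerSlot, try_direct_advance and listener_finish_read are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified per-listener state; positions are linear offsets here. */
typedef struct ListenerSlot
{
    int64_t     pos;            /* how far the listener has actually read */
    int64_t     advisoryPos;    /* how far notifiers say it may skip */
} ListenerSlot;

/*
 * Notifier side: compare against advisoryPos rather than pos, so that a
 * second notifier can extend the advice left behind by the first one.
 */
static bool
try_direct_advance(ListenerSlot *l, int64_t headBefore, int64_t headAfter)
{
    assert(headBefore <= headAfter);

    if (l->advisoryPos != headBefore)
        return false;           /* listener must be woken instead */
    l->advisoryPos = headAfter;
    return true;
}

/*
 * Listener side: after reading up to newPos, pull advisoryPos forward if
 * it has fallen behind, so later notifiers can match against it again.
 */
static void
listener_finish_read(ListenerSlot *l, int64_t newPos)
{
    l->pos = newPos;
    if (l->advisoryPos < newPos)
        l->advisoryPos = newPos;
}
```

In this model two notifiers writing [10, 15) and then [15, 20) both succeed without waking the listener, which is the accumulation described above.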
Minute of brainstorming
I also thought about a workload that probably frequently can be met.
Let's say we have sequence of notifications:
F F F T F F F T F F F T
Here F - notification from the channel we don't care about and T - the opposite.
It seems that after the first 'T' notification it will be more
difficult for notifying backends to do 'direct advancement' as there
will be some lag before the listener reads the notification and
advances its position. Not sure if it's a problem, probably it depends
on the intensity of notifications. But maybe we can use a bit more
sophisticated data structure here? Something like a list of skip
ranges. Every entry in the list is the range (pos1, pos2) that the
listener can skip during the reading. So instead of advancing
advisoryPos every notifying backend should add skip range to the list.
Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
pos3) -> (pos1, pos3). We also can limit the number of entries to 5
for example. Listeners on their side should clear the list before
reading and skip all ranges from it. What do you think? Is it
overkill?
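A rough sketch of what such a skip-range list could look like, under simplifying assumptions: positions are plain integers, ranges are half-open [start, end), and notifiers append ranges in queue order. SkipRange, SkipList, skiplist_add and skiplist_apply are hypothetical names, not part of async.c:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SKIP_RANGES 5		/* the cap of 5 suggested above */

typedef struct SkipRange
{
	int			start;			/* first position that may be skipped */
	int			end;			/* first position that must be read again */
} SkipRange;

typedef struct SkipList
{
	int			n;
	SkipRange	r[MAX_SKIP_RANGES];
} SkipList;

/*
 * Notifier side: record that [start, end) is of no interest to the listener.
 * Adjacent ranges are merged.  Returns false when the list is full, in which
 * case the caller would have to fall back to waking the listener.
 */
static bool
skiplist_add(SkipList *s, int start, int end)
{
	assert(start < end);

	if (s->n > 0 && s->r[s->n - 1].end == start)
	{
		s->r[s->n - 1].end = end;	/* (pos1,pos2) & (pos2,pos3) -> (pos1,pos3) */
		return true;
	}
	if (s->n == MAX_SKIP_RANGES)
		return false;
	s->r[s->n].start = start;
	s->r[s->n].end = end;
	s->n++;
	return true;
}

/* Listener side: if pos lies inside a recorded range, jump to its end. */
static int
skiplist_apply(const SkipList *s, int pos)
{
	for (int i = 0; i < s->n; i++)
	{
		if (pos >= s->r[i].start && pos < s->r[i].end)
			pos = s->r[i].end;
	}
	return pos;
}
```

For the F F F T F F F T sequence, the three leading F entries collapse into one range and the next three into a second one, so the listener only stops at the two T entries.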
>
> > Alternatively, split the queue pos
> > into "this is where to read next" and "this is as much as I'm
> > definitively done with", where the second field gets advanced at
> > the end of asyncQueueReadAllNotifications. Not sure which
> > view would be less confusing (in the end I guess they're nearly
> > the same thing, differently explained).
>
> Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt
>
IMHO it's a little bit more confusing than the first option. Two
points I noticed:
1) We have a fast path in asyncQueueReadAllNotifications()
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
return;
}
Should we update donePos here? It looks like donePos may never be
updated without it.
2) In SignalBackends()
/* Signal backends that have fallen too far behind */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
if (lag >= QUEUE_CLEANUP_DELAY)
{
pid = QUEUE_BACKEND_PID(i);
Assert(pid != InvalidPid);
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
count++;
}
Should we use donePos here as it is responsible for queue truncation now?
> > A different line of thought could be to get rid of
> > asyncQueueReadAllNotifications's optimization of moving the
> > queue pos only once, per
> >
> > * (We could alternatively retake NotifyQueueLock and move the position
> > * before handling each individual message, but that seems like too much
> > * lock traffic.)
> >
> > Since we only need shared lock to advance our own queue pos,
> > maybe that wouldn't be too awful. Not sure.
>
> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
>
Hmm, it seems we still have the race when in the beginning of
asyncQueueReadAllNotifications we read pos into the local variable and
release the lock. IIUC to avoid the race without introducing another
field here, the listener needs to hold the lock until it updates its
position so that the notifying backend cannot change it concurrently.
Best regards,
Arseniy Mukhin
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-19 22:06 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
1 sibling, 2 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-19 22:06 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Thu, Oct 16, 2025, at 22:16, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
>>> Building pendingNotifyChannels is O(N^2) yes, but how large N is
>>> realistic here?
>
>> I agree this looks like a real problem, since I guess it's not
>> completely unthinkable someone might have
>> some kind of trigger on a table, that could fire off NOTIFY
>> for each row, possibly causing hundreds of thousands of
>> notifies in the same db txn.
>
> We already de-duplicate identical NOTIFY operations for exactly that
> reason (cf. AsyncExistsPendingNotify). However, non-identical NOTIFYs
> obviously can't be merged.
>
> I wonder whether we could adapt that de-duplication logic so that
> it produces a list of unique channel names in addition to a list
> of unique NOTIFY events. One way could be a list/hashtable of
> channels used, and for each one a list/hashtable of unique payloads,
> rather than the existing single-level list/hashtable.
Thanks for the great idea! Yes, this was indeed possible.
0002-optimize_listen_notify-v20.patch:
* Added channelHashtab field, created and updated together with hashtab.
If we have channelHashtab, it's used within PreCommit_Notify to
  quickly build pendingNotifyChannels.
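For readers following along, a simplified sketch of the two-level de-duplication idea. This is not the v20 code: it uses small fixed arrays and linear search where the patch uses a list/hashtable, and ChannelEntry, PendingSketch and pending_add are hypothetical names. The point is that the outer level directly yields the list of unique channels:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CHANNELS 8
#define MAX_PAYLOADS 8

/* One entry per distinct channel, holding that channel's unique payloads. */
typedef struct ChannelEntry
{
	const char *channel;
	int			npayloads;
	const char *payloads[MAX_PAYLOADS];
} ChannelEntry;

typedef struct PendingSketch
{
	int			nchannels;
	ChannelEntry channels[MAX_CHANNELS];
} PendingSketch;

/*
 * Queue a NOTIFY.  Returns true if it was added, false if it duplicates an
 * already-pending (channel, payload) pair or the sketch's fixed capacity is
 * exhausted.
 */
static bool
pending_add(PendingSketch *p, const char *channel, const char *payload)
{
	ChannelEntry *ce = NULL;

	assert(channel != NULL && payload != NULL);

	/* Outer level: find or create the channel entry. */
	for (int i = 0; i < p->nchannels; i++)
	{
		if (strcmp(p->channels[i].channel, channel) == 0)
		{
			ce = &p->channels[i];
			break;
		}
	}
	if (ce == NULL)
	{
		if (p->nchannels == MAX_CHANNELS)
			return false;
		ce = &p->channels[p->nchannels++];
		ce->channel = channel;
		ce->npayloads = 0;
	}

	/* Inner level: de-duplicate payloads within the channel. */
	for (int i = 0; i < ce->npayloads; i++)
	{
		if (strcmp(ce->payloads[i], payload) == 0)
			return false;
	}
	if (ce->npayloads == MAX_PAYLOADS)
		return false;
	ce->payloads[ce->npayloads++] = payload;
	return true;
}
```

After any number of pending_add calls, the unique channel names are simply channels[0 .. nchannels-1], with no quadratic pass over the notification list.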
In this email, I'm also responding to the feedback from Arseniy Mukhin,
and I've based the alt1, alt2, alt3 .txt patches on top of v20.
On Sat, Oct 18, 2025, at 18:41, Arseniy Mukhin wrote:
> Thank you for the new version and all implementations!
Thanks for the review and great ideas!
>> > I think we can perhaps salvage the idea if we invent a separate
>> > "advisory" queue position field, which tells its backend "hey,
>> > you could skip as far as here if you want", but is not used for
>> > purposes of SLRU truncation.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
>
> pos = QUEUE_BACKEND_POS(i);
>
> /* Direct advancement for idle backends at the old head */
> if (pendingNotifies != NULL &&
> QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
> {
> QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
>
> If we have several notifying backends, it looks like only the first
> one will be able to do direct advancement here. Next notifying backend
> will fail on QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) as we don't
> wake up the listener and pos will be the same as it was for the first
> notifying backend.
Right.
> It seems that to accumulate direct advancement from
> several notifying backends we need to compare queueHeadBeforeWrite
> with advisoryPos here.
*** 0002-optimize_listen_notify-v20-alt1.txt:
* Fixed; compare advisoryPos with queueHeadBeforeWrite instead of pos.
> And we also need to advance advisoryPos to the
> listener's position after reading if advisoryPos falls behind.
* Fixed; set advisoryPos to max(pos, advisoryPos) in PG_FINALLY block.
* Also noted Exec_ListenPreCommit didn't set advisoryPos to max
for the first LISTEN, now fixed.
> Minute of brainstorming
>
> I also thought about a workload that is probably quite common.
> Let's say we have sequence of notifications:
>
> F F F T F F F T F F F T
>
> Here F - notification from the channel we don't care about and T - the opposite.
> It seems that after the first 'T' notification it will be more
> difficult for notifying backends to do 'direct advancement' as there
> will be some lag before the listener reads the notification and
> advances its position. Not sure if it's a problem, probably it depends
> on the intensity of notifications.
Hmm, I realize both the advisoryPos and donePos ideas share a problem;
they both require listening backends to wake up eventually anyway,
just to advance the 'pos'.
The holy grail would be to avoid this context switching cost entirely,
and only need to wake up listening backends when they are actually
interested in the queued notifications. I think the third idea,
alt3, is most promising in achieving this goal.
> But maybe we can use a bit more
> sophisticated data structure here? Something like a list of skip
> ranges. Every entry in the list is the range (pos1, pos2) that the
> listener can skip during the reading. So instead of advancing
> advisoryPos every notifying backend should add skip range to the list.
> Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
> pos3) -> (pos1, pos3). We also can limit the number of entries to 5
> for example. Listeners on their side should clear the list before
> reading and skip all ranges from it. What do you think? Is it
> overkill?
Hmm, maybe, but I'm a bit wary about too much complication.
Hopefully there is a simpler solution that avoids the need for this,
but sure, if we can't find one, then I'm open to trying this skip ranges idea.
>> > Alternatively, split the queue pos
>> > into "this is where to read next" and "this is as much as I'm
>> > definitively done with", where the second field gets advanced at
>> > the end of asyncQueueReadAllNotifications. Not sure which
>> > view would be less confusing (in the end I guess they're nearly
>> > the same thing, differently explained).
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt
>>
>
> IMHO it's a little bit more confusing than the first option. Two
> points I noticed:
>
> 1) We have a fast path in asyncQueueReadAllNotifications()
>
> if (QUEUE_POS_EQUAL(pos, head))
> {
> /* Nothing to do, we have read all notifications already. */
> return;
> }
>
> Should we update donePos here? It looks like donePos may never be
> updated without it.
*** 0002-optimize_listen_notify-v20-alt2.txt:
* Fixed; update donePos here
> 2) In SignalBackends()
>
> /* Signal backends that have fallen too far behind */
> lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
> QUEUE_POS_PAGE(pos));
>
> if (lag >= QUEUE_CLEANUP_DELAY)
> {
> pid = QUEUE_BACKEND_PID(i);
> Assert(pid != InvalidPid);
>
> QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
> pids[count] = pid;
> procnos[count] = i;
> count++;
> }
>
> Should we use donePos here as it is responsible for queue truncation now?
* Fixed; use donePos here
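A small sketch of why donePos is the field that matters for both the lag check and truncation in alt2. It is a simplified model, not the patch itself: positions are counted in whole pages as plain integers, and BackendSketch, compute_tail and needs_wakeup are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-listener state; positions are page numbers (assumption). */
typedef struct BackendSketch
{
	int			pos;			/* next position to read from */
	int			donePos;		/* definitively processed up to here */
} BackendSketch;

/* The tail can only advance to the minimum donePos over all listeners. */
static int
compute_tail(const BackendSketch *b, int n, int head)
{
	int			min = head;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
	{
		if (b[i].donePos < min)
			min = b[i].donePos;
	}
	return min;
}

/*
 * A listener whose pos was directly advanced may still hold back the tail
 * through a stale donePos, so the lag check looks at donePos.
 */
static bool
needs_wakeup(const BackendSketch *b, int head, int cleanup_delay)
{
	return head - b->donePos >= cleanup_delay;
}
```

A listener with pos at the head but donePos far behind still pins the tail, and needs_wakeup reports that it must be signalled so it can catch up.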
>> > A different line of thought could be to get rid of
>> > asyncQueueReadAllNotifications's optimization of moving the
>> > queue pos only once, per
>> >
>> > * (We could alternatively retake NotifyQueueLock and move the position
>> > * before handling each individual message, but that seems like too much
>> > * lock traffic.)
>> >
>> > Since we only need shared lock to advance our own queue pos,
>> > maybe that wouldn't be too awful. Not sure.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
>>
>
> Hmm, it seems we still have the race when in the beginning of
> asyncQueueReadAllNotifications we read pos into the local variable and
> release the lock. IIUC to avoid the race without introducing another
> field here, the listener needs to hold the lock until it updates its
> position so that the notifying backend cannot change it concurrently.
*** 0002-optimize_listen_notify-v20-alt3.txt:
* Fixed; the shared 'pos' is now only updated if the new position is ahead.
To me, it looks like alt3 is the winner in terms of simplicity, and is
also the winner in my ping-pong benchmark, due to avoiding context
switches more effectively than alt1 and alt2.
Eager to hear your thoughts!
/Joel
From afff0f3f8b01cfde369c564025313e6acc9a610a Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:08:05 +0200
Subject: [PATCH] Implements idea #1: advisoryPos
---
src/backend/commands/async.c | 63 +++++++++++++++++++++++++++++++++---
1 file changed, 58 insertions(+), 5 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..6a02f5e3acc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* backend could skip queue to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +681,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1320,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2031,9 +2040,13 @@ SignalBackends(void)
* Even though we may take and release NotifyQueueLock multiple times
* while writing, the heavyweight lock guarantees this region contains
* only our messages. Therefore, any backend still positioned at the
- * queue head from before our write can be safely advanced to the current
+ * queue head from before our write can be advised to skip to the current
* queue head without waking it.
*
+ * We use the advisoryPos field rather than directly modifying pos.
+ * The backend controls its own pos field and will check advisoryPos
+ * when it's safe to do so.
+ *
* False-positive possibility: if a backend was previously signaled but
* hasn't yet awoken, we'll skip advancing it (because wakeupPending is
* true). This is safe - the backend will advance its pointer when it
@@ -2048,6 +2061,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition advisoryPos;
int64 lag;
int32 pid;
@@ -2055,15 +2069,31 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
- /* Direct advancement for idle backends at the old head */
+ /*
+ * Direct advancement for idle backends at the old head.
+ *
+ * We check advisoryPos rather than pos to allow accumulating advances
+ * from multiple consecutive notifying backends. If we checked pos,
+ * only the first notifier could advance idle backends; subsequent
+ * notifiers would find pos unchanged (since the backend hasn't woken
+ * up yet) and fail to advance further.
+ */
if (pendingNotifies != NULL &&
- QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ QUEUE_POS_EQUAL(advisoryPos, queueHeadBeforeWrite))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
- pos = queueHeadAfterWrite;
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ advisoryPos = queueHeadAfterWrite;
}
+ /*
+ * For lag calculation, use whichever position is further ahead.
+ * This ensures we don't spuriously wake a backend that has been
+ * directly advanced.
+ */
+ pos = QUEUE_POS_MAX(pos, advisoryPos);
+
/* Signal backends that have fallen too far behind */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
@@ -2302,6 +2332,7 @@ static void
asyncQueueReadAllNotifications(void)
{
volatile QueuePosition pos;
+ QueuePosition advisoryPos;
QueuePosition head;
Snapshot snapshot;
@@ -2319,6 +2350,21 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
+
+ /*
+ * Check if another backend has set an advisory position for us.
+ * If so, and if we haven't yet read past that point, we can safely
+ * adopt the advisory position and skip the intervening notifications.
+ */
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (!QUEUE_POS_EQUAL(advisoryPos, pos) &&
+ QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
+
LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
@@ -2440,6 +2486,13 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ /*
+ * Advance advisoryPos to our current position if it has fallen behind,
+ * but preserve any newer advisory position that may have been set by
+ * another backend while we were processing notifications.
+ */
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) =
+ QUEUE_POS_MAX(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
--
2.50.1
From c403098ae4e4d06f109eb6292a67c6577e123010 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:35:44 +0200
Subject: [PATCH] Implement idea #3
---
src/backend/commands/async.c | 150 ++++++++++++++++++++---------------
1 file changed, 85 insertions(+), 65 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..b34e4a2247b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -2304,6 +2309,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool reachedStop;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2372,77 +2378,69 @@ asyncQueueReadAllNotifications(void)
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If
* that happens it is critical that we not try to send the same message
- * over and over again. Therefore, we place a PG_TRY block here that will
- * forcibly advance our queue position before we lose control to an error.
- * (We could alternatively retake NotifyQueueLock and move the position
- * before handling each individual message, but that seems like too much
- * lock traffic.)
+ * over and over again. Therefore, we must advance our queue position
+ * regularly as we process messages.
+ *
+ * We must also be careful about concurrency: SignalBackends() can
+ * directly advance our position while we're reading. To preserve such
+ * advancement, asyncQueueProcessPageEntries updates our position in
+ * shared memory for each message, only writing if our position is ahead.
+ * Shared lock is sufficient since we're only updating our own position.
*/
- PG_TRY();
+ do
{
- bool reachedStop;
+ int64 curpage = QUEUE_POS_PAGE(pos);
+ int curoffset = QUEUE_POS_OFFSET(pos);
+ int slotno;
+ int copysize;
- do
+ /*
+ * We copy the data from SLRU into a local buffer, so as to avoid
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
+ */
+ slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+ InvalidTransactionId);
+ if (curpage == QUEUE_POS_PAGE(head))
{
- int64 curpage = QUEUE_POS_PAGE(pos);
- int curoffset = QUEUE_POS_OFFSET(pos);
- int slotno;
- int copysize;
+ /* we only want to read as far as head */
+ copysize = QUEUE_POS_OFFSET(head) - curoffset;
+ if (copysize < 0)
+ copysize = 0; /* just for safety */
+ }
+ else
+ {
+ /* fetch all the rest of the page */
+ copysize = QUEUE_PAGESIZE - curoffset;
+ }
+ memcpy(page_buffer.buf + curoffset,
+ NotifyCtl->shared->page_buffer[slotno] + curoffset,
+ copysize);
+ /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
- /*
- * We copy the data from SLRU into a local buffer, so as to avoid
- * holding the SLRU lock while we are examining the entries and
- * possibly transmitting them to our frontend. Copy only the part
- * of the page we will actually inspect.
- */
- slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
- if (curpage == QUEUE_POS_PAGE(head))
- {
- /* we only want to read as far as head */
- copysize = QUEUE_POS_OFFSET(head) - curoffset;
- if (copysize < 0)
- copysize = 0; /* just for safety */
- }
- else
- {
- /* fetch all the rest of the page */
- copysize = QUEUE_PAGESIZE - curoffset;
- }
- memcpy(page_buffer.buf + curoffset,
- NotifyCtl->shared->page_buffer[slotno] + curoffset,
- copysize);
- /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+ /*
+ * Process messages up to the stop position, end of page, or an
+ * uncommitted message.
+ *
+ * Our stop position is what we found to be the head's position
+ * when we entered this function. It might have changed already.
+ * But if it has, we will receive (or have already received and
+ * queued) another signal and come here again.
+ *
+ * We are not holding NotifyQueueLock here! The queue can only
+ * extend beyond the head pointer (see above).
+ * asyncQueueProcessPageEntries will update our backend's position
+ * for each message to ensure we don't reprocess messages if we fail
+ * partway through, and to preserve any direct advancement that
+ * SignalBackends() might perform concurrently.
+ */
+ reachedStop = asyncQueueProcessPageEntries(&pos, head,
+ page_buffer.buf,
+ snapshot);
- /*
- * Process messages up to the stop position, end of page, or an
- * uncommitted message.
- *
- * Our stop position is what we found to be the head's position
- * when we entered this function. It might have changed already.
- * But if it has, we will receive (or have already received and
- * queued) another signal and come here again.
- *
- * We are not holding NotifyQueueLock here! The queue can only
- * extend beyond the head pointer (see above) and we leave our
- * backend's pointer where it is so nobody will truncate or
- * rewrite pages under us. Especially we don't want to hold a lock
- * while sending the notifications to the frontend.
- */
- reachedStop = asyncQueueProcessPageEntries(&pos, head,
- page_buffer.buf,
- snapshot);
- } while (!reachedStop);
- }
- PG_FINALLY();
- {
- /* Update shared state */
- LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
- LWLockRelease(NotifyQueueLock);
- }
- PG_END_TRY();
+ } while (!reachedStop);
/* Done with snapshot */
UnregisterSnapshot(snapshot);
@@ -2490,6 +2488,24 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
*/
reachedEndOfPage = asyncQueueAdvance(current, qe->length);
+ /*
+ * Update our position in shared memory immediately after advancing,
+ * before we attempt to process the message. This ensures we won't
+ * reprocess this message if NotifyMyFrontEnd fails.
+ *
+ * Only write if our position is ahead of the shared position.
+ * If the shared position is already ahead (due to direct advancement
+ * by SignalBackends), preserve it by not overwriting.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ {
+ QueuePosition sharedPos = QUEUE_BACKEND_POS(MyProcNumber);
+
+ if (QUEUE_POS_PRECEDES(sharedPos, *current))
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ }
+ LWLockRelease(NotifyQueueLock);
+
/* Ignore messages destined for other databases */
if (qe->dboid == MyDatabaseId)
{
@@ -2515,6 +2531,10 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
* messages.
*/
*current = thisentry;
+ /* Update shared memory to reflect the backed-up position */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ LWLockRelease(NotifyQueueLock);
reachedStop = true;
break;
}
--
2.50.1
From 928cc032706ac154153279adbdfba95f6af2fae4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:12:47 +0200
Subject: [PATCH] Implement idea #2: donePos
---
src/backend/commands/async.c | 57 +++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 8 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..c81807107d1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -285,7 +285,8 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +348,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +676,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1315,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2048,6 +2052,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition donePos;
int64 lag;
int32 pid;
@@ -2055,6 +2060,7 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ donePos = QUEUE_BACKEND_DONEPOS(i);
/* Direct advancement for idle backends at the old head */
if (pendingNotifies != NULL &&
@@ -2064,9 +2070,17 @@ SignalBackends(void)
pos = queueHeadAfterWrite;
}
- /* Signal backends that have fallen too far behind */
+ /*
+ * Signal backends that have fallen too far behind.
+ *
+ * We use donePos rather than pos for the lag check because donePos
+ * is what matters for queue truncation (see asyncQueueAdvanceTail).
+ * A backend may have been directly advanced (pos is recent) while
+ * donePos is still far behind, holding back the tail. We need to
+ * wake such backends so they can advance their donePos.
+ */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos));
+ QUEUE_POS_PAGE(donePos));
if (lag >= QUEUE_CLEANUP_DELAY)
{
@@ -2319,14 +2333,25 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
- /* Nothing to do, we have read all notifications already. */
+ /*
+ * Nothing to do, we have read all notifications already.
+ *
+ * Update donePos to match pos before returning. This is important
+ * when our position was advanced via direct advancement: we need to
+ * update donePos so the queue tail can advance. Without this,
+ * backends that have been directly advanced would hold back queue
+ * truncation indefinitely.
+ */
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -2437,9 +2462,19 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos variable),
+ * as this is used for truncation safety. For the read position (pos),
+ * we use the maximum of our local position and the current shared
+ * position, in case another backend used direct advancement to skip us
+ * ahead while we were reading. This prevents us from going backwards
+ * and potentially pointing to a truncated page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2589,7 +2624,13 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
--
2.50.1
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v20.patch (9.3K, 2-0001-optimize_listen_notify-v20.patch)
download | inline diff:
From f37095250521d0a29d812997b7b79d938ed9c894 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
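Editor's sketch, not part of the patch: the subtransaction permutations added to async-notify.spec above exercise three paths for pending LISTEN actions (reparenting on RELEASE when the parent has none, merging when both levels have some, and discarding on ROLLBACK TO SAVEPOINT). The toy model below only illustrates that expected behaviour. The names `ActionList`, `add_listen`, `release_into_parent`, and `rollback_child` are hypothetical and do not appear in async.c.

```c
#include <assert.h>
#include <string.h>

#define MAX_ACTIONS 8

/* Hypothetical model of the LISTEN actions pending at one transaction level. */
typedef struct
{
    const char *channels[MAX_ACTIONS];
    int         n;
} ActionList;

/* Queue a LISTEN at the given level; returns 0 if the list is full. */
int
add_listen(ActionList *level, const char *channel)
{
    if (level->n >= MAX_ACTIONS)
        return 0;
    level->channels[level->n++] = channel;
    return 1;
}

/*
 * RELEASE SAVEPOINT: the child's pending actions are appended to the
 * parent's.  With an empty parent this is plain reparenting; otherwise it
 * is a merge.  Returns the number of actions moved.
 */
int
release_into_parent(ActionList *parent, ActionList *child)
{
    int moved = 0;

    for (int i = 0; i < child->n && parent->n < MAX_ACTIONS; i++)
    {
        parent->channels[parent->n++] = child->channels[i];
        moved++;
    }
    child->n = 0;
    return moved;
}

/* ROLLBACK TO SAVEPOINT: the child's pending actions are discarded. */
void
rollback_child(ActionList *child)
{
    child->n = 0;
}
```

In this model the "merge" permutation corresponds to an outer list holding c3 receiving c1 and c2 from the inner list, while the "abort" permutation leaves nothing pending, which is why no notification is expected after lsnotify_check.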
[text/plain] 0002-optimize_listen_notify-v20-alt1.txt (6.1K, 3-0002-optimize_listen_notify-v20-alt1.txt)
From afff0f3f8b01cfde369c564025313e6acc9a610a Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:08:05 +0200
Subject: [PATCH] Implements idea #1: advisoryPos
---
src/backend/commands/async.c | 63 +++++++++++++++++++++++++++++++++---
1 file changed, 58 insertions(+), 5 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..6a02f5e3acc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* backend could skip queue to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +681,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1320,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2031,9 +2040,13 @@ SignalBackends(void)
* Even though we may take and release NotifyQueueLock multiple times
* while writing, the heavyweight lock guarantees this region contains
* only our messages. Therefore, any backend still positioned at the
- * queue head from before our write can be safely advanced to the current
+ * queue head from before our write can be advised to skip to the current
* queue head without waking it.
*
+ * We use the advisoryPos field rather than directly modifying pos.
+ * The backend controls its own pos field and will check advisoryPos
+ * when it's safe to do so.
+ *
* False-positive possibility: if a backend was previously signaled but
* hasn't yet awoken, we'll skip advancing it (because wakeupPending is
* true). This is safe - the backend will advance its pointer when it
@@ -2048,6 +2061,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition advisoryPos;
int64 lag;
int32 pid;
@@ -2055,15 +2069,31 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
- /* Direct advancement for idle backends at the old head */
+ /*
+ * Direct advancement for idle backends at the old head.
+ *
+ * We check advisoryPos rather than pos to allow accumulating advances
+ * from multiple consecutive notifying backends. If we checked pos,
+ * only the first notifier could advance idle backends; subsequent
+ * notifiers would find pos unchanged (since the backend hasn't woken
+ * up yet) and fail to advance further.
+ */
if (pendingNotifies != NULL &&
- QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ QUEUE_POS_EQUAL(advisoryPos, queueHeadBeforeWrite))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
- pos = queueHeadAfterWrite;
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ advisoryPos = queueHeadAfterWrite;
}
+ /*
+ * For lag calculation, use whichever position is further ahead.
+ * This ensures we don't spuriously wake a backend that has been
+ * directly advanced.
+ */
+ pos = QUEUE_POS_MAX(pos, advisoryPos);
+
/* Signal backends that have fallen too far behind */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
@@ -2302,6 +2332,7 @@ static void
asyncQueueReadAllNotifications(void)
{
volatile QueuePosition pos;
+ QueuePosition advisoryPos;
QueuePosition head;
Snapshot snapshot;
@@ -2319,6 +2350,21 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
+
+ /*
+ * Check if another backend has set an advisory position for us.
+ * If so, and if we haven't yet read past that point, we can safely
+ * adopt the advisory position and skip the intervening notifications.
+ */
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (!QUEUE_POS_EQUAL(advisoryPos, pos) &&
+ QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
+
LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
@@ -2440,6 +2486,13 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ /*
+ * Advance advisoryPos to our current position if it has fallen behind,
+ * but preserve any newer advisory position that may have been set by
+ * another backend while we were processing notifications.
+ */
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) =
+ QUEUE_POS_MAX(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
--
2.50.1
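Editor's sketch, not part of the patch: the advisoryPos idea above can be reduced to two small operations, one on the notifier side and one on the listener side. The code below is a simplified model under stated assumptions: `Pos` stands in for QueuePosition and ignores page wraparound (the real code uses asyncQueuePagePrecedes), there is no locking, and the function names `advise_skip` and `adopt_advice` are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for QueuePosition; page wraparound is ignored. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

bool
pos_equal(Pos a, Pos b)
{
    return a.page == b.page && a.offset == b.offset;
}

bool
pos_precedes(Pos a, Pos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/* Hypothetical per-listener state: its own read position plus a hint. */
typedef struct
{
    Pos pos;          /* written only by the listener itself */
    Pos advisoryPos;  /* hint written by notifiers */
} Listener;

/*
 * Notifier side: if the listener's hint still sits at the head from before
 * our write, move the hint past our messages.  Comparing against the hint
 * (not pos) lets consecutive notifiers accumulate skips while the listener
 * is still asleep.  Returns true if the hint was advanced.
 */
bool
advise_skip(Listener *l, Pos head_before, Pos head_after)
{
    if (pos_equal(l->advisoryPos, head_before))
    {
        l->advisoryPos = head_after;
        return true;
    }
    return false;
}

/* Listener side: on wakeup, adopt the hint if it is ahead of our position. */
Pos
adopt_advice(Listener *l)
{
    if (pos_precedes(l->pos, l->advisoryPos))
        l->pos = l->advisoryPos;
    return l->pos;
}
```

The point of the split is that only the listener ever writes its own pos; notifiers touch nothing but the hint, so a listener that is in the middle of reading cannot have its position changed underneath it.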
[text/plain] 0002-optimize_listen_notify-v20-alt3.txt (7.6K, 4-0002-optimize_listen_notify-v20-alt3.txt)
From c403098ae4e4d06f109eb6292a67c6577e123010 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:35:44 +0200
Subject: [PATCH] Implement idea #3
---
src/backend/commands/async.c | 150 ++++++++++++++++++++---------------
1 file changed, 85 insertions(+), 65 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..b34e4a2247b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -2304,6 +2309,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool reachedStop;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2372,77 +2378,69 @@ asyncQueueReadAllNotifications(void)
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If
* that happens it is critical that we not try to send the same message
- * over and over again. Therefore, we place a PG_TRY block here that will
- * forcibly advance our queue position before we lose control to an error.
- * (We could alternatively retake NotifyQueueLock and move the position
- * before handling each individual message, but that seems like too much
- * lock traffic.)
+ * over and over again. Therefore, we must advance our queue position
+ * regularly as we process messages.
+ *
+ * We must also be careful about concurrency: SignalBackends() can
+ * directly advance our position while we're reading. To preserve such
+ * advancement, asyncQueueProcessPageEntries updates our position in
+ * shared memory for each message, only writing if our position is ahead.
+ * A shared lock is sufficient since we're only updating our own position.
*/
- PG_TRY();
+ do
{
- bool reachedStop;
+ int64 curpage = QUEUE_POS_PAGE(pos);
+ int curoffset = QUEUE_POS_OFFSET(pos);
+ int slotno;
+ int copysize;
- do
+ /*
+ * We copy the data from SLRU into a local buffer, so as to avoid
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
+ */
+ slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+ InvalidTransactionId);
+ if (curpage == QUEUE_POS_PAGE(head))
{
- int64 curpage = QUEUE_POS_PAGE(pos);
- int curoffset = QUEUE_POS_OFFSET(pos);
- int slotno;
- int copysize;
+ /* we only want to read as far as head */
+ copysize = QUEUE_POS_OFFSET(head) - curoffset;
+ if (copysize < 0)
+ copysize = 0; /* just for safety */
+ }
+ else
+ {
+ /* fetch all the rest of the page */
+ copysize = QUEUE_PAGESIZE - curoffset;
+ }
+ memcpy(page_buffer.buf + curoffset,
+ NotifyCtl->shared->page_buffer[slotno] + curoffset,
+ copysize);
+ /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
- /*
- * We copy the data from SLRU into a local buffer, so as to avoid
- * holding the SLRU lock while we are examining the entries and
- * possibly transmitting them to our frontend. Copy only the part
- * of the page we will actually inspect.
- */
- slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
- if (curpage == QUEUE_POS_PAGE(head))
- {
- /* we only want to read as far as head */
- copysize = QUEUE_POS_OFFSET(head) - curoffset;
- if (copysize < 0)
- copysize = 0; /* just for safety */
- }
- else
- {
- /* fetch all the rest of the page */
- copysize = QUEUE_PAGESIZE - curoffset;
- }
- memcpy(page_buffer.buf + curoffset,
- NotifyCtl->shared->page_buffer[slotno] + curoffset,
- copysize);
- /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+ /*
+ * Process messages up to the stop position, end of page, or an
+ * uncommitted message.
+ *
+ * Our stop position is what we found to be the head's position
+ * when we entered this function. It might have changed already.
+ * But if it has, we will receive (or have already received and
+ * queued) another signal and come here again.
+ *
+ * We are not holding NotifyQueueLock here! The queue can only
+ * extend beyond the head pointer (see above).
+ * asyncQueueProcessPageEntries will update our backend's position
+ * for each message to ensure we don't reprocess messages if we fail
+ * partway through, and to preserve any direct advancement that
+ * SignalBackends() might perform concurrently.
+ */
+ reachedStop = asyncQueueProcessPageEntries(&pos, head,
+ page_buffer.buf,
+ snapshot);
- /*
- * Process messages up to the stop position, end of page, or an
- * uncommitted message.
- *
- * Our stop position is what we found to be the head's position
- * when we entered this function. It might have changed already.
- * But if it has, we will receive (or have already received and
- * queued) another signal and come here again.
- *
- * We are not holding NotifyQueueLock here! The queue can only
- * extend beyond the head pointer (see above) and we leave our
- * backend's pointer where it is so nobody will truncate or
- * rewrite pages under us. Especially we don't want to hold a lock
- * while sending the notifications to the frontend.
- */
- reachedStop = asyncQueueProcessPageEntries(&pos, head,
- page_buffer.buf,
- snapshot);
- } while (!reachedStop);
- }
- PG_FINALLY();
- {
- /* Update shared state */
- LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
- LWLockRelease(NotifyQueueLock);
- }
- PG_END_TRY();
+ } while (!reachedStop);
/* Done with snapshot */
UnregisterSnapshot(snapshot);
@@ -2490,6 +2488,24 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
*/
reachedEndOfPage = asyncQueueAdvance(current, qe->length);
+ /*
+ * Update our position in shared memory immediately after advancing,
+ * before we attempt to process the message. This ensures we won't
+ * reprocess this message if NotifyMyFrontEnd fails.
+ *
+ * Only write if our position is ahead of the shared position.
+ * If the shared position is already ahead (due to direct advancement
+ * by SignalBackends), preserve it by not overwriting.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ {
+ QueuePosition sharedPos = QUEUE_BACKEND_POS(MyProcNumber);
+
+ if (QUEUE_POS_PRECEDES(sharedPos, *current))
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ }
+ LWLockRelease(NotifyQueueLock);
+
/* Ignore messages destined for other databases */
if (qe->dboid == MyDatabaseId)
{
@@ -2515,6 +2531,10 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
* messages.
*/
*current = thisentry;
+ /* Update shared memory to reflect the backed-up position */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ LWLockRelease(NotifyQueueLock);
reachedStop = true;
break;
}
--
2.50.1
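Editor's sketch, not part of the patch: idea #3 above replaces the PG_TRY/PG_FINALLY position update with a per-message write that only moves the shared position forward. The following model shows just that compare-then-write rule. It assumes a simplified `Pos` without page wraparound and omits the NotifyQueueLock acquisition that the patch performs around the update; `publish_if_ahead` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for QueuePosition; page wraparound is ignored. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

bool
pos_precedes(Pos a, Pos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/*
 * Publish the reader's local position to the shared slot, but only when it
 * is ahead of what is already there.  If a notifier has directly advanced
 * the shared slot in the meantime, that advancement is preserved.
 * Returns true if the shared slot was overwritten.
 */
bool
publish_if_ahead(Pos *shared, Pos local)
{
    if (pos_precedes(*shared, local))
    {
        *shared = local;
        return true;
    }
    return false;
}
```

The trade-off the removed comment mentioned still applies: updating shared state once per message means one lock acquisition per message rather than one per call.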
[text/plain] 0002-optimize_listen_notify-v20-alt2.txt (5.6K, 5-0002-optimize_listen_notify-v20-alt2.txt)
From 928cc032706ac154153279adbdfba95f6af2fae4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:12:47 +0200
Subject: [PATCH] Implement idea #2: donePos
---
src/backend/commands/async.c | 57 +++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 8 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..c81807107d1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -285,7 +285,8 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +348,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +676,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1315,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2048,6 +2052,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition donePos;
int64 lag;
int32 pid;
@@ -2055,6 +2060,7 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ donePos = QUEUE_BACKEND_DONEPOS(i);
/* Direct advancement for idle backends at the old head */
if (pendingNotifies != NULL &&
@@ -2064,9 +2070,17 @@ SignalBackends(void)
pos = queueHeadAfterWrite;
}
- /* Signal backends that have fallen too far behind */
+ /*
+ * Signal backends that have fallen too far behind.
+ *
+ * We use donePos rather than pos for the lag check because donePos
+ * is what matters for queue truncation (see asyncQueueAdvanceTail).
+ * A backend may have been directly advanced (pos is recent) while
+ * donePos is still far behind, holding back the tail. We need to
+ * wake such backends so they can advance their donePos.
+ */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos));
+ QUEUE_POS_PAGE(donePos));
if (lag >= QUEUE_CLEANUP_DELAY)
{
@@ -2319,14 +2333,25 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
- /* Nothing to do, we have read all notifications already. */
+ /*
+ * Nothing to do, we have read all notifications already.
+ *
+ * Update donePos to match pos before returning. This is important
+ * when our position was advanced via direct advancement: we need to
+ * update donePos so the queue tail can advance. Without this,
+ * backends that have been directly advanced would hold back queue
+ * truncation indefinitely.
+ */
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -2437,9 +2462,19 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos variable),
+ * as this is used for truncation safety. For the read position (pos),
+ * we use the maximum of our local position and the current shared
+ * position, in case another backend used direct advancement to skip us
+ * ahead while we were reading. This prevents us from going backwards
+ * and potentially pointing to a truncated page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2589,7 +2624,13 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
--
2.50.1
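Editor's sketch, not part of the patch: idea #2 above separates the position a backend will read from next (pos, which other backends may advance) from the position it has definitively processed (donePos), and bases both tail truncation and the lag check on donePos. The model below illustrates that rule under simplifying assumptions: `Pos` ignores page wraparound, listeners live in a plain array rather than the linked list in shared memory, and `safe_tail` and `needs_wakeup` are hypothetical names. The `cleanup_delay` parameter plays the role of QUEUE_CLEANUP_DELAY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for QueuePosition; page wraparound is ignored. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

/* Hypothetical per-listener state for the donePos scheme. */
typedef struct
{
    Pos pos;      /* next position to read; may be advanced by notifiers */
    Pos donePos;  /* what this backend has definitively processed */
} Listener;

bool
pos_precedes(Pos a, Pos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

Pos
pos_min(Pos a, Pos b)
{
    return pos_precedes(a, b) ? a : b;
}

/* The tail may advance only to the minimum donePos over all listeners. */
Pos
safe_tail(const Listener *listeners, int n, Pos head)
{
    Pos min = head;

    for (int i = 0; i < n; i++)
        min = pos_min(min, listeners[i].donePos);
    return min;
}

/* A listener whose donePos lags too many pages behind head must be woken. */
bool
needs_wakeup(const Listener *l, Pos head, int64_t cleanup_delay)
{
    return head.page - l->donePos.page >= cleanup_delay;
}
```

A listener that was directly advanced has a recent pos but a stale donePos, so in this model it still holds back `safe_tail` and is still selected by `needs_wakeup`, which matches the reasoning in the patch comments.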
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-19 22:10 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-19 22:10 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Mon, Oct 20, 2025, at 00:06, Joel Jacobson wrote:
> Attachments:
> * 0001-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20-alt1.txt
> * 0002-optimize_listen_notify-v20-alt3.txt
> * 0002-optimize_listen_notify-v20-alt2.txt
My apologies, I forgot to attach 0002-optimize_listen_notify-v20.patch.
/Joel
From c403098ae4e4d06f109eb6292a67c6577e123010 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:35:44 +0200
Subject: [PATCH] Implement idea #3
---
src/backend/commands/async.c | 150 ++++++++++++++++++++---------------
1 file changed, 85 insertions(+), 65 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..b34e4a2247b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -2304,6 +2309,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool reachedStop;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2372,77 +2378,69 @@ asyncQueueReadAllNotifications(void)
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If
* that happens it is critical that we not try to send the same message
- * over and over again. Therefore, we place a PG_TRY block here that will
- * forcibly advance our queue position before we lose control to an error.
- * (We could alternatively retake NotifyQueueLock and move the position
- * before handling each individual message, but that seems like too much
- * lock traffic.)
+ * over and over again. Therefore, we must advance our queue position
+ * regularly as we process messages.
+ *
+ * We must also be careful about concurrency: SignalBackends() can
+ * directly advance our position while we're reading. To preserve such
+ * advancement, asyncQueueProcessPageEntries updates our position in
+ * shared memory for each message, only writing if our position is ahead.
+ * Shared lock is sufficient since we're only updating our own position.
*/
- PG_TRY();
+ do
{
- bool reachedStop;
+ int64 curpage = QUEUE_POS_PAGE(pos);
+ int curoffset = QUEUE_POS_OFFSET(pos);
+ int slotno;
+ int copysize;
- do
+ /*
+ * We copy the data from SLRU into a local buffer, so as to avoid
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
+ */
+ slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+ InvalidTransactionId);
+ if (curpage == QUEUE_POS_PAGE(head))
{
- int64 curpage = QUEUE_POS_PAGE(pos);
- int curoffset = QUEUE_POS_OFFSET(pos);
- int slotno;
- int copysize;
+ /* we only want to read as far as head */
+ copysize = QUEUE_POS_OFFSET(head) - curoffset;
+ if (copysize < 0)
+ copysize = 0; /* just for safety */
+ }
+ else
+ {
+ /* fetch all the rest of the page */
+ copysize = QUEUE_PAGESIZE - curoffset;
+ }
+ memcpy(page_buffer.buf + curoffset,
+ NotifyCtl->shared->page_buffer[slotno] + curoffset,
+ copysize);
+ /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
- /*
- * We copy the data from SLRU into a local buffer, so as to avoid
- * holding the SLRU lock while we are examining the entries and
- * possibly transmitting them to our frontend. Copy only the part
- * of the page we will actually inspect.
- */
- slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
- if (curpage == QUEUE_POS_PAGE(head))
- {
- /* we only want to read as far as head */
- copysize = QUEUE_POS_OFFSET(head) - curoffset;
- if (copysize < 0)
- copysize = 0; /* just for safety */
- }
- else
- {
- /* fetch all the rest of the page */
- copysize = QUEUE_PAGESIZE - curoffset;
- }
- memcpy(page_buffer.buf + curoffset,
- NotifyCtl->shared->page_buffer[slotno] + curoffset,
- copysize);
- /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+ /*
+ * Process messages up to the stop position, end of page, or an
+ * uncommitted message.
+ *
+ * Our stop position is what we found to be the head's position
+ * when we entered this function. It might have changed already.
+ * But if it has, we will receive (or have already received and
+ * queued) another signal and come here again.
+ *
+ * We are not holding NotifyQueueLock here! The queue can only
+ * extend beyond the head pointer (see above).
+ * asyncQueueProcessPageEntries will update our backend's position
+ * for each message to ensure we don't reprocess messages if we fail
+ * partway through, and to preserve any direct advancement that
+ * SignalBackends() might perform concurrently.
+ */
+ reachedStop = asyncQueueProcessPageEntries(&pos, head,
+ page_buffer.buf,
+ snapshot);
- /*
- * Process messages up to the stop position, end of page, or an
- * uncommitted message.
- *
- * Our stop position is what we found to be the head's position
- * when we entered this function. It might have changed already.
- * But if it has, we will receive (or have already received and
- * queued) another signal and come here again.
- *
- * We are not holding NotifyQueueLock here! The queue can only
- * extend beyond the head pointer (see above) and we leave our
- * backend's pointer where it is so nobody will truncate or
- * rewrite pages under us. Especially we don't want to hold a lock
- * while sending the notifications to the frontend.
- */
- reachedStop = asyncQueueProcessPageEntries(&pos, head,
- page_buffer.buf,
- snapshot);
- } while (!reachedStop);
- }
- PG_FINALLY();
- {
- /* Update shared state */
- LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
- LWLockRelease(NotifyQueueLock);
- }
- PG_END_TRY();
+ } while (!reachedStop);
/* Done with snapshot */
UnregisterSnapshot(snapshot);
@@ -2490,6 +2488,24 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
*/
reachedEndOfPage = asyncQueueAdvance(current, qe->length);
+ /*
+ * Update our position in shared memory immediately after advancing,
+ * before we attempt to process the message. This ensures we won't
+ * reprocess this message if NotifyMyFrontEnd fails.
+ *
+ * Only write if our position is ahead of the shared position.
+ * If the shared position is already ahead (due to direct advancement
+ * by SignalBackends), preserve it by not overwriting.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ {
+ QueuePosition sharedPos = QUEUE_BACKEND_POS(MyProcNumber);
+
+ if (QUEUE_POS_PRECEDES(sharedPos, *current))
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ }
+ LWLockRelease(NotifyQueueLock);
+
/* Ignore messages destined for other databases */
if (qe->dboid == MyDatabaseId)
{
@@ -2515,6 +2531,10 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
* messages.
*/
*current = thisentry;
+ /* Update shared memory to reflect the backed-up position */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ LWLockRelease(NotifyQueueLock);
reachedStop = true;
break;
}
--
2.50.1
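The core of idea #3 is the "only write if ahead" rule applied after each message. A minimal sketch of that rule, assuming a simplified position type: Pos, pos_precedes and publish_if_ahead are hypothetical names, the LWLock calls are left out, and pages are compared as plain integers in place of asyncQueuePagePrecedes().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for QueuePosition. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

/* Mirrors the shape of QUEUE_POS_PRECEDES: true if a comes before b. */
static bool
pos_precedes(Pos a, Pos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/*
 * Publish our local position into the shared slot, but only when that moves
 * the slot forward.  If another backend already advanced the slot past us,
 * leave it alone.  Returns true if the slot was written.
 */
static bool
publish_if_ahead(Pos *shared, Pos local)
{
    if (pos_precedes(*shared, local))
    {
        *shared = local;
        return true;
    }
    return false;
}
```

Note that the patch's back-up path for an uncommitted message writes the shared position unconditionally; the sketch covers only the forward-moving case.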
From 928cc032706ac154153279adbdfba95f6af2fae4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:12:47 +0200
Subject: [PATCH] Implement idea #2: donePos
---
src/backend/commands/async.c | 57 +++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 8 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..c81807107d1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -285,7 +285,8 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +348,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +676,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1315,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2048,6 +2052,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition donePos;
int64 lag;
int32 pid;
@@ -2055,6 +2060,7 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ donePos = QUEUE_BACKEND_DONEPOS(i);
/* Direct advancement for idle backends at the old head */
if (pendingNotifies != NULL &&
@@ -2064,9 +2070,17 @@ SignalBackends(void)
pos = queueHeadAfterWrite;
}
- /* Signal backends that have fallen too far behind */
+ /*
+ * Signal backends that have fallen too far behind.
+ *
+ * We use donePos rather than pos for the lag check because donePos
+ * is what matters for queue truncation (see asyncQueueAdvanceTail).
+ * A backend may have been directly advanced (pos is recent) while
+ * donePos is still far behind, holding back the tail. We need to
+ * wake such backends so they can advance their donePos.
+ */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos));
+ QUEUE_POS_PAGE(donePos));
if (lag >= QUEUE_CLEANUP_DELAY)
{
@@ -2319,14 +2333,25 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
- /* Nothing to do, we have read all notifications already. */
+ /*
+ * Nothing to do, we have read all notifications already.
+ *
+ * Update donePos to match pos before returning. This is important
+ * when our position was advanced via direct advancement: we need to
+ * update donePos so the queue tail can advance. Without this,
+ * backends that have been directly advanced would hold back queue
+ * truncation indefinitely.
+ */
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -2437,9 +2462,19 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos variable),
+ * as this is used for truncation safety. For the read position (pos),
+ * we use the maximum of our local position and the current shared
+ * position, in case another backend used direct advancement to skip us
+ * ahead while we were reading. This prevents us from going backwards
+ * and potentially pointing to a truncated page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2589,7 +2624,13 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
--
2.50.1
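To make the split between pos and donePos concrete, here is a minimal sketch of how the tail and the lag check would both be driven by donePos. It is an illustration under stated assumptions, not the patch itself: Pos, Backend, queue_tail and needs_wakeup are hypothetical names, locking is omitted, and CLEANUP_DELAY_PAGES is a placeholder for QUEUE_CLEANUP_DELAY.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for QueuePosition and QueueBackendStatus. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

typedef struct
{
    Pos pos;        /* next position to read; others may advance it */
    Pos done_pos;   /* what this backend has definitively processed */
} Backend;

#define CLEANUP_DELAY_PAGES 4   /* placeholder for QUEUE_CLEANUP_DELAY */

static Pos
pos_min(Pos a, Pos b)
{
    if (a.page != b.page)
        return a.page < b.page ? a : b;
    return a.offset < b.offset ? a : b;
}

/* New tail: the minimum done_pos over all listeners, starting from head. */
static Pos
queue_tail(const Backend *b, size_t n, Pos head)
{
    Pos min = head;

    for (size_t i = 0; i < n; i++)
        min = pos_min(min, b[i].done_pos);
    return min;
}

/* A backend holds back truncation when its done_pos lags, whatever its pos is. */
static bool
needs_wakeup(const Backend *b, Pos head)
{
    return head.page - b->done_pos.page >= CLEANUP_DELAY_PAGES;
}
```

A backend whose pos was directly advanced to the head still pins the tail at its donePos until it wakes up and catches up, which is why the lag check looks at donePos.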
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v20.patch (9.3K, 2-0001-optimize_listen_notify-v20.patch)
From f37095250521d0a29d812997b7b79d938ed9c894 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:

* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
  are properly delivered to a listener in another backend

This also adds a test to prepare for the next patch:

* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v20.patch (35.1K, 3-0002-optimize_listen_notify-v20.patch)
From 9e5b980ee2b4f054f458c30772c9463d09930fa4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.

Problem
-------

At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.

That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.

Overview of the solution
------------------------

This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.

At commit time:

* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
  or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
  listening on the channels being notified in the current database, and
  signals only those.

Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.

Direct advancement
------------------

A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.

While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.

SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.

Queue health
------------

If a backend has fallen too far behind (lag >= QUEUE_CLEANUP_DELAY
pages), it is signaled to catch up so the global queue tail can advance.

Other notes
-----------

* Maintains dual data structures: a shared channelHash for determining
  which backends to signal, and a local per-backend listenChannels list
  for fast lock-free lookups during notification processing. This avoids
  contention on the shared hash during the high-frequency IsListeningOn
  checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
  listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
  NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
  only.
---
src/backend/commands/async.c | 598 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 553 insertions(+), 49 deletions(-)
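
The signaling rules described in the commit message above can be sketched as a single pass over the listeners. This is a rough illustration under simplifying assumptions, not the code in the diff below: Pos, Listener and signal_listeners are hypothetical names, the channel hash lookup is reduced to a precomputed "listening" flag, new_head is treated as the current queue head, CLEANUP_DELAY_PAGES is a placeholder for QUEUE_CLEANUP_DELAY, and "sending a signal" is just counted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the patch's shared-memory types. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

typedef struct
{
    Pos  pos;             /* next queue position this listener will read */
    bool listening;       /* listens on a channel notified by this commit? */
    bool wakeup_pending;  /* signal already sent but not yet processed */
} Listener;

#define CLEANUP_DELAY_PAGES 4   /* placeholder for QUEUE_CLEANUP_DELAY */

static bool
pos_equal(Pos a, Pos b)
{
    return a.page == b.page && a.offset == b.offset;
}

/* Returns the number of signals that would be sent. */
static int
signal_listeners(Listener *ls, size_t n, Pos old_head, Pos new_head)
{
    int sent = 0;

    for (size_t i = 0; i < n; i++)
    {
        bool wake = ls[i].listening;

        /* Direct advancement: an uninterested listener sitting exactly at
         * old_head can skip [old_head, new_head) without being woken. */
        if (!wake && pos_equal(ls[i].pos, old_head))
            ls[i].pos = new_head;

        /* Queue health: wake anyone who has fallen too many pages behind. */
        if (new_head.page - ls[i].pos.page >= CLEANUP_DELAY_PAGES)
            wake = true;

        /* Avoid duplicate signals while an earlier wakeup is outstanding. */
        if (wake && !ls[i].wakeup_pending)
        {
            ls[i].wakeup_pending = true;
            sent++;
        }
    }
    return sent;
}
```

With one interested listener, one idle listener at the old head, one listener that already has a wakeup pending, and one far-behind listener, only the first and the last are signaled; the idle one is moved to the new head without a wakeup.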
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..4e6556fb8d1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,27 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannels) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
+ *
+ * To maintain queue health, SignalBackends() also wakes one backend
+ * positioned at the global queue tail to help advance it, and signals
+ * any backend that has fallen too far behind to catch up. These measures
+ * prevent the notification queue from growing indefinitely, while mostly
+ * limiting wakeups to the backends that actually need them.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +141,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +151,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +179,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -227,8 +267,8 @@ typedef struct QueuePosition
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +286,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
/*
@@ -260,9 +301,9 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends, change the head pointer, and advance other
+ * backends' queue positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +329,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +347,7 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
* The SLRU buffer area through which we access the notification queue
@@ -391,6 +438,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +449,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +471,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +524,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Compute the difference between two queue page numbers.
@@ -478,6 +548,80 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +665,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -657,6 +805,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -894,6 +1043,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1100,21 @@ PreCommit_Notify(void)
*/
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
@@ -939,12 +1133,33 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /*
+ * On the first iteration, save the queue head position before we
+ * write any notifications. This is used by SignalBackends() to
+ * identify backends that can be advanced directly without waking
+ * them up.
+ */
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+
+ /*
+ * Capture the queue head after each batch of entries. On the
+ * last iteration, this gives us the final queue head position for
+ * SignalBackends() to use when advancing idle backends.
+ */
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -1135,6 +1350,10 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
MemoryContext oldcontext;
/* Do nothing if we are already listening on this channel */
@@ -1152,21 +1371,84 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Remove the specified channel from the list of channels we are listening on.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
ListCell *q;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ /* Remove from our local cache */
foreach(q, listenChannels)
{
char *lchan = (char *) lfirst(q);
@@ -1179,6 +1461,46 @@ Exec_UnlistenCommit(const char *channel)
}
}
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
+ }
+ }
+
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,11 +1515,51 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /* Clear our local cache */
list_free_deep(listenChannels);
listenChannels = NIL;
+
+ /* Now clear from shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
@@ -1565,12 +1927,19 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on channels with pending notifies, since only those
+ * backends are interested in the notifies we send.
+ *
+ * Backends that are still positioned at the queue head from before our
+ * commit can be safely advanced directly to the new head, since the
+ * queue region we wrote is known to contain only our own notifications.
+ * This avoids unnecessary wakeups when there is nothing of interest to
+ * them.
+ *
+ * In addition, if a backend has fallen too far behind in the queue, we
+ * signal it so that it will advance its position and allow the global
+ * tail pointer to move forward.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1952,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1973,111 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
+
+ if (channelHash != NULL)
+ {
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
+ if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
+ continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement and lagging backend detection.
+ *
+ * Direct advancement: avoid waking backends still positioned at the old
+ * queue head that aren't interested in our notifications.
+ *
+ * The heavyweight lock on database 0 (held in PreCommit_Notify) ensures
+ * no other backend can insert notifications in the region we just wrote.
+ * Even though we may take and release NotifyQueueLock multiple times
+ * while writing, the heavyweight lock guarantees this region contains
+ * only our messages. Therefore, any backend still positioned at the
+ * queue head from before our write can be safely advanced to the current
+ * queue head without waking it.
+ *
+ * False-positive possibility: if a backend was previously signaled but
+ * hasn't yet awoken, we'll skip advancing it (because wakeupPending is
+ * true). This is safe - the backend will advance its pointer when it
+ * does wake up. The alternative (advancing it anyway) would risk
+ * advancing over notifications from whoever signaled it.
+ *
+ * Lagging backends: we also check if any backend has fallen too far
+ * behind and signal it to catch up, allowing the global tail to advance.
+ */
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- int32 pid = QUEUE_BACKEND_PID(i);
QueuePosition pos;
+ int64 lag;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
- Assert(pid != InvalidPid);
pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+
+ /* Direct advancement for idle backends at the old head */
+ if (pendingNotifies != NULL &&
+ QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
- if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
- continue;
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ pos = queueHeadAfterWrite;
}
- else
+
+ /* Signal backends that have fallen too far behind */
+ lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(pos));
+
+ if (lag >= QUEUE_CLEANUP_DELAY)
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ pid = QUEUE_BACKEND_PID(i);
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1865,6 +2316,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2290,13 +2742,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2763,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2786,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2859,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2871,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2882,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..5ccdd4043e8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
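
For readers skimming the patch above, here is a minimal standalone sketch of the per-listener decision made in the second loop of SignalBackends(): advance a listener that is still parked at the pre-write head, then signal it only if it is lagging. This is not code from the patch. Pos, Listener, advance_or_signal and SKETCH_CLEANUP_DELAY are hypothetical stand-ins, the page lag is plain subtraction instead of asyncQueuePageDiff(), and the pendingNotifies != NULL check and all locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value, standing in for QUEUE_CLEANUP_DELAY. */
#define SKETCH_CLEANUP_DELAY 4

/* Simplified stand-in for QueuePosition. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

/* Simplified stand-in for one QueueBackendStatus slot. */
typedef struct
{
    Pos  pos;             /* listener has read the queue up to here */
    bool wakeup_pending;  /* signal sent but not yet processed */
} Listener;

static bool
pos_equal(Pos a, Pos b)
{
    return a.page == b.page && a.offset == b.offset;
}

/*
 * Returns true if the caller should signal this listener.
 * May advance l->pos as a side effect.
 */
static bool
advance_or_signal(Listener *l, Pos head_before, Pos head_after, Pos head_now)
{
    /* A signal is already in flight: neither advance nor signal again. */
    if (l->wakeup_pending)
        return false;

    /*
     * The listener is parked exactly where the notifier started writing,
     * so the region it would have to read holds only the notifier's own
     * entries. Skip over it without waking the listener.
     */
    if (pos_equal(l->pos, head_before))
        l->pos = head_after;

    /* Wake listeners that lag far enough to hold back the queue tail. */
    if (head_now.page - l->pos.page >= SKETCH_CLEANUP_DELAY)
    {
        l->wakeup_pending = true;
        return true;
    }
    return false;
}
```

In the sketch, a listener with wakeup_pending set is left completely untouched, which mirrors the patch's reasoning that advancing it could skip notifications from whoever signaled it.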
[text/plain] 0002-optimize_listen_notify-v20-alt1.txt (6.1K, 4-0002-optimize_listen_notify-v20-alt1.txt)
From afff0f3f8b01cfde369c564025313e6acc9a610a Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:08:05 +0200
Subject: [PATCH] Implements idea #1: advisoryPos
---
src/backend/commands/async.c | 63 +++++++++++++++++++++++++++++++++---
1 file changed, 58 insertions(+), 5 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..6a02f5e3acc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* backend could skip queue to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +681,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1320,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2031,9 +2040,13 @@ SignalBackends(void)
* Even though we may take and release NotifyQueueLock multiple times
* while writing, the heavyweight lock guarantees this region contains
* only our messages. Therefore, any backend still positioned at the
- * queue head from before our write can be safely advanced to the current
+ * queue head from before our write can be advised to skip to the current
* queue head without waking it.
*
+ * We use the advisoryPos field rather than directly modifying pos.
+ * The backend controls its own pos field and will check advisoryPos
+ * when it's safe to do so.
+ *
* False-positive possibility: if a backend was previously signaled but
* hasn't yet awoken, we'll skip advancing it (because wakeupPending is
* true). This is safe - the backend will advance its pointer when it
@@ -2048,6 +2061,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition advisoryPos;
int64 lag;
int32 pid;
@@ -2055,15 +2069,31 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
- /* Direct advancement for idle backends at the old head */
+ /*
+ * Direct advancement for idle backends at the old head.
+ *
+ * We check advisoryPos rather than pos to allow accumulating advances
+ * from multiple consecutive notifying backends. If we checked pos,
+ * only the first notifier could advance idle backends; subsequent
+ * notifiers would find pos unchanged (since the backend hasn't woken
+ * up yet) and fail to advance further.
+ */
if (pendingNotifies != NULL &&
- QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ QUEUE_POS_EQUAL(advisoryPos, queueHeadBeforeWrite))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
- pos = queueHeadAfterWrite;
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ advisoryPos = queueHeadAfterWrite;
}
+ /*
+ * For lag calculation, use whichever position is further ahead.
+ * This ensures we don't spuriously wake a backend that has been
+ * directly advanced.
+ */
+ pos = QUEUE_POS_MAX(pos, advisoryPos);
+
/* Signal backends that have fallen too far behind */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
@@ -2302,6 +2332,7 @@ static void
asyncQueueReadAllNotifications(void)
{
volatile QueuePosition pos;
+ QueuePosition advisoryPos;
QueuePosition head;
Snapshot snapshot;
@@ -2319,6 +2350,21 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
+
+ /*
+ * Check if another backend has set an advisory position for us.
+ * If so, and if we haven't yet read past that point, we can safely
+ * adopt the advisory position and skip the intervening notifications.
+ */
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (!QUEUE_POS_EQUAL(advisoryPos, pos) &&
+ QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
+
LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
@@ -2440,6 +2486,13 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ /*
+ * Advance advisoryPos to our current position if it has fallen behind,
+ * but preserve any newer advisory position that may have been set by
+ * another backend while we were processing notifications.
+ */
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) =
+ QUEUE_POS_MAX(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
--
2.50.1
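
The advisoryPos handshake in the patch above can be sketched in isolation as follows. Notifiers only ever touch the advisory field, and the listener folds it into its own position when it next reads. This is a rough illustration under simplifying assumptions, not the patch's code: Pos, Slot, advise_skip, begin_read and finish_read are hypothetical names, page comparison is a plain less-than, and NotifyQueueLock is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for QueuePosition. */
typedef struct
{
    int64_t page;
    int     offset;
} Pos;

/* The two per-backend fields the alt1 patch works with. */
typedef struct
{
    Pos pos;       /* written only by the owning listener */
    Pos advisory;  /* may be moved forward by notifiers */
} Slot;

static bool
pos_equal(Pos a, Pos b)
{
    return a.page == b.page && a.offset == b.offset;
}

/* True if a comes strictly before b in queue order (no wraparound). */
static bool
pos_precedes(Pos a, Pos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

static Pos
pos_max(Pos a, Pos b)
{
    return pos_precedes(a, b) ? b : a;
}

/* Notifier side: extend the advisory position if it sits at our old head. */
static void
advise_skip(Slot *s, Pos head_before, Pos head_after)
{
    if (pos_equal(s->advisory, head_before))
        s->advisory = head_after;
}

/* Listener side, before reading: adopt the advisory position if ahead. */
static Pos
begin_read(Slot *s)
{
    if (pos_precedes(s->pos, s->advisory))
        s->pos = s->advisory;
    return s->pos;
}

/* Listener side, after reading up to read_upto: keep any newer advice. */
static void
finish_read(Slot *s, Pos read_upto)
{
    s->pos = read_upto;
    s->advisory = pos_max(read_upto, s->advisory);
}
```

Comparing against advisory rather than pos in advise_skip is what lets several consecutive notifiers chain their skips while the listener stays asleep, which is the motivation given in the patch's comment.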
[text/plain] 0002-optimize_listen_notify-v20-alt3.txt (7.6K, 5-0002-optimize_listen_notify-v20-alt3.txt)
From c403098ae4e4d06f109eb6292a67c6577e123010 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:35:44 +0200
Subject: [PATCH] Implement idea #3
---
src/backend/commands/async.c | 150 ++++++++++++++++++++---------------
1 file changed, 85 insertions(+), 65 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..b34e4a2247b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -2304,6 +2309,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool reachedStop;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2372,77 +2378,69 @@ asyncQueueReadAllNotifications(void)
* It is possible that we fail while trying to send a message to our
* frontend (for example, because of encoding conversion failure). If
* that happens it is critical that we not try to send the same message
- * over and over again. Therefore, we place a PG_TRY block here that will
- * forcibly advance our queue position before we lose control to an error.
- * (We could alternatively retake NotifyQueueLock and move the position
- * before handling each individual message, but that seems like too much
- * lock traffic.)
+ * over and over again. Therefore, we must advance our queue position
+ * regularly as we process messages.
+ *
+ * We must also be careful about concurrency: SignalBackends() can
+ * directly advance our position while we're reading. To preserve such
+ * advancement, asyncQueueProcessPageEntries updates our position in
+ * shared memory for each message, only writing if our position is ahead.
+ * Shared lock is sufficient since we're only updating our own position.
*/
- PG_TRY();
+ do
{
- bool reachedStop;
+ int64 curpage = QUEUE_POS_PAGE(pos);
+ int curoffset = QUEUE_POS_OFFSET(pos);
+ int slotno;
+ int copysize;
- do
+ /*
+ * We copy the data from SLRU into a local buffer, so as to avoid
+ * holding the SLRU lock while we are examining the entries and
+ * possibly transmitting them to our frontend. Copy only the part
+ * of the page we will actually inspect.
+ */
+ slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
+ InvalidTransactionId);
+ if (curpage == QUEUE_POS_PAGE(head))
{
- int64 curpage = QUEUE_POS_PAGE(pos);
- int curoffset = QUEUE_POS_OFFSET(pos);
- int slotno;
- int copysize;
+ /* we only want to read as far as head */
+ copysize = QUEUE_POS_OFFSET(head) - curoffset;
+ if (copysize < 0)
+ copysize = 0; /* just for safety */
+ }
+ else
+ {
+ /* fetch all the rest of the page */
+ copysize = QUEUE_PAGESIZE - curoffset;
+ }
+ memcpy(page_buffer.buf + curoffset,
+ NotifyCtl->shared->page_buffer[slotno] + curoffset,
+ copysize);
+ /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
+ LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
- /*
- * We copy the data from SLRU into a local buffer, so as to avoid
- * holding the SLRU lock while we are examining the entries and
- * possibly transmitting them to our frontend. Copy only the part
- * of the page we will actually inspect.
- */
- slotno = SimpleLruReadPage_ReadOnly(NotifyCtl, curpage,
- InvalidTransactionId);
- if (curpage == QUEUE_POS_PAGE(head))
- {
- /* we only want to read as far as head */
- copysize = QUEUE_POS_OFFSET(head) - curoffset;
- if (copysize < 0)
- copysize = 0; /* just for safety */
- }
- else
- {
- /* fetch all the rest of the page */
- copysize = QUEUE_PAGESIZE - curoffset;
- }
- memcpy(page_buffer.buf + curoffset,
- NotifyCtl->shared->page_buffer[slotno] + curoffset,
- copysize);
- /* Release lock that we got from SimpleLruReadPage_ReadOnly() */
- LWLockRelease(SimpleLruGetBankLock(NotifyCtl, curpage));
+ /*
+ * Process messages up to the stop position, end of page, or an
+ * uncommitted message.
+ *
+ * Our stop position is what we found to be the head's position
+ * when we entered this function. It might have changed already.
+ * But if it has, we will receive (or have already received and
+ * queued) another signal and come here again.
+ *
+ * We are not holding NotifyQueueLock here! The queue can only
+ * extend beyond the head pointer (see above).
+ * asyncQueueProcessPageEntries will update our backend's position
+ * for each message to ensure we don't reprocess messages if we fail
+ * partway through, and to preserve any direct advancement that
+ * SignalBackends() might perform concurrently.
+ */
+ reachedStop = asyncQueueProcessPageEntries(&pos, head,
+ page_buffer.buf,
+ snapshot);
- /*
- * Process messages up to the stop position, end of page, or an
- * uncommitted message.
- *
- * Our stop position is what we found to be the head's position
- * when we entered this function. It might have changed already.
- * But if it has, we will receive (or have already received and
- * queued) another signal and come here again.
- *
- * We are not holding NotifyQueueLock here! The queue can only
- * extend beyond the head pointer (see above) and we leave our
- * backend's pointer where it is so nobody will truncate or
- * rewrite pages under us. Especially we don't want to hold a lock
- * while sending the notifications to the frontend.
- */
- reachedStop = asyncQueueProcessPageEntries(&pos, head,
- page_buffer.buf,
- snapshot);
- } while (!reachedStop);
- }
- PG_FINALLY();
- {
- /* Update shared state */
- LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
- LWLockRelease(NotifyQueueLock);
- }
- PG_END_TRY();
+ } while (!reachedStop);
/* Done with snapshot */
UnregisterSnapshot(snapshot);
@@ -2490,6 +2488,24 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
*/
reachedEndOfPage = asyncQueueAdvance(current, qe->length);
+ /*
+ * Update our position in shared memory immediately after advancing,
+ * before we attempt to process the message. This ensures we won't
+ * reprocess this message if NotifyMyFrontEnd fails.
+ *
+ * Only write if our position is ahead of the shared position.
+ * If the shared position is already ahead (due to direct advancement
+ * by SignalBackends), preserve it by not overwriting.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ {
+ QueuePosition sharedPos = QUEUE_BACKEND_POS(MyProcNumber);
+
+ if (QUEUE_POS_PRECEDES(sharedPos, *current))
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ }
+ LWLockRelease(NotifyQueueLock);
+
/* Ignore messages destined for other databases */
if (qe->dboid == MyDatabaseId)
{
@@ -2515,6 +2531,10 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
* messages.
*/
*current = thisentry;
+ /* Update shared memory to reflect the backed-up position */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ QUEUE_BACKEND_POS(MyProcNumber) = *current;
+ LWLockRelease(NotifyQueueLock);
reachedStop = true;
break;
}
--
2.50.1
[text/plain] 0002-optimize_listen_notify-v20-alt2.txt (5.6K, 6-0002-optimize_listen_notify-v20-alt2.txt)
From 928cc032706ac154153279adbdfba95f6af2fae4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:12:47 +0200
Subject: [PATCH] Implement idea #2: donePos
---
src/backend/commands/async.c | 57 +++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 8 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..c81807107d1 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -285,7 +285,8 @@ typedef struct QueueBackendStatus
int32 pid; /* either a PID or InvalidPid */
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
- QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition pos; /* next position to read from */
+ QueuePosition donePos; /* backend has definitively processed up to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +348,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_DONEPOS(i) (asyncQueueControl->backend[i].donePos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +676,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_DONEPOS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1315,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2048,6 +2052,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition donePos;
int64 lag;
int32 pid;
@@ -2055,6 +2060,7 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ donePos = QUEUE_BACKEND_DONEPOS(i);
/* Direct advancement for idle backends at the old head */
if (pendingNotifies != NULL &&
@@ -2064,9 +2070,17 @@ SignalBackends(void)
pos = queueHeadAfterWrite;
}
- /* Signal backends that have fallen too far behind */
+ /*
+ * Signal backends that have fallen too far behind.
+ *
+ * We use donePos rather than pos for the lag check because donePos
+ * is what matters for queue truncation (see asyncQueueAdvanceTail).
+ * A backend may have been directly advanced (pos is recent) while
+ * donePos is still far behind, holding back the tail. We need to
+ * wake such backends so they can advance their donePos.
+ */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos));
+ QUEUE_POS_PAGE(donePos));
if (lag >= QUEUE_CLEANUP_DELAY)
{
@@ -2319,14 +2333,25 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
- /* Nothing to do, we have read all notifications already. */
+ /*
+ * Nothing to do, we have read all notifications already.
+ *
+ * Update donePos to match pos before returning. This is important
+ * when our position was advanced via direct advancement: we need to
+ * update donePos so the queue tail can advance. Without this,
+ * backends that have been directly advanced would hold back queue
+ * truncation indefinitely.
+ */
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -2437,9 +2462,19 @@ asyncQueueReadAllNotifications(void)
}
PG_FINALLY();
{
- /* Update shared state */
+ /*
+ * Update shared state.
+ *
+ * We update donePos to what we actually read (the local pos variable),
+ * as this is used for truncation safety. For the read position (pos),
+ * we use the maximum of our local position and the current shared
+ * position, in case another backend used direct advancement to skip us
+ * ahead while we were reading. This prevents us from going backwards
+ * and potentially pointing to a truncated page.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_DONEPOS(MyProcNumber) = pos;
+ QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_POS_MAX(pos, QUEUE_BACKEND_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2589,7 +2624,13 @@ asyncQueueAdvanceTail(void)
for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
{
Assert(QUEUE_BACKEND_PID(i) != InvalidPid);
- min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ /*
+ * Use donePos rather than pos for truncation safety. The donePos
+ * field represents what the backend has definitively processed, while
+ * pos can be advanced by other backends via direct advancement. This
+ * prevents truncating pages that a backend is still reading from.
+ */
+ min = QUEUE_POS_MIN(min, QUEUE_BACKEND_DONEPOS(i));
}
QUEUE_TAIL = min;
oldtailpage = QUEUE_STOP_PAGE;
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-20 05:12 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-20 05:12 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Mon, Oct 20, 2025, at 00:10, Joel Jacobson wrote:
> Attachments:
> * 0001-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20-alt1.txt
> * 0002-optimize_listen_notify-v20-alt3.txt
> * 0002-optimize_listen_notify-v20-alt2.txt
Attaching a new alt1 version that fixes a mistake in the lag calculation.
The previous version used max(pos, advisoryPos), which is wrong: in alt1
it is the backend itself that updates its 'pos' when it wakes up, and
'pos' is what asyncQueueAdvanceTail looks at.
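To make the difference concrete, here is a minimal standalone sketch, not code from the patch. Positions are reduced to plain page numbers, and CLEANUP_DELAY_PAGES, needs_wakeup_wrong and needs_wakeup_fixed are hypothetical names chosen for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the QUEUE_CLEANUP_DELAY threshold. */
#define CLEANUP_DELAY_PAGES 4

static int64_t
max_page(int64_t a, int64_t b)
{
    return a > b ? a : b;
}

/*
 * Mistaken variant: lag is measured from max(pos, advisoryPos), so a
 * backend whose advisoryPos was moved forward looks caught up even
 * though its real pos still holds back the tail.
 */
static bool
needs_wakeup_wrong(int64_t head, int64_t pos, int64_t advisory)
{
    return head - max_page(pos, advisory) >= CLEANUP_DELAY_PAGES;
}

/*
 * Fixed variant: lag is measured from pos only, since pos is what the
 * tail computation looks at in alt1.
 */
static bool
needs_wakeup_fixed(int64_t head, int64_t pos)
{
    return head - pos >= CLEANUP_DELAY_PAGES;
}
```

With head at page 20, pos at page 10 and advisoryPos at page 20, the first variant never wakes the backend, while the second does.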
/Joel
From 493f05130febbd8c4bc0bc2533e22ec6ddf6d5f5 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sun, 19 Oct 2025 08:08:05 +0200
Subject: [PATCH] Implements idea #1: advisoryPos
---
src/backend/commands/async.c | 67 ++++++++++++++++++++++++++++++++----
1 file changed, 61 insertions(+), 6 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4e6556fb8d1..4a8a6f5bf1b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -264,6 +264,11 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
@@ -286,6 +291,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* backend could skip queue to here */
bool wakeupPending; /* signal sent but not yet processed */
} QueueBackendStatus;
@@ -347,6 +353,7 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
/*
@@ -674,6 +681,7 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
}
}
@@ -1312,6 +1320,7 @@ Exec_ListenPreCommit(void)
prevListener = i;
}
QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
/* Insert backend into list of listeners at correct position */
@@ -2031,9 +2040,13 @@ SignalBackends(void)
* Even though we may take and release NotifyQueueLock multiple times
* while writing, the heavyweight lock guarantees this region contains
* only our messages. Therefore, any backend still positioned at the
- * queue head from before our write can be safely advanced to the current
+ * queue head from before our write can be advised to skip to the current
* queue head without waking it.
*
+ * We use the advisoryPos field rather than directly modifying pos.
+ * The backend controls its own pos field and will check advisoryPos
+ * when it's safe to do so.
+ *
* False-positive possibility: if a backend was previously signaled but
* hasn't yet awoken, we'll skip advancing it (because wakeupPending is
* true). This is safe - the backend will advance its pointer when it
@@ -2048,6 +2061,7 @@ SignalBackends(void)
i = QUEUE_NEXT_LISTENER(i))
{
QueuePosition pos;
+ QueuePosition advisoryPos;
int64 lag;
int32 pid;
@@ -2055,13 +2069,22 @@ SignalBackends(void)
continue;
pos = QUEUE_BACKEND_POS(i);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
- /* Direct advancement for idle backends at the old head */
+ /*
+ * Direct advancement for idle backends at the old head.
+ *
+ * We check advisoryPos rather than pos to allow accumulating advances
+ * from multiple consecutive notifying backends. If we checked pos,
+ * only the first notifier could advance idle backends; subsequent
+ * notifiers would find pos unchanged (since the backend hasn't woken
+ * up yet) and fail to advance further.
+ */
if (pendingNotifies != NULL &&
- QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ QUEUE_POS_EQUAL(advisoryPos, queueHeadBeforeWrite))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
- pos = queueHeadAfterWrite;
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ advisoryPos = queueHeadAfterWrite;
}
/* Signal backends that have fallen too far behind */
@@ -2302,6 +2325,7 @@ static void
asyncQueueReadAllNotifications(void)
{
volatile QueuePosition pos;
+ QueuePosition advisoryPos;
QueuePosition head;
Snapshot snapshot;
@@ -2319,11 +2343,35 @@ asyncQueueReadAllNotifications(void)
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
+
+ /*
+ * Check if another backend has set an advisory position for us.
+ * If so, and if we haven't yet read past that point, we can safely
+ * adopt the advisory position and skip the intervening notifications.
+ */
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (!QUEUE_POS_EQUAL(advisoryPos, pos) &&
+ QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
+
LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
- /* Nothing to do, we have read all notifications already. */
+ /*
+ * Nothing to do, we have read all notifications already.
+ * Before returning, update advisoryPos if it has fallen behind our
+ * current position, since we're bypassing the PG_FINALLY block that
+ * would normally do this.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) =
+ QUEUE_POS_MAX(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber));
+ LWLockRelease(NotifyQueueLock);
return;
}
@@ -2440,6 +2488,13 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ /*
+ * Advance advisoryPos to our current position if it has fallen behind,
+ * but preserve any newer advisory position that may have been set by
+ * another backend while we were processing notifications.
+ */
+ QUEUE_BACKEND_ADVISORY_POS(MyProcNumber) =
+ QUEUE_POS_MAX(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber));
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
--
2.50.1
Attachments:
[text/plain] 0002-optimize_listen_notify-v20-alt1-v2.txt (6.3K, 2-0002-optimize_listen_notify-v20-alt1-v2.txt)
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-20 16:43 Arseniy Mukhin <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 6 replies; 120+ messages in thread
From: Arseniy Mukhin @ 2025-10-20 16:43 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Tom Lane <[email protected]>; Chao Li <[email protected]>; pgsql-hackers
On Mon, Oct 20, 2025 at 1:07 AM Joel Jacobson <[email protected]> wrote:
>
> > Minute of brainstorming
> >
> > I also thought about a workload that probably frequently can be met.
> > Let's say we have sequence of notifications:
> >
> > F F F T F F F T F F F T
> >
> > Here F - notification from the channel we don't care about and T - the opposite.
> > It seems that after the first 'T' notification it will be more
> > difficult for notifying backends to do 'direct advancement' as there
> > will be some lag before the listener reads the notification and
> > advances its position. Not sure if it's a problem, probably it depends
> > on the intensity of notifications.
>
> Hmm, I realize both the advisoryPos and donePos ideas share a problem;
> they both require listening backends to wakeup eventually anyway,
> just to advance the 'pos'.
>
> The holy grail would be to avoid this context switching cost entirely,
> and only need to wakeup listening backends when they are actually
> interested in the queued notifications. I think the third idea,
> alt3, is most promising in achieving this goal.
>
Yeah, it would be great.
> > But maybe we can use a bit more
> > sophisticated data structure here? Something like a list of skip
> > ranges. Every entry in the list is the range (pos1, pos2) that the
> > listener can skip during the reading. So instead of advancing
> > advisoryPos every notifying backend should add skip range to the list.
> > Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
> > pos3) -> (pos1, pos3). We also can limit the number of entries to 5
> > for example. Listeners on their side should clear the list before
> > reading and skip all ranges from it. What do you think? Is it
> > overkill?
>
> Hmm, maybe, but I'm a bit wary about too much complication.
> Hopefully there is a simpler solution that avoids the need for this,
> but sure, if we can't find one, then I'm positive to try this skip ranges idea.
>
Yes, and it's probably worth benchmarking first to see whether it's a
problem at all before implementing anything.
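For reference while benchmarking, the skip-range idea could look roughly like the standalone sketch below. Everything in it is an assumption made for illustration: SkipRange, SkipList, skiplist_add and skiplist_apply are hypothetical names, positions are flattened to a single 64-bit integer instead of QueuePosition, and locking is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SKIP_RANGES 5       /* the example limit of 5 entries */

/* Hypothetical: a half-open range [start, end) the listener may skip. */
typedef struct SkipRange
{
    uint64_t    start;
    uint64_t    end;
} SkipRange;

typedef struct SkipList
{
    int         nranges;
    SkipRange   ranges[MAX_SKIP_RANGES];
} SkipList;

/*
 * Notifier side: record that [start, end) can be skipped.  Ranges are
 * assumed to arrive in queue order, so only the last entry can be a
 * neighbour of the new one.  Returns false when the list is full, in
 * which case the caller would have to signal the listener instead.
 */
static bool
skiplist_add(SkipList *list, uint64_t start, uint64_t end)
{
    if (list->nranges > 0 &&
        list->ranges[list->nranges - 1].end == start)
    {
        /* (pos1, pos2) & (pos2, pos3) -> (pos1, pos3) */
        list->ranges[list->nranges - 1].end = end;
        return true;
    }
    if (list->nranges >= MAX_SKIP_RANGES)
        return false;
    list->ranges[list->nranges].start = start;
    list->ranges[list->nranges].end = end;
    list->nranges++;
    return true;
}

/*
 * Listener side: if pos falls inside a recorded range, jump to the end
 * of that range; otherwise return pos unchanged.
 */
static uint64_t
skiplist_apply(const SkipList *list, uint64_t pos)
{
    for (int i = 0; i < list->nranges; i++)
    {
        if (list->ranges[i].start <= pos && pos < list->ranges[i].end)
            pos = list->ranges[i].end;
    }
    return pos;
}
```

The fixed-size array keeps the per-backend shared-memory footprint bounded; what to do when the list fills up is the part a benchmark would need to inform.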
> >> > A different line of thought could be to get rid of
> >> > asyncQueueReadAllNotifications's optimization of moving the
> >> > queue pos only once, per
> >> >
> >> > * (We could alternatively retake NotifyQueueLock and move the position
> >> > * before handling each individual message, but that seems like too much
> >> > * lock traffic.)
> >> >
> >> > Since we only need shared lock to advance our own queue pos,
> >> > maybe that wouldn't be too awful. Not sure.
> >>
> >> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
> >>
> >
> > Hmm, it seems we still have the race when in the beginning of
> > asyncQueueReadAllNotifications we read pos into the local variable and
> > release the lock. IIUC to avoid the race without introducing another
> > field here, the listener needs to hold the lock until it updates its
> > position so that the notifying backend cannot change it concurrently.
>
> *** 0002-optimize_listen_notify-v20-alt3.txt:
>
> * Fixed; the shared 'pos' is now only updated if the new position is ahead.
>
I managed to reproduce the race with v20-alt3. I tried to write a TAP
test reproducing the issue, so it was easier to validate changes.
Please find the attached TAP test. I added it to some random package
for simplicity.
Best regards,
Arseniy Mukhin
Attachments:
[application/octet-stream] 0001-TAP-test-with-listener-pos-race.patch.nocfbot (5.1K, 2-0001-TAP-test-with-listener-pos-race.patch.nocfbot)
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-23 08:16 Chao Li <[email protected]>
parent: Arseniy Mukhin <[email protected]>
5 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-23 08:16 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: Joel Jacobson <[email protected]>; Tom Lane <[email protected]>; pgsql-hackers
> On Oct 21, 2025, at 00:43, Arseniy Mukhin <[email protected]> wrote:
>
>
> I managed to reproduce the race with v20-alt3. I tried to write a TAP
> test reproducing the issue, so it was easier to validate changes.
> Please find the attached TAP test. I added it to some random package
> for simplicity.
>
With alt3, since we already acquire NotifyQueueLock after reading every message to update pos, I think we can optimize a little further:
The notifier, in SignalBackends():
* Currently we do “direct advancement” only if a listener’s pos equals beforeWritePos.
* We could instead advance whenever a listener’s pos is between beforeWritePos and afterWritePos.
The listener, in asyncQueueReadAllNotifications():
* With alt3, we only lock and update pos.
* We could do more: if the current pos in shared memory is ahead of the local pos, some notifier has done an advancement, so the listener can stop reading.
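The notifier-side change can be sketched as follows. This is a standalone illustration under stated assumptions, not patch code: Pos is a simplified stand-in for QueuePosition, the comparison ignores page wraparound, and can_advance_exact and can_advance_in_range are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for async.c's QueuePosition. */
typedef struct Pos
{
    int64_t     page;
    int         offset;
} Pos;

/* True if a comes strictly before b (plain comparison, no wraparound). */
static bool
pos_precedes(Pos a, Pos b)
{
    return a.page < b.page ||
        (a.page == b.page && a.offset < b.offset);
}

/* Current rule: the listener must sit exactly at the pre-write head. */
static bool
can_advance_exact(Pos listener, Pos before)
{
    return listener.page == before.page &&
        listener.offset == before.offset;
}

/* Proposed rule: before <= listener < after. */
static bool
can_advance_in_range(Pos listener, Pos before, Pos after)
{
    return !pos_precedes(listener, before) &&
        pos_precedes(listener, after);
}
```

The range rule accepts every position the exact rule accepts, plus listeners that are partway through the region the notifier just wrote.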
I tried to run your TAP test on my MacBook, but it failed:
```
t/008_listen-pos-race.pl .. Dubious, test returned 32 (wstat 8192, 0x2000)
No subtests run
Test Summary Report
-------------------
t/008_listen-pos-race.pl (Wstat: 8192 (exited 32) Tests: 0 Failed: 0)
Non-zero exit status: 32
Parse errors: No plan found in TAP output
Files=1, Tests=0, 3 wallclock secs ( 0.01 usr 0.01 sys + 0.10 cusr 0.29 csys = 0.41 CPU)
Result: FAIL
```
I didn’t spend time debugging the problem. If you can figure out what went wrong, maybe I can run the test on my side.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-23 10:02 Arseniy Mukhin <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Arseniy Mukhin @ 2025-10-23 10:02 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Joel Jacobson <[email protected]>; Tom Lane <[email protected]>; pgsql-hackers
Hi,
On Thu, Oct 23, 2025 at 11:17 AM Chao Li <[email protected]> wrote:
>
>
>
> > On Oct 21, 2025, at 00:43, Arseniy Mukhin <[email protected]> wrote:
> >
> >
> > I managed to reproduce the race with v20-alt3. I tried to write a TAP
> > test reproducing the issue, so it was easier to validate changes.
> > Please find the attached TAP test. I added it to some random package
> > for simplicity.
> >
>
> With alt3, as we have acquired the notification lock after reading every message to update the POS, I think we can do a little bit more optimization:
>
> The notifier: in SignalBackend()
> * Now we check if a listener’s pos equals to beforeWritePos, then we do “directly advancement”
> * We can change to if a listener’s pos is between beforeWritePos and afterWritePos, then we can do the advancement.
>
> The listener: in asyncQueueReadAllNotifications():
> * With alt3, we only lock and update pos
> * We can do more. If current pos in shared memory is after that local pos, then meaning some notifier has done an advancement, so it can stop reading.
>
I think this would be a reasonable optimization if it weren't for the
race condition mentioned above. The problem is that if the local pos
lags behind the shared memory pos, it could point to a truncated queue
segment, so we shouldn't allow that.
> I tried to run your TAP test on my MacBook, but failed:
>
> ```
> t/008_listen-pos-race.pl .. Dubious, test returned 32 (wstat 8192, 0x2000)
> No subtests run
>
> Test Summary Report
> -------------------
> t/008_listen-pos-race.pl (Wstat: 8192 (exited 32) Tests: 0 Failed: 0)
> Non-zero exit status: 32
> Parse errors: No plan found in TAP output
> Files=1, Tests=0, 3 wallclock secs ( 0.01 usr 0.01 sys + 0.10 cusr 0.29 csys = 0.41 CPU)
> Result: FAIL
> ```
>
> I didn’t spend time debugging the problem. If you can figure the problem, maybe I can run the test from my side.
>
Thank you for trying the test. I think it works as expected for you: it
is supposed to fail with an error, and I get the same exit status.
Sorry, I failed to realize this could be confusing. It probably would
have been better to fail on an assertion instead, but I thought an error
was enough for a temporary reproducer. Please see
008_listen-pos-race_test.log for details.
Best regards,
Arseniy Mukhin
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-26 04:11 Chao Li <[email protected]>
parent: Arseniy Mukhin <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-26 04:11 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: Joel Jacobson <[email protected]>; Tom Lane <[email protected]>; pgsql-hackers
> On Oct 23, 2025, at 18:02, Arseniy Mukhin <[email protected]> wrote:
>
> Hi,
>
> On Thu, Oct 23, 2025 at 11:17 AM Chao Li <[email protected]> wrote:
>>
>>
>>
>>> On Oct 21, 2025, at 00:43, Arseniy Mukhin <[email protected]> wrote:
>>>
>>>
>>> I managed to reproduce the race with v20-alt3. I tried to write a TAP
>>> test reproducing the issue, so it was easier to validate changes.
>>> Please find the attached TAP test. I added it to some random package
>>> for simplicity.
>>>
>>
>> With alt3, as we have acquired the notification lock after reading every message to update the POS, I think we can do a little bit more optimization:
>>
>> The notifier: in SignalBackend()
>> * Now we check if a listener’s pos equals to beforeWritePos, then we do “directly advancement”
>> * We can change to if a listener’s pos is between beforeWritePos and afterWritePos, then we can do the advancement.
>>
>> The listener: in asyncQueueReadAllNotifications():
>> * With alt3, we only lock and update pos
>> * We can do more. If current pos in shared memory is after that local pos, then meaning some notifier has done an advancement, so it can stop reading.
>>
>
> I think this would be a reasonable optimization if it weren't for the
> race condition mentioned above. The problem is that if the local pos
> lags behind the shared memory pos, it could point to a truncated queue
> segment, so we shouldn't allow that.
>
I figured out a way to resolve the race condition for alt3:
* add an awakening flag for every listener; this flag is only set by listeners
* add an advisory pos for every listener, similar to alt1
* if a listener is awake, the notifier only updates the listener’s advisory pos; otherwise it directly advances the listener’s position
* if a running listener sees that its current pos is behind the advisory pos, it stops reading
See the attached patch file for more details; I added code comments for my changes. Now the TAP test won’t hit the race condition.
```
# +++ tap check in src/test/authentication +++
t/008_listen-pos-race.pl .. skipped: Injection points not supported by this build
Files=1, Tests=0, 0 wallclock secs ( 0.00 usr 0.00 sys + 0.03 cusr 0.01 csys = 0.04 CPU)
Result: NOTESTS
```
And with my solution, listeners still use the local pos, so they no longer need to acquire the notification lock in every loop.
The patch stack is: v20 patch -> alt3 patch -> tap patch -> my patch. Please see if my solution works.
I also made a tiny change in the TAP script to allow it to terminate gracefully.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
[application/octet-stream] fix-race.patch (7.2K, 2-fix-race.patch)
download | inline diff:
commit c7daefa51118d2041623b14b7f26c9177ac0b6cd
Author: Chao Li (Evan) <[email protected]>
Date: Sat Oct 25 15:42:26 2025 +0800
fix race condition
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 6e8d728e9ce..3c8a640ebed 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -250,6 +250,11 @@ typedef struct QueuePosition
#define QUEUE_POS_EQUAL(x,y) \
((x).page == (y).page && (x).offset == (y).offset)
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
#define QUEUE_POS_IS_ZERO(x) \
((x).page == 0 && (x).offset == 0)
@@ -287,7 +292,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ QueuePosition advisoryPos; /* advisory position for this backend */
bool wakeupPending; /* signal sent but not yet processed */
+ bool awakening; /* backend is awakening */
} QueueBackendStatus;
/*
@@ -348,7 +355,9 @@ static dshash_table *channelHash = NULL;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_AWAKENING(i) (asyncQueueControl->backend[i].awakening)
/*
* The SLRU buffer area through which we access the notification queue
@@ -675,7 +684,9 @@ AsyncShmemInit(void)
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_AWAKENING(i) = false;
}
}
@@ -1954,6 +1965,7 @@ SignalBackends(void)
ProcNumber *procnos;
int count;
ListCell *lc;
+ List *interestedProcs = NIL;
INJECTION_POINT("listen-notify-signal-backends", NULL);
@@ -2002,6 +2014,12 @@ SignalBackends(void)
int32 pid;
QueuePosition pos;
+ // XXX: Use a list to record listeners interested in any of the pending channels.
// List is not the best choice, so if we decide to take this approach, we
+ // can optimize it later by using a hash or bitmap.
+ if (!list_member_int(interestedProcs, i))
+ interestedProcs = lappend_int(interestedProcs, i);
+
if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
@@ -2054,19 +2072,41 @@ SignalBackends(void)
int64 lag;
int32 pid;
- if (QUEUE_BACKEND_WAKEUP_PENDING(i))
- continue;
+ /* XXX we cannot rely on wakeupPending here, because the flag might be set by another notifier. */
+ //if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ // continue;
pos = QUEUE_BACKEND_POS(i);
- /* Direct advancement for idle backends at the old head */
- if (pendingNotifies != NULL &&
- QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ if (!list_member_int(interestedProcs, i))
{
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
- pos = queueHeadAfterWrite;
+ /* Direct advancement for idle backends at the old head */
+ if (pendingNotifies != NULL)
+ {
+ if (QUEUE_BACKEND_AWAKENING(i))
+ {
// For an awakening backend, advise a new position.
+ if ((QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) ||
+ (QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos) && QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))) &&
+ QUEUE_POS_EQUAL(QUEUE_BACKEND_ADVISORY_POS(i), queueHeadBeforeWrite))
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ }
+ else
+ {
+ // For non-awakening backend, directly advance its position.
+ if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ {
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
+ }
+ }
+ pos = queueHeadAfterWrite;
+ }
}
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
/* Signal backends that have fallen too far behind */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
@@ -2321,6 +2361,7 @@ asyncQueueReadAllNotifications(void)
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_AWAKENING(MyProcNumber) = true;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
LWLockRelease(NotifyQueueLock);
@@ -2330,6 +2371,9 @@ asyncQueueReadAllNotifications(void)
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ QUEUE_BACKEND_AWAKENING(MyProcNumber) = false;
+ LWLockRelease(NotifyQueueLock);
return;
}
@@ -2441,21 +2485,29 @@ asyncQueueReadAllNotifications(void)
page_buffer.buf,
snapshot);
- /*
- * Update our position in shared memory. The 'pos' variable now
- * holds our new position (advanced past all messages we just
- * processed). This ensures that if we fail while processing
- * messages from the next page, we won't reprocess the ones we
- * just handled. It also prevents us from overwriting any direct
- * advancement that another backend might have done while we were
- * processing messages.
- */
- LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
- LWLockRelease(NotifyQueueLock);
-
+ // If there is a direct advancement, let's stop reading.
+ // We don't need to lock here because even if the position
+ // changes right after we read it, we just do one more loop.
+ //LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ if (QUEUE_POS_PRECEDES(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber)))
+ {
+ reachedStop = true;
+ }
+ //QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ //LWLockRelease(NotifyQueueLock);
} while (!reachedStop);
+ LWLockAcquire(NotifyQueueLock, LW_SHARED);
+ if (QUEUE_POS_PRECEDES(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber)))
+ {
+ /* respect direct advancement */
+ pos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+ }
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(MyProcNumber), 0, 0);
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ QUEUE_BACKEND_AWAKENING(MyProcNumber) = false;
+ LWLockRelease(NotifyQueueLock);
+
/* Done with snapshot */
UnregisterSnapshot(snapshot);
}
diff --git a/src/test/authentication/t/008_listen-pos-race.pl b/src/test/authentication/t/008_listen-pos-race.pl
index 060e33ed391..c858e8da524 100644
--- a/src/test/authentication/t/008_listen-pos-race.pl
+++ b/src/test/authentication/t/008_listen-pos-race.pl
@@ -8,7 +8,7 @@ use PostgreSQL::Test::Utils;
use Time::HiRes qw(usleep);
use Test::More;
-if ($ENV{enable_injection_points} ne 'yes') {
+if (($ENV{enable_injection_points} // '') ne 'yes') {
plan skip_all => 'Injection points not supported by this build';
}
@@ -18,6 +18,7 @@ $node->start;
if (!$node->check_extension('injection_points')) {
+ $node->stop;
plan skip_all => 'Extension injection_points not installed';
}
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-26 06:33 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-26 06:33 UTC (permalink / raw)
To: Chao Li <[email protected]>; Arseniy Mukhin <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers
On Sun, Oct 26, 2025, at 05:11, Chao Li wrote:
> I figured out a way to resolve the race condition for alt3:
>
> * add an awakening flag for every listener, this flag is only set by
> listeners
> * add an advisory pos for every listener, similar to alt1
> * if a listener is awaken, notify only updates the listener’s advisory
> pos; otherwise directly advance its position.
> * If a running listener see current pos is behind advisory pos, then
> stop reading
>
> See more details in attach patch file, I added code comments for my
> changes. Now the TAP test won’t hit the race condition.
> ```
> # +++ tap check in src/test/authentication +++
> t/008_listen-pos-race.pl .. skipped: Injection points not supported by
> this build
> Files=1, Tests=0, 0 wallclock secs ( 0.00 usr 0.00 sys + 0.03 cusr
> 0.01 csys = 0.04 CPU)
> Result: NOTESTS
> ```
>
> And with my solution, listeners longer will still use local pos, so
> that no longer need to acquire notification lock in every loop.
This sounds promising, similar to what I had in mind. I was thinking
about the idea of using the advisoryPos only when the listening backend
is known to be running (which felt like it would need another shared
boolean field), and to move its pos field directly only when it's not
running, since if it's running we don't need to optimize for context
switching, since it's by definition already running.
What I wanted to investigate was all the concurrency situations
that we can imagine, i.e. to enumerate all the combinations
we can think of in a truth table, and reason about each case.
The ones I can think of are, from the perspective of SignalBackends,
reasoning about a specific listening backend:
{is interested in the notifications, is not interested in the notifications} x
{wakeupPending=false, wakeupPending=true} x
{pos < queueHeadBeforeWrite, pos == queueHeadBeforeWrite, pos > queueHeadBeforeWrite, pos == queueHeadAfterWrite, pos > queueHeadAfterWrite} x
{is running, is not running}
This gives 2x2x5x2=40 states to reason about. Some of these combinations
are probably impossible, but I still think it would be good to include
them and explain why they are impossible.
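To make the size of that table concrete, here is a minimal sketch (not part of any posted patch) that enumerates the combinations listed above. The function name `enumerate_states` and the label strings are hypothetical and only mirror the wording of the list; the sketch does not decide which rows are reachable.

```c
#include <assert.h>
#include <stdio.h>

#define LEN(a) ((int) (sizeof(a) / sizeof((a)[0])))

/* Hypothetical labels, copied from the four dimensions listed above. */
static const char *const interest[] = {"interested", "not interested"};
static const char *const wakeup[] = {"wakeupPending=false", "wakeupPending=true"};
static const char *const position[] = {
    "pos < queueHeadBeforeWrite",
    "pos == queueHeadBeforeWrite",
    "pos > queueHeadBeforeWrite",
    "pos == queueHeadAfterWrite",
    "pos > queueHeadAfterWrite",
};
static const char *const running[] = {"running", "not running"};

/* Print one row per combination and return the number of rows printed. */
static int
enumerate_states(FILE *out)
{
    int n = 0;

    for (int i = 0; i < LEN(interest); i++)
        for (int w = 0; w < LEN(wakeup); w++)
            for (int p = 0; p < LEN(position); p++)
                for (int r = 0; r < LEN(running); r++)
                    fprintf(out, "%2d: %s | %s | %s | %s\n", ++n,
                            interest[i], wakeup[w], position[p], running[r]);

    /* Sanity check: the row count is the product of the dimension sizes. */
    assert(n == LEN(interest) * LEN(wakeup) * LEN(position) * LEN(running));
    return n;
}
```

Calling `enumerate_states(stdout)` prints the 40 rows, which could serve as a checklist skeleton with an explanation attached to each row.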
> The patch stack is: v20 patch -> alt3 patch -> tap patch -> my patch.
> Please see if my solution works.
>
> I also made a tiny change in the TAP script to allow it to terminate gracefully.
I haven't looked at the code yet. I tried to apply the patch, but it fails:
shasum of files:
```
ca54ffa02ac54efd65acce0d09b18e630b5d7982 0001-optimize_listen_notify-v20.patch
5755701bb0e7ac7a0cea3abab9d74a0001b7b63a 0002-optimize_listen_notify-v20.patch
5819e23b5760023be70d2582207b72164904e952 0002-optimize_listen_notify-v20-alt3.txt
33d700dc0b3288d46705e85d381cb564d99079d1 0001-TAP-test-with-listener-pos-race.patch.nocfbot
8ee716451bd5f85761b666712bdfd8b5d936f92d fix-race.patch
```
Trying to apply them on top of current master (39dcfda2d23ac39f14ecf4b83e01eae85d07d9e5):
```
% git apply 0001-optimize_listen_notify-v20.patch
% git apply 0002-optimize_listen_notify-v20.patch
% git apply 0002-optimize_listen_notify-v20-alt3.txt
% git apply 0001-TAP-test-with-listener-pos-race.patch.nocfbot
% git apply fix-race.patch
fix-race.patch:100: indent with spaces.
(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos) && QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))) &&
error: patch failed: src/backend/commands/async.c:250
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/test/authentication/t/008_listen-pos-race.pl:8
error: src/test/authentication/t/008_listen-pos-race.pl: patch does not apply
```
I'll try to resolve it manually, but in case you're quicker to reply, I'm sending this now.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-26 07:08 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-10-26 07:08 UTC (permalink / raw)
To: Chao Li <[email protected]>; Arseniy Mukhin <[email protected]>; +Cc: Tom Lane <[email protected]>; pgsql-hackers
On Sun, Oct 26, 2025, at 07:33, Joel Jacobson wrote:
> Trying to apply them on top of current master
> (39dcfda2d23ac39f14ecf4b83e01eae85d07d9e5):
>
> ```
> % git apply 0001-optimize_listen_notify-v20.patch
> % git apply 0002-optimize_listen_notify-v20.patch
> % git apply 0002-optimize_listen_notify-v20-alt3.txt
> % git apply 0001-TAP-test-with-listener-pos-race.patch.nocfbot
> % git apply fix-race.patch
> fix-race.patch:100: indent with spaces.
> (QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos) &&
> QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))) &&
> error: patch failed: src/backend/commands/async.c:250
> error: src/backend/commands/async.c: patch does not apply
> error: patch failed: src/test/authentication/t/008_listen-pos-race.pl:8
> error: src/test/authentication/t/008_listen-pos-race.pl: patch does not
> apply
> ```
>
> I'll try to resolve it manually, but in case you're quicker to reply,
> I'm sending this now.
I see the problem; seems like you based fix-race.patch on top of
0002-optimize_listen_notify-v19-alt3.txt because fix-race.patch contains
this diff block which is only present in that version:
```
@@ -2441,21 +2485,29 @@ asyncQueueReadAllNotifications(void)
page_buffer.buf,
snapshot);
- /*
- * Update our position in shared memory. The 'pos' variable now
- * holds our new position (advanced past all messages we just
- * processed). This ensures that if we fail while processing
```
I've compared 0002-optimize_listen_notify-v19-alt3.txt with
0002-optimize_listen_notify-v20-alt3.txt, and the only differences are
the addition of QUEUE_POS_PRECEDES, which fix-race.patch also adds, and
some locking and pos handling changes.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-26 23:24 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-26 23:24 UTC (permalink / raw)
To: pgsql-hackers
On Sun, Oct 26, 2025, at 07:33, Joel Jacobson wrote:
> On Sun, Oct 26, 2025, at 05:11, Chao Li wrote:
>> I figured out a way to resolve the race condition for alt3:
>>
>> * add an awakening flag for every listener, this flag is only set by
>> listeners
>> * add an advisory pos for every listener, similar to alt1
>> * if a listener is awaken, notify only updates the listener’s advisory
>> pos; otherwise directly advance its position.
>> * If a running listener see current pos is behind advisory pos, then
>> stop reading
...
> This sounds promising, similar to what I had in mind. I was thinking
> about the idea of using the advisoryPos only when the listening backend
> is known to be running (which felt like it would need another shared
> boolean field), and to move its pos field directly only when it's not
> running, since if it's running we don't need to optimize for context
> switching, since it's by definition already running.
Write-up of changes since v20:
Two new fields have been added to QueueBackendStatus:
+ QueuePosition advisoryPos; /* safe skip-ahead position */
+ bool advancingPos; /* backend is reading the queue */
These are used by SignalBackends and asyncQueueReadAllNotifications to
handle the ephemeral state of the shared queue position, since we don't
take a lock while advancing it in asyncQueueReadAllNotifications.
In SignalBackends, we no longer signal laggers in other databases.
Instead, we signal any listening backend that could possibly be behind
the old queue head, since we can't know whether such a backend is
interested in notifications before the old queue head. Realistic
benchmarks will be needed to determine if this happens often enough to
warrant a more complex optimization, such as the ranges idea suggested
by Arseniy Mukhin.
In SignalBackends, if a backend that is uninterested in our
notifications has a shared pos at the old queue head, we check whether
it is currently advancing its pos. If it is not, we set its shared pos
to the new queue head, i.e. "direct advance" it. If it is, and its
advisory pos is behind our new queue head, we update its advisory pos
to our new queue head instead.
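As a rough illustration of that notifier-side decision, here is a simplified model, not the code from the patch. It assumes positions are plain increasing integers (the real QueuePosition is a page/offset pair compared with wraparound-aware helpers), ignores locking and signaling entirely, and uses hypothetical names (`ListenerView`, `notifier_visit_uninterested`, the `ACTION_*` values).

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the per-backend fields a notifier looks at. */
typedef struct ListenerView
{
    int64_t pos;          /* shared read position */
    int64_t advisoryPos;  /* safe skip-ahead position */
    bool    advancingPos; /* listener is currently reading the queue */
} ListenerView;

typedef enum NotifierAction
{
    ACTION_NONE,            /* not handled by this sketch (e.g. signaling) */
    ACTION_DIRECT_ADVANCE,  /* shared pos moved to the new head */
    ACTION_SET_ADVISORY     /* advisory pos moved to the new head */
} NotifierAction;

/*
 * Handle one listener that is NOT interested in the notifications just
 * written in the region [oldHead, newHead).
 */
static NotifierAction
notifier_visit_uninterested(ListenerView *l, int64_t oldHead, int64_t newHead)
{
    /* Only listeners sitting exactly at the old head are candidates. */
    if (l->pos != oldHead)
        return ACTION_NONE;

    if (!l->advancingPos)
    {
        /* Idle listener: skip it over our entries without waking it. */
        l->pos = newHead;
        return ACTION_DIRECT_ADVANCE;
    }

    /* Listener is mid-read: leave pos alone and only advise. */
    if (l->advisoryPos < newHead)
    {
        l->advisoryPos = newHead;
        return ACTION_SET_ADVISORY;
    }
    return ACTION_NONE;
}
```

The point of the split is that the notifier never overwrites the pos of a listener that is concurrently advancing it; the listener decides later whether to honor the advisory value.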
In asyncQueueReadAllNotifications, we start by setting wakeupPending to
false and advancingPos to true, to indicate that we've woken up and that
we will now start advancing the pos. We also check if the pos is behind
the advisory pos, and if so use the advisory pos to update the pos.
In asyncQueueReadAllNotifications's PG_FINALLY block, we reset
advancingPos to false and check whether SignalBackends set the
advisoryPos while we were processing messages on the queue. If it did,
and the advisoryPos is ahead of our pos, we update our shared pos with
the advisoryPos; otherwise we update the shared pos with the new pos.
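The listener side of the same protocol can be sketched in the same simplified way. Again this is only a model under stated assumptions: positions are plain integers, there is no locking or error handling, "advisoryPos was set while we were processing" is approximated by taking whichever of the two positions is further ahead, and the names `ListenerState`, `listener_begin_read`, and `listener_finish_read` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a listener's shared queue state. */
typedef struct ListenerState
{
    int64_t pos;           /* shared read position */
    int64_t advisoryPos;   /* safe skip-ahead position, set by notifiers */
    bool    wakeupPending; /* signal sent but not yet processed */
    bool    advancingPos;  /* listener is currently reading the queue */
} ListenerState;

/*
 * Start of a read pass: mark ourselves awake and reading, and adopt the
 * advisory position if it is ahead of our own. Returns the position to
 * start reading from.
 */
static int64_t
listener_begin_read(ListenerState *l)
{
    l->wakeupPending = false;
    l->advancingPos = true;
    if (l->pos < l->advisoryPos)
        l->pos = l->advisoryPos;
    return l->pos;
}

/*
 * End of a read pass (the cleanup step): stop advertising that we are
 * reading, then publish either the advisory position or our local one,
 * whichever is further ahead.
 */
static void
listener_finish_read(ListenerState *l, int64_t localPos)
{
    l->advancingPos = false;
    l->pos = (l->advisoryPos > localPos) ? l->advisoryPos : localPos;
}
```

For example, a listener that begins at 9, reads up to 12, and meanwhile receives an advisory pos of 20 ends the pass with its shared pos at 20 rather than 12.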
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v21.patch (9.3K, 2-0001-optimize_listen_notify-v21.patch)
download | inline diff:
From 44484a2ad88a5532d2e7f28b2e8ed9095634f084 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v21.patch (35.7K, 3-0002-optimize_listen_notify-v21.patch)
download | inline diff:
From 455c00a1cd63813f62f1bf16a2ec095cb027fff7 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
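To illustrate the lookup described above, here is a minimal sketch in plain C.
It is not the patch's code: the Toy* names, the fixed-size arrays, and the
linear search over entries are assumptions made to keep the example
self-contained (the patch uses a dshash keyed by (dboid, channel) and a
DSA-allocated ProcNumber array). It only shows how consulting a
channel-to-listeners map, combined with a per-backend wakeupPending flag,
yields each interested backend at most once.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BACKENDS  8
#define MAX_LISTENERS 4

/* Hypothetical, simplified stand-in for one channel's entry in the map. */
typedef struct
{
	const char *channel;
	int			listeners[MAX_LISTENERS];	/* backend numbers */
	int			numListeners;
} ToyChannelEntry;

/* Per-backend flag: a signal was sent and has not been processed yet. */
static bool toyWakeupPending[MAX_BACKENDS];

/*
 * Collect the backends to signal for the notified channels.  A backend
 * with a wakeup already pending is skipped, so it is returned at most once
 * even if it listens on several of the notified channels.
 */
static int
toy_collect_targets(const ToyChannelEntry *map, int nentries,
					const char *const *notified, int nnotified,
					int *targets)
{
	int			count = 0;

	for (int c = 0; c < nnotified; c++)
	{
		for (int e = 0; e < nentries; e++)
		{
			if (strcmp(map[e].channel, notified[c]) != 0)
				continue;		/* not a channel we notified */

			for (int j = 0; j < map[e].numListeners; j++)
			{
				int			backend = map[e].listeners[j];

				assert(backend >= 0 && backend < MAX_BACKENDS);
				if (toyWakeupPending[backend])
					continue;	/* already selected or signaled */
				toyWakeupPending[backend] = true;
				targets[count++] = backend;
			}
		}
	}
	return count;
}
```

With backends 1 and 2 listening on "a" and backends 2 and 3 on "b", notifying
both channels selects backends 1, 2 and 3, with backend 2 selected only once.
Backends listening on neither channel are never visited by this step.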
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
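The three cases that direct advancement distinguishes can be sketched as
follows. This is a simplified model under stated assumptions, not the patch's
code: ToyPos, ToyAction and toy_direct_advance are hypothetical names, queue
pages are compared with a plain "<" rather than asyncQueuePagePrecedes, and
the advisoryPos/advancingPos handling for a backend that is concurrently
reading the queue is left out. It applies to backends that were not already
selected for a signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified queue position: (page, offset). */
typedef struct
{
	int64_t		page;
	int			offset;
} ToyPos;

typedef enum
{
	TOY_ADVANCED,				/* position moved to newHead, no signal */
	TOY_SIGNAL,					/* backend is behind oldHead, must wake it */
	TOY_AHEAD					/* backend is already past oldHead */
} ToyAction;

static bool
toy_pos_equal(ToyPos a, ToyPos b)
{
	return a.page == b.page && a.offset == b.offset;
}

/* Returns true if a comes before b in queue order. */
static bool
toy_pos_precedes(ToyPos a, ToyPos b)
{
	return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/*
 * Decide what to do with one uninterested listener, given that the region
 * [oldHead, newHead) holds only the committing backend's own entries.
 */
static ToyAction
toy_direct_advance(ToyPos *pos, ToyPos oldHead, ToyPos newHead)
{
	if (toy_pos_equal(*pos, oldHead))
	{
		/* Nothing unread before our entries: skip over them. */
		*pos = newHead;
		return TOY_ADVANCED;
	}
	if (toy_pos_precedes(*pos, oldHead))
	{
		/* Unread entries of unknown interest lie before oldHead. */
		return TOY_SIGNAL;
	}
	return TOY_AHEAD;
}
```

Only the first case saves a wakeup. A backend that lags behind oldHead still
has to be signaled, because the sender cannot tell whether the entries it has
not yet read are of interest to it.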
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
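One detail behind the shared channelHash: the patch registers dshash_memcmp as
the key comparison function, and ChannelHashPrepareKey zeroes the whole key
before filling it in. The sketch below shows why that zeroing matters when
keys are compared bytewise. It is an illustration only: ToyKey,
toy_prepare_key, toy_keys_equal and TOY_NAMELEN are hypothetical stand-ins,
and a bounded memcpy is used in place of strlcpy for portability.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NAMELEN 64			/* stand-in for NAMEDATALEN */

/* Hypothetical fixed-size key: database id plus channel name. */
typedef struct
{
	unsigned int dboid;
	char		channel[TOY_NAMELEN];
} ToyKey;

/*
 * Fill in a key so that two keys for the same (dboid, channel) are
 * byte-for-byte identical.  Zeroing first clears the bytes after the
 * channel name's terminator, which would otherwise hold leftover data.
 */
static void
toy_prepare_key(ToyKey *key, unsigned int dboid, const char *channel)
{
	size_t		len = strlen(channel);

	if (len >= TOY_NAMELEN)
		len = TOY_NAMELEN - 1;	/* truncate, keep room for the NUL */

	memset(key, 0, sizeof(ToyKey));
	key->dboid = dboid;
	memcpy(key->channel, channel, len);
}

/* Bytewise comparison over the whole struct, as memcmp-based lookup does. */
static bool
toy_keys_equal(const ToyKey *a, const ToyKey *b)
{
	return memcmp(a, b, sizeof(ToyKey)) == 0;
}
```

Two keys prepared for the same database and channel compare equal even if the
structs held different leftover bytes beforehand. Without the memset, the
bytes following the name could differ and the lookup would miss.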
---
src/backend/commands/async.c | 613 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 566 insertions(+), 51 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..308f310c68c 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannels) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +135,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +145,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +173,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +258,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +285,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ QueuePosition advisoryPos; /* safe skip-ahead position */
+ bool advancingPos; /* backend is reading the queue */
} QueueBackendStatus;
/*
@@ -260,9 +302,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * advancing their own positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +331,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +349,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -391,6 +442,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +453,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +475,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +528,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Compute the difference between two queue page numbers.
@@ -478,6 +552,80 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +669,18 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_ADVANCING_POS(i) = false;
}
}
@@ -657,6 +811,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -894,6 +1049,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1106,21 @@ PreCommit_Notify(void)
*/
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
@@ -939,12 +1139,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -1135,6 +1343,10 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
MemoryContext oldcontext;
/* Do nothing if we are already listening on this channel */
@@ -1152,21 +1364,84 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
ListCell *q;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ /* Remove from our local cache */
foreach(q, listenChannels)
{
char *lchan = (char *) lfirst(q);
@@ -1179,6 +1454,46 @@ Exec_UnlistenCommit(const char *channel)
}
}
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
+ }
+ }
+
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,11 +1508,51 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /* Clear our local cache */
list_free_deep(listenChannels);
listenChannels = NIL;
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
@@ -1565,12 +1920,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1941,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1962,110 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking non-caught up backends that
+ * aren't interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+ QueuePosition advisoryPos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ {
+ /*
+ * Safe to directly update a backend's shared pos if it isn't
+ * currently advancing its position. Otherwise, set
+ * the advisory pos if it's behind our new queue head.
+ */
+ if (!QUEUE_BACKEND_ADVANCING_POS(i))
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ else if (QUEUE_POS_PRECEDES(advisoryPos, queueHeadAfterWrite))
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ }
+ else if (QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ /*
+ * Need to signal, cannot skip over, since we don't
+ * know if the notifications between pos and the queue
+ * head before our write are of interest for this
+ * backend or not.
+ */
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else
+ {
+ /*
+ * The backend is already ahead of the notifications
+ * we wrote. No need to do anything.
+ */
+ Assert(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos));
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1851,6 +2290,7 @@ static void
asyncQueueReadAllNotifications(void)
{
volatile QueuePosition pos;
+ QueuePosition advisoryPos;
QueuePosition head;
Snapshot snapshot;
@@ -1861,19 +2301,34 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up,
+ * and that we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = true;
pos = QUEUE_BACKEND_POS(MyProcNumber);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
+
+ if (QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ /* Advisory position is ahead, use it */
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
@@ -1987,7 +2442,15 @@ asyncQueueReadAllNotifications(void)
{
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (QUEUE_POS_PRECEDES(pos, advisoryPos))
+ QUEUE_BACKEND_POS(MyProcNumber) = advisoryPos;
+ else
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2290,13 +2753,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2774,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2797,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2870,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2882,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2893,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 43fe3bcd593..aa91c47fcb5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-27 01:27 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-27 01:27 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
> On Oct 27, 2025, at 07:24, Joel Jacobson <[email protected]> wrote:
>
> Write-up of changes since v20:
>
> Two new fields have been added to QueueBackendStatus:
> + QueuePosition advisoryPos; /* safe skip-ahead position */
> + bool advancingPos; /* backend is reading the queue */
>
> These are used by SignalBackends and asyncQueueReadAllNotifications to
> handle the ephemeral state of the shared queue position, since we don't
> take a lock while advancing it in asyncQueueReadAllNotifications.
>
> In SignalBackends, we now don't signal laggers in other databases,
> instead we will signal any listening backend that could possibly be
> behind the old queue head, since we can't know if such backend is
> interested in the notifications before the old queue head. Realistic
> benchmarks will be needed to determine if this happens often enough to
> warrant a more complex optimization, such as the ranges idea suggested
> by Arseniy Mukhin.
>
> In SignalBackends, if a backend that is uninterested in our
> notifications, has a shared pos that is at the old queue head, then we
> will check if it's not currently advancing its pos, in which case we can
> set its shared pos to the new queue head, i.e. "direct advance" it,
> otherwise, if it's currently advancing its pos, and if its advisory pos
> is behind our new queue head, we will update its advisory pos to our new
> queue head.
>
> In asyncQueueReadAllNotifications, we start by setting wakeupPending to
> false and advancingPos to true, to indicate that we've woken up, and that
> we will now start advancing the pos. We also check if the pos is behind
> the advisory pos, and if so use the advisory pos to update the pos.
>
> In asyncQueueReadAllNotifications's PG_FINALLY block, we reset
> advancingPos to false, and detect if the advisoryPos was set by
> SignalBackends while we were processing messages on the queue, and if
> so, and if the advisoryPos is ahead of our pos, we update our shared pos
> with the advisoryPos, and otherwise update the shared pos with the new
> pos.
>
> /Joel<0001-optimize_listen_notify-v21.patch><0002-optimize_listen_notify-v21.patch>
I did a quick review of v21, focusing only on the “direct advancement” logic.
In v21, you added advisoryPos and advancingPos, which is the same as my proposed solution. But you missed an important point from mine.
Let’s say listener L1 is advancing slowly because the last notifier pushed a bunch of notifications and L1 is interested in them; say the current QUEUE_HEAD is QH1. So L1 is reading until it reaches QH1.
Now notifier N1 comes. To N1, posBeforeWrite is QH1, and say posAfterWrite is QH2. In this case, as L1 is reading, if N1 knows that L1 will read till QH1, then N1 can still set L1’s advisoryPos to QH2, right? From this perspective, we need to add a new field adviancingTillPos to QueueBackendStatus. (This field was also missing from my proposed patch).
Then notifier N2 comes after N1. To N2, posBeforeWrite is QH2, and say posAfterWrite is QH3. As L1 is still reading and its advisoryPos is QH2, N2 can also advance L1’s advisoryPos to QH3.
Finally, L1 finishes reading and reaches QH1. Now it sees that advisoryPos is QH3, so it can directly bump its pos to QH3.
Do you think this logic is valid?
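The L1/N1/N2 scenario above can be sketched as a small single-threaded model. This is only an illustration under stated assumptions, not code from any version of the patch: queue positions are modeled as plain increasing integers rather than QueuePosition, locking is ignored, and ListenerSketch, readTargetPos (standing in for the proposed extra field), listener_begin_read, notifier_try_skip, and listener_finish_read are all hypothetical names. It also assumes the listener is not interested in the entries the notifier wrote.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: queue positions are plain increasing integers. */
typedef struct ListenerSketch
{
    int  pos;            /* shared position: queue has been read up to here */
    int  advisoryPos;    /* skip-ahead position set by notifiers */
    bool advancingPos;   /* listener is currently reading the queue */
    int  readTargetPos;  /* hypothetical extra field: head seen when reading began */
} ListenerSketch;

/* Listener wakes up and starts reading toward the current queue head. */
static void
listener_begin_read(ListenerSketch *l, int queueHead)
{
    assert(queueHead >= l->pos);
    l->advancingPos = true;
    l->readTargetPos = queueHead;
    l->advisoryPos = l->pos;    /* nothing has been advised yet */
}

/*
 * Notifier side.  Precondition (assumed): the listener is not interested in
 * the entries in [posBeforeWrite, posAfterWrite).  Returns true if the
 * listener can skip those entries without being signaled.
 */
static bool
notifier_try_skip(ListenerSketch *l, int posBeforeWrite, int posAfterWrite)
{
    int covered;

    if (!l->advancingPos)
    {
        /* Idle listener: direct advance only if it sits exactly at the old head. */
        if (l->pos != posBeforeWrite)
            return false;
        l->pos = posAfterWrite;
        return true;
    }

    /* Busy listener: how far it is already known to read or skip. */
    covered = (l->advisoryPos > l->readTargetPos) ? l->advisoryPos
                                                  : l->readTargetPos;
    if (covered != posBeforeWrite)
        return false;           /* there is a gap we know nothing about */
    l->advisoryPos = posAfterWrite;
    return true;
}

/* Listener finishes reading and publishes its new shared position. */
static void
listener_finish_read(ListenerSketch *l)
{
    l->pos = l->readTargetPos;
    if (l->advisoryPos > l->pos)
        l->pos = l->advisoryPos;    /* jump ahead as advised */
    l->advancingPos = false;
}
```

With QH1 = 10, QH2 = 20 and QH3 = 30, L1 begins reading toward 10, N1 and N2 both succeed in moving the advisory position, and L1 ends at 30 without an extra wakeup. Without the read-target field, N1 would have no way to tell that L1 is going to reach QH1.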
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-27 06:18 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-27 06:18 UTC (permalink / raw)
To: pgsql-hackers
On Mon, Oct 27, 2025, at 02:27, Chao Li wrote:
>> On Oct 27, 2025, at 07:24, Joel Jacobson <[email protected]> wrote:
>>
>> Write-up of changes since v20:
>>
>> Two new fields have been added to QueueBackendStatus:
>> + QueuePosition advisoryPos; /* safe skip-ahead position */
>> + bool advancingPos; /* backend is reading the queue */
...
> I did a quick review on v21 only focusing on the “direct advancement” logic.
>
> In v21, you added advisoryPos and advancingPos which is same as my
> proposed solution. But you missed an important point from mine.
>
...
> From this perspective, we need to add a new field
> adviancingTillPos to QueueBackendStatus. (This field was also missing
> from my proposed patch).
I'm doubtful yet another field is worth the added complexity cost.
Before increasing the complexity further, I think we should first
try to simulate somewhat realistic workloads, to see if we actually
have a problem first.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-28 01:02 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-28 01:02 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
> On Oct 27, 2025, at 14:18, Joel Jacobson <[email protected]> wrote:
>
> On Mon, Oct 27, 2025, at 02:27, Chao Li wrote:
>>> On Oct 27, 2025, at 07:24, Joel Jacobson <[email protected]> wrote:
>>>
>>> Write-up of changes since v20:
>>>
>>> Two new fields have been added to QueueBackendStatus:
>>> + QueuePosition advisoryPos; /* safe skip-ahead position */
>>> + bool advancingPos; /* backend is reading the queue */
> ...
>> I did a quick review on v21 only focusing on the “direct advancement” logic.
>>
>> In v21, you added advisoryPos and advancingPos which is same as my
>> proposed solution. But you missed an important point from mine.
>>
> ...
>> From this perspective, we need to add a new field
>> adviancingTillPos to QueueBackendStatus. (This field was also missing
>> from my proposed patch).
>
> I'm doubtful yet another field is worth the added complexity cost.
>
> Before increasing the complexity further, I think we should first
> try to simulate somewhat realistic workloads, to see if we actually
> have a problem first.
>
> /Joel
>
I don’t think that’s extra complexity; IMO, it just ensures that “direct advancement” is fully functional.
But anyway, we should run some load tests to verify every solution and see how much each one really improves things. Do you already have a load test script, or plan to work on one?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-28 06:41 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-28 06:41 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: pgsql-hackers
On Tue, Oct 28, 2025, at 02:02, Chao Li wrote:
>>> From this perspective, we need to add a new field
>>> adviancingTillPos to QueueBackendStatus. (This field was also missing
>>> from my proposed patch).
>>
>> I'm doubtful yet another field is worth the added complexity cost.
>>
>> Before increasing the complexity further, I think we should first
>> try to simulate somewhat realistic workloads, to see if we actually
>> have a problem first.
>>
>> /Joel
>>
>
> I don’t think that’s extra complexity, IMO, that just ensure “direct
> advancement” to be fully functional.
An extra field is by definition extra complexity.
Whether it's worth it depends on how much we would gain from it,
and that's why I'm doubtful.
The extra adviancingTillPos field would only avoid wakeups in some
scenarios. If you study the example given by Arseniy, it's easy to see
why, for it to be complete, we would really need something like the
per-backend list of skip ranges Arseniy suggested,
but that's even more complexity.
I don't think it's too bad for a backend to read through the entire
queue, even if it contains some entries that are not interesting. Once a
backend is awakened, processing is fast; that's not the big cost here.
What really costs is the context switches. But I've been wrong before,
so I could be wrong again of course. This is just based on my intuition.
> But anyway, we should run some load tests to verify every solution to
> see how much they really improve. Do you already have or plan to work
> on a load test script?
Yes, I'm currently working on a combined benchmark / correctness test suite.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-28 06:46 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-28 06:46 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
> On Oct 28, 2025, at 14:41, Joel Jacobson <[email protected]> wrote:
>
> On Tue, Oct 28, 2025, at 02:02, Chao Li wrote:
>>>> From this perspective, we need to add a new field
>>>> adviancingTillPos to QueueBackendStatus. (This field was also missing
>>>> from my proposed patch).
>>>
>>> I'm doubtful yet another field is worth the added complexity cost.
>>>
>>> Before increasing the complexity further, I think we should first
>>> try to simulate somewhat realistic workloads, to see if we actually
>>> have a problem first.
>>>
>>> /Joel
>>>
>>
>> I don’t think that’s extra complexity, IMO, that just ensure “direct
>> advancement” to be fully functional.
>
> An extra field is by definition extra complexity;
> If it's worth it depends on how much we would gain from it,
> that's why I'm doubtful it's worth it.
>
> The extra adviancingTillPos field would only avoid wakeups in some
> scenarios, if you study the example given by Arseniy, it's easy to see
> why we would really need something like a the list of skip ranges
> Arseniy suggested, per backend, for it to be complete,
> but that's even more complexity.
>
> I don't think it's too bad for a backend to read through the entire
> queue, even if it contains some entires that are not interesting, when a
> backend is awaken, processing is fast, that's not the big cost here,
> what really costs is the context switches. But I've been wrong before,
> so could be wrong again of course. This is just based on my intuition.
>
>> But anyway, we should run some load tests to verify every solution to
>> see how much they really improve. Do you already have or plan to work
>> on a load test script?
>
> Yes, I'm currently working on a combined benchmark / correctness test suite.
>
Cool. Then we can run the benchmark and decide.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-28 21:45 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-28 21:45 UTC (permalink / raw)
To: pgsql-hackers
On Tue, Oct 28, 2025, at 07:46, Chao Li wrote:
>>> But anyway, we should run some load tests to verify every solution to
>>> see how much they really improve. Do you already have or plan to work
>>> on a load test script?
>>
>> Yes, I'm currently working on a combined benchmark / correctness test suite.
>>
>
> Cool. Then we can run the benchmark and decide.
I found a concurrency bug in v21 that could cause a missed wakeup when a
backend UNLISTENs its last channel. That path calls
asyncQueueUnregister, and if wakeupPending was already set at that time,
it would never get reset: ProcessIncomingNotify returns early
if (listenChannels == NIL), so we never reach
asyncQueueReadAllNotifications, which is where wakeupPending is cleared.
Fixed by clearing wakeupPending in asyncQueueUnregister:
@@ -1597,6 +1597,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
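As a rough illustration of why the stale flag matters, the sequence can be modeled in a few lines of standalone C. This is a sketch of one reading of the bug, not the patch's code: SlotSketch and the helper functions are hypothetical names, and it assumes that a later LISTEN re-registers the same slot without resetting wakeupPending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one backend's QueueBackendStatus slot. */
typedef struct SlotSketch
{
    bool registered;     /* backend is in the listener list */
    bool wakeupPending;  /* signal sent but not yet processed */
} SlotSketch;

/* Notifier side: returns true if a signal is actually sent. */
static bool
notifier_signal(SlotSketch *slot)
{
    if (!slot->registered || slot->wakeupPending)
        return false;           /* not listening, or a wakeup is already pending */
    slot->wakeupPending = true;
    return true;
}

/* Listener side: models the early return when no channels are listened on. */
static void
listener_process_incoming(SlotSketch *slot, bool hasChannels)
{
    if (!hasChannels)
        return;                 /* early return: the flag is left untouched */
    slot->wakeupPending = false;
}

/* UNLISTEN on the last channel; the flag is cleared only with the fix. */
static void
listener_unregister(SlotSketch *slot, bool clearWakeupPending)
{
    slot->registered = false;
    if (clearWakeupPending)
        slot->wakeupPending = false;
}

/* Returns true if a NOTIFY issued after a re-LISTEN results in a signal. */
static bool
relisten_gets_signal(bool clearWakeupPending)
{
    SlotSketch slot = {true, false};

    if (!notifier_signal(&slot))        /* first NOTIFY sets wakeupPending */
        return false;
    assert(slot.wakeupPending);

    listener_unregister(&slot, clearWakeupPending);
    listener_process_incoming(&slot, false);    /* returns early */

    slot.registered = true;     /* LISTEN again; assumed not to reset the flag */
    return notifier_signal(&slot);
}
```

In this model, relisten_gets_signal(false) returns false (the second signal is suppressed by the stale flag), while relisten_gets_signal(true) returns true.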
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v22.patch (9.3K, 2-0001-optimize_listen_notify-v22.patch)
From 44484a2ad88a5532d2e7f28b2e8ed9095634f084 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v22.patch (36.1K, 3-0002-optimize_listen_notify-v22.patch)
From 86d8c83288647255efec5321f08726922c576e14 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
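
The targeted signaling described above can be sketched outside PostgreSQL roughly as follows. This is a simplified model under stated assumptions, not the patch's implementation: ordinary malloc/realloc stands in for DSA memory, a single struct stands in for a dshash entry, the array is assumed to grow by doubling, and ChannelSketch, channel_init, channel_add_listener, channel_signal, and the wakeup_pending array are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BACKENDS 16
#define INITIAL_LISTENERS_ARRAY_SIZE 4

/* Hypothetical stand-in for one (dboid, channel) entry in the shared map. */
typedef struct ChannelSketch
{
    unsigned dboid;
    char     channel[64];
    int     *listeners;             /* backend numbers listening here */
    int      numListeners;
    int      allocatedListeners;
} ChannelSketch;

/* One wakeupPending flag per backend, as in the per-backend status array. */
static bool wakeup_pending[MAX_BACKENDS];

static void
channel_init(ChannelSketch *e, unsigned dboid, const char *channel)
{
    memset(e, 0, sizeof(*e));
    e->dboid = dboid;
    strncpy(e->channel, channel, sizeof(e->channel) - 1);
}

/* Register a listener, growing the array when it is full (doubling assumed). */
static void
channel_add_listener(ChannelSketch *e, int procno)
{
    int i;

    assert(procno >= 0 && procno < MAX_BACKENDS);
    for (i = 0; i < e->numListeners; i++)
        if (e->listeners[i] == procno)
            return;             /* already registered */

    if (e->numListeners == e->allocatedListeners)
    {
        int  newsize = e->allocatedListeners ? e->allocatedListeners * 2
                                             : INITIAL_LISTENERS_ARRAY_SIZE;
        int *grown = realloc(e->listeners, newsize * sizeof(int));

        if (grown == NULL)
            abort();
        e->listeners = grown;
        e->allocatedListeners = newsize;
    }
    e->listeners[e->numListeners++] = procno;
}

/* Signal only this channel's listeners; returns the number of signals sent. */
static int
channel_signal(ChannelSketch *e)
{
    int i;
    int sent = 0;

    for (i = 0; i < e->numListeners; i++)
    {
        int procno = e->listeners[i];

        if (wakeup_pending[procno])
            continue;           /* previous wakeup not yet processed */
        wakeup_pending[procno] = true;
        sent++;
    }
    return sent;
}
```

Signaling a channel with five listeners sends five signals the first time and none the second time, and backends listening only on other channels are never touched.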
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
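
The per-backend decision can be sketched as below. This is an illustrative model only: positions are plain integers instead of QueuePosition, locking is omitted, and SketchAction and decide_for_backend are hypothetical names. Signaling an uninterested backend that is behind oldHead follows the v21 write-up in this thread and is an assumption about the final behavior.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    SKETCH_SIGNAL,      /* backend must be woken up */
    SKETCH_ADVANCED     /* backend position was moved without a wakeup */
} SketchAction;

/*
 * Decide what to do with one listening backend after our commit wrote
 * the region [oldHead, newHead) of the queue.
 */
static SketchAction
decide_for_backend(int *backendPos, bool listensOnNotifiedChannel,
                   int oldHead, int newHead)
{
    assert(oldHead <= newHead);
    assert(*backendPos <= oldHead);

    if (listensOnNotifiedChannel)
        return SKETCH_SIGNAL;       /* it has something to read */

    if (*backendPos == oldHead)
    {
        /* Only our entries lie ahead of it, and it ignores all of them. */
        *backendPos = newHead;
        return SKETCH_ADVANCED;
    }

    /* Behind oldHead: earlier entries may still be of interest to it. */
    return SKETCH_SIGNAL;
}
```

The key property is that the skip is safe only because no other writer can have inserted entries between oldHead and newHead.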
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 614 ++++++++++++++++--
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 567 insertions(+), 51 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..f145719779d 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannels) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +135,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +145,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +173,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +258,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +285,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ QueuePosition advisoryPos; /* safe skip-ahead position */
+ bool advancingPos; /* backend is reading the queue */
} QueueBackendStatus;
/*
@@ -260,9 +302,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * in the process of doing that themselves. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +331,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +349,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_ADVISORY_POS(i) (asyncQueueControl->backend[i].advisoryPos)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -391,6 +442,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +453,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +475,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -457,6 +528,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Compute the difference between two queue page numbers.
@@ -478,6 +552,80 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +669,18 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVISORY_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_ADVANCING_POS(i) = false;
}
}
@@ -657,6 +811,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -894,6 +1049,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1106,21 @@ PreCommit_Notify(void)
*/
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
@@ -939,12 +1139,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -1135,6 +1343,10 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
MemoryContext oldcontext;
/* Do nothing if we are already listening on this channel */
@@ -1152,21 +1364,84 @@ Exec_ListenCommit(const char *channel)
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
ListCell *q;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ /* Remove from our local cache */
foreach(q, listenChannels)
{
char *lchan = (char *) lfirst(q);
@@ -1179,6 +1454,46 @@ Exec_UnlistenCommit(const char *channel)
}
}
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
+ }
+ }
+
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,11 +1508,51 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ /* Clear our local cache */
list_free_deep(listenChannels);
listenChannels = NIL;
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
@@ -1242,6 +1597,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1921,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1942,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1963,110 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking non-caught up backends that
+ * aren't interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+ QueuePosition advisoryPos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(i);
+
+ if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ {
+ /*
+ * Safe to directly update a backend's shared pos if it isn't
+ * currently advancing its position. Otherwise, set
+ * the advisory pos if it's behind our new queue head.
+ */
+ if (!QUEUE_BACKEND_ADVANCING_POS(i))
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ else if (QUEUE_POS_PRECEDES(advisoryPos, queueHeadAfterWrite))
+ QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
+ }
+ else if (QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ /*
+ * Need to signal, cannot skip over, since we don't
+ * know if the notifications between pos and the queue
+ * head before our write are of interest for this
+ * backend or not.
+ */
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else
+ {
+ /*
+ * The backend is already ahead of the notifications
+ * we wrote. No need to do anything.
+ */
+ Assert(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos));
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1851,6 +2291,7 @@ static void
asyncQueueReadAllNotifications(void)
{
volatile QueuePosition pos;
+ QueuePosition advisoryPos;
QueuePosition head;
Snapshot snapshot;
@@ -1861,19 +2302,34 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up,
+ * and that we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = true;
pos = QUEUE_BACKEND_POS(MyProcNumber);
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
+
+ if (QUEUE_POS_PRECEDES(pos, advisoryPos))
+ {
+ /* Advisory position is ahead, use it */
+ pos = advisoryPos;
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+ }
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
@@ -1987,7 +2443,15 @@ asyncQueueReadAllNotifications(void)
{
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
- QUEUE_BACKEND_POS(MyProcNumber) = pos;
+
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
+ advisoryPos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
+
+ if (QUEUE_POS_PRECEDES(pos, advisoryPos))
+ QUEUE_BACKEND_POS(MyProcNumber) = advisoryPos;
+ else
+ QUEUE_BACKEND_POS(MyProcNumber) = pos;
+
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2290,13 +2754,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2775,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2798,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2871,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2883,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2894,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 43fe3bcd593..aa91c47fcb5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-29 07:05 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-29 07:05 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
> On Oct 29, 2025, at 05:45, Joel Jacobson <[email protected]> wrote:
>
> On Tue, Oct 28, 2025, at 07:46, Chao Li wrote:
>>>> But anyway, we should run some load tests to verify every solution to
>>>> see how much they really improve. Do you already have or plan to work
>>>> on a load test script?
>>>
>>> Yes, I'm currently working on a combined benchmark / correctness test suite.
>>>
>>
>> Cool. Then we can run the benchmark and decide.
>
> I found a concurrency bug in v21 that could cause missed wakeup when a
> backend would UNLISTEN on the last channel, which called
> asyncQueueUnregister, and if wakeupPending was at that time already set,
> then it wouldn't get reset, since in ProcessIncomingNotify we return
> early if (listenChannels == NIL), so we would never clear wakeupPending
> which happens in asyncQueueReadAllNotifications.
>
> Fixed by clearing wakeupPending in asyncQueueUnregister:
>
> @@ -1597,6 +1597,7 @@ asyncQueueUnregister(void)
> /* Mark our entry as invalid */
> QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
> QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
> /* and remove it from the list */
> if (QUEUE_FIRST_LISTENER == MyProcNumber)
> QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
>
> /Joel<0001-optimize_listen_notify-v22.patch><0002-optimize_listen_notify-v22.patch>
I think the current implementation still has a race problem.
Let's say notifier N1 signals listener L1 to read messages.
L1 starts to read: it acquires the lock, gets the range to read, releases the lock, and then reads without holding the lock.
Notifier N2 arrives with nothing that L1 is interested in. N2 now holds the lock, and when it checks "if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))", here comes the race: because N2 holds the lock, L1 cannot acquire it to update its pos, so the condition is not satisfied and direct advancement does not happen.
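The scenario can be traced with a small stand-alone model of the per-listener decision in the direct-advancement loop of SignalBackends(). This is only a sketch under simplifying assumptions, not the patch code: queue positions are plain integers rather than QueuePosition, and the names Listener, Action, decide() and scenario_l1_mid_read() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: a queue position is a single integer (assumption). */
typedef struct Listener
{
	int			pos;			/* shared position, as seen under the lock */
	int			advisoryPos;	/* skip-ahead hint for a busy listener */
	bool		advancing;		/* listener is reading without the lock */
	bool		wakeupPending;	/* signal sent but not yet processed */
} Listener;

typedef enum Action
{
	ACT_NONE, ACT_ADVANCED, ACT_ADVISORY, ACT_SIGNAL
} Action;

/* Mirrors the branch structure of the direct-advancement loop. */
static Action
decide(Listener *l, int headBefore, int headAfter)
{
	assert(headBefore <= headAfter);

	if (l->wakeupPending)
		return ACT_NONE;		/* a wakeup is already in flight */

	if (l->pos == headBefore)
	{
		if (!l->advancing)
		{
			l->pos = headAfter; /* direct advancement */
			return ACT_ADVANCED;
		}
		if (l->advisoryPos < headAfter)
		{
			l->advisoryPos = headAfter;
			return ACT_ADVISORY;
		}
		return ACT_NONE;
	}

	if (l->pos < headBefore)
	{
		l->wakeupPending = true;	/* cannot skip; must signal */
		return ACT_SIGNAL;
	}

	return ACT_NONE;			/* already ahead of what we wrote */
}

/*
 * L1 was woken by N1 at position 10 and is reading up to 20 without the
 * lock.  N2 then writes entries 20..25 while L1's shared pos is still 10.
 */
static Action
scenario_l1_mid_read(void)
{
	Listener	l1 = {.pos = 10, .advisoryPos = 0,
					  .advancing = true, .wakeupPending = false};

	return decide(&l1, 20, 25);
}
```

In this model the stale shared pos (10) precedes queueHeadBeforeWrite (20), so the notifier takes the signal branch instead of advancing L1 directly.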
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-29 10:33 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-10-29 10:33 UTC (permalink / raw)
To: pgsql-hackers
On Wed, Oct 29, 2025, at 08:05, Chao Li wrote:
>> On Oct 29, 2025, at 05:45, Joel Jacobson <[email protected]> wrote:
>> I found a concurrency bug in v21 that could cause missed wakeup when a
>> backend would UNLISTEN on the last channel, which called
>> asyncQueueUnregister, and if wakeupPending was at that time already set,
>> then it wouldn't get reset, since in ProcessIncomingNotify we return
>> early if (listenChannels == NIL), so we would never clear wakeupPending
>> which happens in asyncQueueReadAllNotifications.
>>
>> Fixed by clearing wakeupPending in asyncQueueUnregister:
>>
>> @@ -1597,6 +1597,7 @@ asyncQueueUnregister(void)
>> /* Mark our entry as invalid */
>> QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
>> QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>> /* and remove it from the list */
>> if (QUEUE_FIRST_LISTENER == MyProcNumber)
>> QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
>>
>> /Joel<0001-optimize_listen_notify-v22.patch><0002-optimize_listen_notify-v22.patch>
>
> I think the current implementation still has a race problem.
>
> Let’s say notifier N1 notifies listener’s L1 to read message.
> L1 starts to read: it acquires the look, gets reading range, then
> releases the lock, start performs reading without holding the lock.
> Notifier N2 comes, N2 doesn’t have anything L1 is interested in. N2 now
> holds the look, when it checks "if (QUEUE_POS_EQUAL(pos,
> queueHeadBeforeWrite))”, here comes the race. Because the lock is in
> N2’s hand, L1 cannot get the lock to update its pos, so "if
> (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))” will not be satisfied, so
> direct advancement won’t happen.
I'm not sure I agree that qualifies as a race "problem" per se, since I
think that just sounds like a case where we would do an unnecessary
wakeup, right?
Without more sophisticated data structures (e.g. skip ranges) and
increased code complexity, there will always be cases where we do
unnecessary wakeups. IMO, completely avoiding them need not be a design
goal until we have benchmark data that indicates otherwise.
I think we should iterate by first trying to reason about correctness of
the code, trying to prove/disprove whether a notification could ever end
up not being delivered. The bug I fixed in v22 is an example of such a
case: it would cause a listening backend to never be awakened, since
notifiers would not signal it due to the pending wakeup that was never
cleared.
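To illustrate that class of bug, here is a minimal stand-alone model of the wakeupPending lifecycle. It is a sketch, not the patch code: Slot and the helper functions are hypothetical names, and it assumes that re-registering as a listener does not itself reset the flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-backend slot; only the fields relevant here. */
typedef struct Slot
{
	bool		registered;		/* has a valid entry in the listener list */
	bool		wakeupPending;	/* signal sent but not yet processed */
} Slot;

/* Notifier side: returns true if a signal is actually sent. */
static bool
notifier_signal(Slot *s)
{
	if (!s->registered || s->wakeupPending)
		return false;			/* not listening, or wakeup already in flight */
	s->wakeupPending = true;
	return true;
}

/* UNLISTEN on the last channel; clear_flag selects v21 vs. v22 behavior. */
static void
listener_unregister(Slot *s, bool clear_flag)
{
	assert(s->registered);
	s->registered = false;
	if (clear_flag)
		s->wakeupPending = false;
}

/* A later LISTEN (assumed not to touch wakeupPending). */
static void
listener_register(Slot *s)
{
	s->registered = true;
}

/*
 * Replay: a signal is sent, the backend unlistens before processing it
 * (so the queue read that would clear the flag is skipped), then it
 * listens again.  Returns whether a later NOTIFY still signals it.
 */
static bool
later_notify_signals(bool clear_flag)
{
	Slot		s = {.registered = true, .wakeupPending = false};

	(void) notifier_signal(&s);		/* sets wakeupPending */
	listener_unregister(&s, clear_flag);
	listener_register(&s);
	return notifier_signal(&s);
}
```

With clear_flag false (the v21 behavior as described above), the stale flag suppresses every later signal; with clear_flag true (the v22 fix), the backend is signaled again.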
I wonder if there could be more such serious bugs in the current code. I
will now focus my efforts on trying to answer that question. It would be
really nice if we could find a way to reason formally about this. I've
been looking into the P programming language, which seems suitable for
modeling and verifying this kind of asynchronous concurrency protocol.
I will give it a try.
/Joel
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-10-30 03:22 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-10-30 03:22 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
> On Oct 29, 2025, at 18:33, Joel Jacobson <[email protected]> wrote:
>
> On Wed, Oct 29, 2025, at 08:05, Chao Li wrote:
>>
>> I think the current implementation still has a race problem.
>>
>> Let’s say notifier N1 notifies listener L1 to read a message.
>> L1 starts to read: it acquires the lock, gets the reading range, then
>> releases the lock and starts reading without holding the lock.
>> Notifier N2 comes, N2 doesn’t have anything L1 is interested in. N2 now
>> holds the lock, when it checks "if (QUEUE_POS_EQUAL(pos,
>> queueHeadBeforeWrite))", here comes the race. Because the lock is in
>> N2’s hand, L1 cannot get the lock to update its pos, so "if
>> (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))" will not be satisfied, so
>> direct advancement won’t happen.
>
> I'm not sure I agree that qualifies as a race "problem" per se, since I
> think that just sounds like a case where we would do an unnecessary
> wakeup, right?
>
Why unnecessary? Say there are 100 listeners, L1 through L100. While N2 is checking the state of L1, L100 finishes reading. Ideally L100 would update its pos, so that when N2 reaches L100 it can do direct advancement, right?
But the problem is that we use a single notification lock for all notifiers and listeners. If every backend process had its own notification lock, the race would no longer exist: while N2 is checking the state of L1, it would hold only L1’s lock, so L100 could go ahead and update its pos, and when N2 reaches L100, N2 could do direct advancement.
I once considered proposing a lock for every backend process, but I didn’t, because underneath each lock is an expensive semaphore. With hundreds of backends, adding the same number of semaphores doesn’t seem like a good thing; it would add too much overhead to the system.
> Without more sophisticated data structures (e.g. skip ranges) and
> increased code complexity, there will always be cases where we will
> do unnecessary wakeups, which IMO need not be a design goal to
> completely avoid, until we have benchmark data that indicates otherwise.
>
The other problem I see is that we don’t have a way to evaluate whether “direct advancement” is really effective, such as 1) whether a case that could use “direct advancement” actually gets the advancement applied; 2) in a test model, how many “direct advancements” are applied.
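One way such an evaluation could look is sketched below in C, purely as an illustration. Nothing here exists in the patch: ListenerState, AdvanceStats, tally() and example_tally() are hypothetical, positions are plain integers, and the rule "uninterested, at the old head, and not currently reading" is a simplification of the check discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-listener state; positions are reduced to integers. */
typedef struct ListenerState
{
    int  pos;        /* shared queue position */
    bool advancing;  /* listener is currently reading the queue */
    bool interested; /* listens on a channel notified by this commit */
} ListenerState;

/* Hypothetical counters for one pass over the listeners. */
typedef struct AdvanceStats
{
    int signalled; /* interested listeners, woken up */
    int advanced;  /* uninterested listeners moved to the new head */
    int skipped;   /* uninterested listeners that could not be moved */
} AdvanceStats;

/* Classify every listener for a write covering [headBefore, headAfter). */
static AdvanceStats
tally(ListenerState *ls, size_t n, int headBefore, int headAfter)
{
    AdvanceStats s = {0, 0, 0};

    for (size_t i = 0; i < n; i++)
    {
        if (ls[i].interested)
            s.signalled++;
        else if (ls[i].pos == headBefore && !ls[i].advancing)
        {
            ls[i].pos = headAfter; /* direct advancement */
            s.advanced++;
        }
        else
            s.skipped++;
    }
    return s;
}

/* Four listeners: interested, idle at head, busy reading, lagging behind. */
static AdvanceStats
example_tally(void)
{
    ListenerState ls[] = {
        {10, false, true},
        {10, false, false},
        {10, true, false},
        {7, false, false},
    };

    return tally(ls, 4, 10, 12);
}
```

Comparing advanced against skipped over a benchmark run would give a rough hit rate for the optimization; where such counters would live and how they would be exposed is left open here.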
> I think we should iterate by first trying to reason about correctness of
> the code, trying to prove/disprove if a notification could ever end up
> not being delivered. The bug I fixed in v22 is an example of such a
> case, that would cause a listening backend to never be awakened, since
> notifiers would not signal it due to the pending wake that was not
> cleared.
>
> I wonder if there could be more such serious bugs in the current code. I
> will focus my efforts now trying to answer that question. Would be
> really nice if we could find a way to reason formally about this. I've
> been looking into the P programming language, which seems suitable for
> modeling and verifying these kinds of asynchronous concurrency protocols;
> I will give it a try.
>
I don’t think we need to rush. From my observation, none of the “big” patches get merged quickly anyway. Rather than hurrying to make it “ready,” I think it’s better to take the time to make it “perfect”. I have also spent a lot of time on this patch, and I don’t mind spending more. If you need a hand, I will be happy to help.
TBH, with all the problems I described earlier still on my mind, I just cannot convince myself to let this patch go yet. Sorry about that.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-01 20:41 Arseniy Mukhin <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Arseniy Mukhin @ 2025-11-01 20:41 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
Hi,
On Mon, Oct 27, 2025 at 2:25 AM Joel Jacobson <[email protected]> wrote:
>
> On Sun, Oct 26, 2025, at 07:33, Joel Jacobson wrote:
> > On Sun, Oct 26, 2025, at 05:11, Chao Li wrote:
> >> I figured out a way to resolve the race condition for alt3:
> >>
> >> * add an awakening flag for every listener, this flag is only set by
> >> listeners
> >> * add an advisory pos for every listener, similar to alt1
> >> * if a listener is awake, notify only updates the listener’s advisory
> >> pos; otherwise directly advance its position.
> >> * If a running listener sees its current pos is behind the advisory pos,
> >> then stop reading
> ...
> > This sounds promising, similar to what I had in mind. I was thinking
> > about the idea of using the advisoryPos only when the listening backend
> > is known to be running (which felt like it would need another shared
> > boolean field), and to move its pos field directly only when it's not
> > running, since if it's running we don't need to optimize for context
> > switching, since it's by definition already running.
>
> Write-up of changes since v20:
>
Thank you for working on this! There are a few points about the 'direct
advancement' part:
> Two new fields have been added to QueueBackendStatus:
> + QueuePosition advisoryPos; /* safe skip-ahead position */
> + bool advancingPos; /* backend is reading the queue */
>
> ...
>
> In SignalBackends, if a backend that is uninterested in our
> notifications, has a shared pos that is at the old queue head, then we
> will check if it's not currently advancing its pos, in which case we can
> set its shared pos to the new queue head, i.e. "direct advance" it,
> otherwise, if it's currently advancing its pos, and if its advisory pos
> is behind our new queue head, we will update its advisory pos to our new
> queue head.
>
> In asyncQueueReadAllNotifications, we start by setting wakeupPending to
> false and advancingPos to true, to indicate that we've woken up, and that
> we will now start advancing the pos. We also check if the pos is behind
> the advisory pos, and if so use the advisory pos to update the pos.
>
> In asyncQueueReadAllNotifications's PG_FINALLY block, we reset
> advancingPos to false, and detect if the advisoryPos was set by
> SignalBackends while we were processing messages on the queue, and if
> so, and if the advisoryPos is ahead of our pos, we update our shared pos
> with the advisoryPos, and otherwise update the shared pos with the new
> pos.
Looks like the bug with truncating of the queue is gone, advancingPos
does the trick, great.
Maybe I missed something, but I failed to find an example where we can
take advantage of advisoryPos:
SignalBackends(void)
...
    if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
    {
        ...
        if (!QUEUE_BACKEND_ADVANCING_POS(i))
            QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
        else if (QUEUE_POS_PRECEDES(advisoryPos, queueHeadAfterWrite))
            QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
    }
We update advisoryPos if:
1) listener's advancingPos is true
2) listener's pos equals queueHeadBeforeWrite
(1) means the listener is currently reading. (2) means notifications
that the listener is currently reading belong to us (or it's even
possible that the listener is reading notifications that were added in
the queue after ours). And since the listener is reading, it will only
see the updated advisoryPos in the PG_FINALLY, where listener's pos will
already be >= queueHeadAfterWrite (as a result of reading).
This condition seems to be redundant. I would say it should always be
true, otherwise it would mean that somebody allowed the listener to
skip our notification.
else if (QUEUE_POS_PRECEDES(advisoryPos, queueHeadAfterWrite))
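The argument above can be sketched as a tiny model in C. This is an illustration only, not the patch's code: Backend, signal_backends_model(), finish_reading_model() and advisory_gain() are hypothetical, positions are integers, and the model assumes the premise stated above, namely that a listener which is reading while its shared pos equals queueHeadBeforeWrite will read at least up to queueHeadAfterWrite.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one listening backend. */
typedef struct Backend
{
    int  pos;         /* shared queue position */
    int  advisoryPos; /* skip-ahead hint set by notifiers */
    bool advancing;   /* backend is currently reading the queue */
} Backend;

/* Mirrors the branch structure quoted above, with integer positions. */
static void
signal_backends_model(Backend *b, int headBefore, int headAfter)
{
    if (b->pos == headBefore)
    {
        if (!b->advancing)
            b->pos = headAfter;           /* direct advancement */
        else if (b->advisoryPos < headAfter)
            b->advisoryPos = headAfter;   /* advisory hint only */
    }
}

/*
 * Models the end of reading: the reader has consumed everything up to
 * queueHead (an assumption of this sketch), then applies the hint if it
 * is further ahead. Returns how far the hint moved pos beyond what
 * reading alone achieved.
 */
static int
finish_reading_model(Backend *b, int queueHead)
{
    int readPos = queueHead;

    b->advancing = false;
    b->pos = (b->advisoryPos > readPos) ? b->advisoryPos : readPos;
    return b->pos - readPos;
}

/* A listener that is mid-read at headBefore when the notifier arrives. */
static int
advisory_gain(int headBefore, int headAfter)
{
    Backend b = {headBefore, 0, true};

    signal_backends_model(&b, headBefore, headAfter);
    return finish_reading_model(&b, headAfter);
}
```

Under that premise the hint never moves the position further than reading already did, which is the sense in which the advisory branch contributes nothing. If the premise did not hold, the result could differ, so this is a restatement of the argument and not a proof.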
Best regards,
Arseniy Mukhin
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-05 00:58 Joel Jacobson <[email protected]>
parent: Arseniy Mukhin <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-05 00:58 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; Chao Li <[email protected]>; +Cc: pgsql-hackers
On Sat, Nov 1, 2025, at 21:41, Arseniy Mukhin wrote:
> Thank you for working on this! There are a few points about the 'direct
> advancement' part:
Thanks for reviewing!
> Looks like the bug with truncating of the queue is gone, advancingPos
> does the trick, great.
>
> Maybe I missed something, but I failed to find an example where we can
> take advantage of advisoryPos:
>
> SignalBackends(void)
> ...
>
>     if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
>     {
>         ...
>         if (!QUEUE_BACKEND_ADVANCING_POS(i))
>             QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
>         else if (QUEUE_POS_PRECEDES(advisoryPos, queueHeadAfterWrite))
>             QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
>     }
>
> We update advisoryPos if:
> 1) listener's advancingPos is true
> 2) listener's pos equals queueHeadBeforeWrite
>
> (1) means the listener is currently reading. (2) means notifications
> that the listener is currently reading belong to us (or it's even
> possible that the listener is reading notifications that were added in
> the queue after ours). And since the listener is reading, it will only
> see the updated advisoryPos in the PG_FINALLY, where listener's pos will
> already be >= queueHeadAfterWrite (as a result of reading).
>
>
> This condition seems to be redundant. I would say it should always be
> true, otherwise it would mean that somebody allowed the listener to
> skip our notification.
>
> else if (QUEUE_POS_PRECEDES(advisoryPos, queueHeadAfterWrite))
Ohhh, right! I agree with your reasoning; it's dead code.
This means we can remove the advisoryPos altogether, with
the benefit of making the code even simpler. That's what I've
done in v23, among some other changes.
Changes since v22:
* Optimize listening on thousands of channels per backend by replacing
the listenChannels List with a local hash table, renamed to
listenChannelsHash to avoid confusion.
* Removed advisoryPos, since it was not actually used. We only needed
advancingPos to fix the bug with truncation of the queue. It's possible
that the bottleneck in some workloads is no longer the wakeups, but I'm
not sure yet; I'll do some more benchmarking to get a better
understanding of whether it would be worthwhile to pursue further
optimization.
* Removed asyncQueuePageDiff, since it's no longer used.
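To illustrate why a hash table helps when a backend listens on thousands of channels, here is a small self-contained sketch in C. It is not what the patch does internally (the patch uses a PostgreSQL HTAB); ChannelSet, hash_name(), channel_set_add(), channel_set_contains(), NAMELEN and NBUCKETS are hypothetical stand-ins. Membership is answered by hashing the channel name and probing a few slots, instead of comparing against every entry of a list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NAMELEN  64   /* stands in for NAMEDATALEN */
#define NBUCKETS 1024 /* power of two; fixed size to keep the sketch short */

/* Hypothetical open-addressing set of channel names. */
typedef struct ChannelSet
{
    char names[NBUCKETS][NAMELEN];
    bool used[NBUCKETS];
} ChannelSet;

/* Zero-initialized because it has static storage duration. */
static ChannelSet demo_set;

/* FNV-1a string hash. */
static uint32_t
hash_name(const char *s)
{
    uint32_t h = 2166136261u;

    while (*s)
    {
        h ^= (unsigned char) *s++;
        h *= 16777619u;
    }
    return h;
}

/* Returns true if name is now in the set, false if the table is full. */
static bool
channel_set_add(ChannelSet *set, const char *name)
{
    uint32_t i = hash_name(name) & (NBUCKETS - 1);

    for (int probes = 0; probes < NBUCKETS; probes++)
    {
        if (!set->used[i])
        {
            /* Names longer than NAMELEN - 1 bytes are truncated here. */
            strncpy(set->names[i], name, NAMELEN - 1);
            set->names[i][NAMELEN - 1] = '\0';
            set->used[i] = true;
            return true;
        }
        if (strcmp(set->names[i], name) == 0)
            return true;    /* already present */
        i = (i + 1) & (NBUCKETS - 1);
    }
    return false;
}

/* Average cost is a handful of probes, independent of the set size. */
static bool
channel_set_contains(const ChannelSet *set, const char *name)
{
    uint32_t i = hash_name(name) & (NBUCKETS - 1);

    for (int probes = 0; probes < NBUCKETS; probes++)
    {
        if (!set->used[i])
            return false;   /* empty slot ends the probe sequence */
        if (strcmp(set->names[i], name) == 0)
            return true;
        i = (i + 1) & (NBUCKETS - 1);
    }
    return false;
}
```

With a list, each notification read from the queue costs one string comparison per listened channel; with a set like this, the cost stays roughly constant as the number of channels grows. The sketch omits deletion and resizing.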
Benchmark to demonstrate the effect of the listenChannelsHash:
% gcc -Wall -Wextra -O2 -pthread -I/Users/joel/pg19/include/postgresql/server -I/Users/joel/pg19/include -o async-notify-test-4 async-notify-test-4.c -L/Users/joel/pg19/lib -lpq -pthread -lm
v21:
% ./async-notify-test-4 --listeners 1 --notifiers 1 --channels 1 --extra-channels=10000
10 s: 1286593 sent (130036/s), 437822 received (44121/s)
0.00-0.01ms                      0 (0.0%)   avg: 0.000ms
0.01-0.10ms     #                1 (0.0%)   avg: 0.099ms
0.10-1.00ms     #               39 (0.0%)   avg: 0.576ms
1.00-10.00ms    #              395 (0.1%)   avg: 6.005ms
10.00-100.00ms  #             6186 (1.4%)   avg: 52.880ms
>100.00ms       #########   431214 (98.5%)  avg: 3379.928ms
v22:
% ./async-notify-test-4 --listeners 1 --notifiers 1 --channels 1 --extra-channels=10000
10 s: 879208 sent (87704/s), 879207 received (87703/s)
0.00-0.01ms     #               31 (0.0%)   avg: 0.009ms
0.01-0.10ms     #########   879012 (100.0%) avg: 0.016ms
0.10-1.00ms     #              157 (0.0%)   avg: 0.155ms
1.00-10.00ms    #                7 (0.0%)   avg: 2.913ms
10.00-100.00ms  #                1 (0.0%)   avg: 11.457ms
>100.00ms                        0 (0.0%)   avg: 0.000ms
/Joel
Attachments:
[application/octet-stream] async-notify-test-4.c (14.6K, 2-async-notify-test-4.c)
[application/octet-stream] 0001-optimize_listen_notify-v23.patch (9.3K, 3-0001-optimize_listen_notify-v23.patch)
download | inline diff:
From fb822108149ea01fa25a46f1a4c0ba71f86e1a2b Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v23.patch (42.3K, 4-0002-optimize_listen_notify-v23.patch)
download | inline diff:
From 59eececc519ded498b352fe9b91dfd36f14998c3 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Tue, 14 Oct 2025 08:03:19 +0200
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannelsHash
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannelsHash is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 714 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 614 insertions(+), 104 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..8dac12f8124 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -128,6 +135,7 @@
#include <limits.h>
#include <unistd.h>
#include <signal.h>
+#include <string.h>
#include "access/parallel.h"
#include "access/slru.h"
@@ -137,14 +145,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +173,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +258,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +285,8 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool advancingPos; /* backend is reading the queue */
} QueueBackendStatus;
/*
@@ -260,9 +301,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * in the process of doing that themselves. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +330,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +348,8 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +362,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +440,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +451,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +473,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,7 +497,6 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
@@ -457,16 +525,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -478,6 +539,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +681,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_ADVANCING_POS(i) = false;
}
}
@@ -657,6 +822,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -683,7 +849,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -783,30 +949,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -894,6 +1079,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1136,21 @@ PreCommit_Notify(void)
*/
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
@@ -939,12 +1169,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -957,7 +1195,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1240,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1135,50 +1374,145 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
+ * Add the new channel name to listenChannelsHash.
*
* XXX It is theoretically possible to get an out-of-memory failure here,
* which would be bad because we already committed. For the moment it
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,34 +1527,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1230,7 +1598,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1242,6 +1610,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1934,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1955,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1976,104 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking non-caught up backends that aren't
+ * interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ {
+ /*
+ * Safe to directly update a backend's shared pos if it isn't
+ * currently advancing its position.
+ */
+ if (!QUEUE_BACKEND_ADVANCING_POS(i))
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
+ else if (QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ /*
+ * Need to signal, cannot skip over, since we don't know if
+ * the notifications between pos and the queue head before our
+ * write are of interest for this backend or not.
+ */
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else
+ {
+ /*
+ * The backend is already ahead of the notifications we wrote.
+ * No need to do anything.
+ */
+ Assert(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos));
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2120,10 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1861,19 +2309,26 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = true;
pos = QUEUE_BACKEND_POS(MyProcNumber);
head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
+ LWLockRelease(NotifyQueueLock);
return;
}
+ LWLockRelease(NotifyQueueLock);
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
@@ -1987,7 +2442,10 @@ asyncQueueReadAllNotifications(void)
{
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
+
LWLockRelease(NotifyQueueLock);
}
PG_END_TRY();
@@ -2186,7 +2644,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2290,13 +2748,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2769,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2792,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2865,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2877,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2888,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..b8443725398 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
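A note on ChannelHashPrepareKey in the patch above: the table is set up with dshash_memcmp, so keys are compared bytewise, and the memset before filling the key is what makes two keys for the same (dboid, channel) identical in every byte. The following is a minimal standalone sketch of that idea, not code from the patch. SketchChannelKey, sketch_prepare_key, sketch_keys_equal and SKETCH_NAMEDATALEN are hypothetical names, and plain memcpy stands in for strlcpy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NAMEDATALEN 64	/* stand-in for NAMEDATALEN (assumption) */

/* Hypothetical key layout: database OID plus fixed-size channel name. */
typedef struct SketchChannelKey
{
	uint32_t	dboid;
	char		channel[SKETCH_NAMEDATALEN];
} SketchChannelKey;

/*
 * Zero the whole key first, so bytes after the name's terminator (and any
 * struct padding) are deterministic, then fill in the fields.
 */
static void
sketch_prepare_key(SketchChannelKey *key, uint32_t dboid, const char *channel)
{
	assert(strlen(channel) < SKETCH_NAMEDATALEN);
	memset(key, 0, sizeof(*key));
	key->dboid = dboid;
	memcpy(key->channel, channel, strlen(channel));
}

/* Bytewise comparison, as a memcmp-based hash table would do it. */
static int
sketch_keys_equal(const SketchChannelKey *a, const SketchChannelKey *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Without the memset, two keys built in buffers holding different leftover bytes could compare unequal even though they name the same channel.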
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-05 01:06 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-11-05 01:06 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; Chao Li <[email protected]>; Heikki Linnakangas <[email protected]>; +Cc: pgsql-hackers
On Wed, Nov 5, 2025, at 01:58, Joel Jacobson wrote:
> Changes since v22:
>
> * Optimize listening on thousands of channels per backend by replacing
> the listenChannels List with a local hash table, renamed to
> listenChannelsHash to avoid confusion.
I forgot to say that this is per an idea from Heikki in the other thread [1]:
"The elephant in the room of course is that a lookup in a linked list is
O(n) and it would be very straightforward to replace it with e.g. a hash
table. We should do that irrespective of this bug fix. But I'm inclined
to do it as a separate followup patch."
[1] https://www.postgresql.org/message-id/66213fee-00ff-4952-802d-c06454e521ac%40iki.fi
> * Removed advisoryPos, since it was not actually used. We only needed
> advancingPos to fix the bug with truncation of the queue. It's possible
> that the bottleneck in some workloads is no longer the wakeups, but I'm
> not sure yet; I'll do some more benchmarking to get a better
> understanding of whether it would be worthwhile to pursue further
> optimization.
>
> * Removed asyncQueuePageDiff, since it's no longer used.
/Joel
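
To make the list-versus-hash point concrete, here is a minimal standalone sketch of a set of channel names with hashed lookup. It is only an illustration under simplifying assumptions: it does not use PostgreSQL's dynahash API, it has a fixed capacity and no deletion, and ChannelSet, channel_set_add, channel_set_contains and the two size constants are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CHANNEL_NAME_MAX 64		/* stand-in for NAMEDATALEN (assumption) */
#define CHANNEL_SET_SLOTS 256	/* power of two; fixed capacity for brevity */

typedef struct ChannelSet
{
	char		slots[CHANNEL_SET_SLOTS][CHANNEL_NAME_MAX];
	bool		used[CHANNEL_SET_SLOTS];
	int			nused;
} ChannelSet;

/* FNV-1a hash over the bytes of the channel name. */
static uint32_t
channel_hash(const char *name)
{
	uint32_t	h = 2166136261u;

	for (const unsigned char *p = (const unsigned char *) name; *p; p++)
	{
		h ^= *p;
		h *= 16777619u;
	}
	return h;
}

static void
channel_set_init(ChannelSet *set)
{
	memset(set, 0, sizeof(*set));
}

/* Return the slot holding name, or the empty slot where it would go. */
static int
channel_set_probe(const ChannelSet *set, const char *name)
{
	uint32_t	i = channel_hash(name) & (CHANNEL_SET_SLOTS - 1);

	/* Linear probing; terminates because one slot is always kept empty. */
	while (set->used[i] && strcmp(set->slots[i], name) != 0)
		i = (i + 1) & (CHANNEL_SET_SLOTS - 1);
	return (int) i;
}

/* Expected O(1) lookup, versus walking a list and strcmp'ing every name. */
static bool
channel_set_contains(const ChannelSet *set, const char *name)
{
	return set->used[channel_set_probe(set, name)];
}

/* Returns true if the name was added, false if it was already present. */
static bool
channel_set_add(ChannelSet *set, const char *name)
{
	int			i;

	assert(strlen(name) < CHANNEL_NAME_MAX);
	assert(set->nused < CHANNEL_SET_SLOTS - 1);
	i = channel_set_probe(set, name);
	if (set->used[i])
		return false;
	strcpy(set->slots[i], name);
	set->used[i] = true;
	set->nused++;
	return true;
}
```

Since IsListeningOn runs for every notification found in the queue, the per-lookup cost is what matters for a backend listening on thousands of channels.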
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-05 09:21 Chao Li <[email protected]>
parent: Arseniy Mukhin <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-11-05 09:21 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
> On Nov 2, 2025, at 04:41, Arseniy Mukhin <[email protected]> wrote:
>
> This condition seems to be redundant. I would say it should always be
> true, otherwise it would mean that somebody allowed the listener to
> skip our notification.
Hi Arseniy,
Did you read the example I explained in my previous email?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-05 17:51 Arseniy Mukhin <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Arseniy Mukhin @ 2025-11-05 17:51 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
On Wed, Nov 5, 2025 at 12:22 PM Chao Li <[email protected]> wrote:
>
>
>
> > On Nov 2, 2025, at 04:41, Arseniy Mukhin <[email protected]> wrote:
> >
> > This condition seems to be redundant. I would say it should always be
> > true, otherwise it would mean that somebody allowed the listener to
> > skip our notification.
>
> Hi Arseniy,
>
Hi Chao,
> Did you read the example I explained in my previous email?
>
Yes, I read it. Thank you for the example. It shows the case where we
can fail to apply 'direct advancement'. I think there are several
cases where it can happen. IIUC all such cases are about lagging
listeners that failed to catch up with the head before the notifier
tries to apply 'direct advancement' to them. Your example is about
listeners that finished reading but didn't update their positions
because they were stuck on the lock. I think it is also possible that
the listener is still in the process of reading, or hasn't started
reading at all (for example, the listener backend is in an active
transaction at the moment). In these cases we also can't apply direct
advancement. I don't know whether some of these cases are more important
than others; maybe some of them occur more frequently.
I think the current version of 'direct advancement' will work well for
'sleepy' listeners, but it may not be very efficient for
listeners that get notifications frequently; I don't know. But maybe
that's ok: we have an optimization that sometimes works and has a quite
simple implementation.
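The cases above can be modeled roughly as follows. This is a simplified sketch, not code from the patch: queue positions are plain integers instead of QueuePosition, ListenerState, WakeupDecision and decide_wakeup are made-up names, and the "position equals the head before the write" test is inferred from the branches visible in the v23 SignalBackends diff in this thread rather than quoted from it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum WakeupDecision
{
	SIGNAL_INTERESTED,		/* listens on a notified channel: must be woken */
	DIRECT_ADVANCE,			/* idle at the old head: skip it past our entries */
	LEAVE_ADVANCING,		/* at the old head but currently reading: hands off */
	SIGNAL_LAGGING,			/* behind the old head: cannot be skipped safely */
	ALREADY_AHEAD			/* already past the old head: nothing to do */
} WakeupDecision;

typedef struct ListenerState
{
	uint64_t	pos;		/* next queue offset this listener will read */
	bool		advancing;	/* true while the listener is reading the queue */
	bool		interested;	/* listens on a channel we just notified */
} ListenerState;

/*
 * Decide what a notifier does with one listener after writing entries
 * covering [head_before, head_after).  Only DIRECT_ADVANCE changes the
 * listener's position.
 */
static WakeupDecision
decide_wakeup(ListenerState *l, uint64_t head_before, uint64_t head_after)
{
	assert(head_before <= head_after);

	if (l->interested)
		return SIGNAL_INTERESTED;

	if (l->pos == head_before)
	{
		if (l->advancing)
			return LEAVE_ADVANCING;
		l->pos = head_after;	/* the wakeup is avoided */
		return DIRECT_ADVANCE;
	}

	if (l->pos < head_before)
		return SIGNAL_LAGGING;	/* may still have unread entries it cares about */

	return ALREADY_AHEAD;
}
```

In this model, a listener that finished reading but has not yet published its new position, one that is still reading, and one that has not started reading all show up as "pos < head_before" (or as advancing), so none of them can be directly advanced.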
Best regards,
Arseniy Mukhin
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-05 23:21 Chao Li <[email protected]>
parent: Arseniy Mukhin <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-11-05 23:21 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: Joel Jacobson <[email protected]>; pgsql-hackers
> On Nov 6, 2025, at 01:51, Arseniy Mukhin <[email protected]> wrote:
>
> On Wed, Nov 5, 2025 at 12:22 PM Chao Li <[email protected]> wrote:
>>
>>
>>
>>> On Nov 2, 2025, at 04:41, Arseniy Mukhin <[email protected]> wrote:
>>>
>>> This condition seems to be redundant. I would say it should always be
>>> true, otherwise it would mean that somebody allowed the listener to
>>> skip our notification.
>>
>> Hi Arseniy,
>>
>
> Hi Chao,
>
>> Did you read the example I explained in my previous email?
>>
>
> Yes, I read it. Thank you for the example. It shows the case where we
> can fail to apply 'direct advancement'. I think there are several
> cases where it can happen. IIUC all such cases are about lagging
> listeners that failed to catch up with the head before the notifier
> tries to apply 'direct advancement' to them. Your example is about
> listeners that finished reading but didn't update their positions
> because they were stuck on the lock. I think it is also possible that
> the listener can be in the process of reading or even didn't start
> reading at all (for example listener backend is in the active
> transaction at the moment). In these cases we also can't apply direct
> advancement. Don't know if some of these examples are more important,
> maybe some of them can be met more frequently.
Cool, you got my idea. What I was thinking is to handle both sleeping listeners and “slow” listeners. In my view, that shouldn’t be too complicated.
>
> I think the current version of 'direct advancement' will work good for
> 'sleepy' listeners, but probably can be not very efficient for
> listeners that get notifications frequently, don't know. But maybe
> it's ok, we have optimization that sometimes works and have a quite
> simple implementation.
>
That’s what we don’t know. We currently lack a performance test for evaluating how effectively “direct advancement” helps if it only handles sleeping listeners. So what I was suggesting is that we should first create some tests, and maybe also add a few more statistics, so that we can evaluate different solutions. If a simple implementation that only handles sleeping listeners performs well enough, of course we can take it; otherwise we may need to pursue a better solution.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-06 08:33 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-06 08:33 UTC (permalink / raw)
To: Chao Li <[email protected]>; Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Thu, Nov 6, 2025, at 00:21, Chao Li wrote:
> That’s what we don’t know. We now lack a performance test for
> evaluating how “direct advancement” efficiently helps if it only
> handles sleeping listeners. So what I was suggesting is that we should
> first create some tests, maybe also add a few more statistics, so that
> we can evaluate different solutions. If a simple implementation that
> only handles sleeping listeners would have performed good enough, of
> course we can take it; otherwise we may need to either pursue a better
> solution.
Just for the sake of evaluating this patch, I've added instrumentation
of async.c that increments counters for the different branches in
asyncQueueReadAllNotifications and SignalBackends. (I'm just using
atomics without any locking, but that should be fine since this is just
statistics.)
pg_get_async_wakeup_stats-patch.txt adds the SQL-callable
catalog functions pg_reset_async_wakeup_stats() and
pg_get_async_wakeup_stats(). They are only for evaluation and should
not be included in the patch. It can be applied on top of the v23 patch.
Below is just an example of how to compile, with an arbitrary mix of
command line options. I've tried a lot of combinations, and we seem to
be holding up fine in all cases I've tried.
async-notify-test-5.c will detect whether the pg_*_async_wakeup_stats()
functions exist, and only show the extra histograms if so.
% gcc -Wall -Wextra -O2 -pthread -I/Users/joel/pg19/include/postgresql/server -I/Users/joel/pg19/include -o async-notify-test-5 async-notify-test-5.c -L/Users/joel/pg19/lib -lpq -pthread -lm
% ./async-notify-test-5 --listeners 10 --notifiers 10 --channels 10 --sleep 0.1 --sleep-exp 2.0 --batch 10
10 s: 38100 sent (3690/s), 381000 received (36900/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 23 (0.0%) avg: 0.092ms
0.10-1.00ms ####### 298002 (78.2%) avg: 0.563ms
1.00-10.00ms ## 82975 (21.8%) avg: 1.506ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms
>100.00ms 0 (0.0%) avg: 0.000ms
asyncQueueReadAllNotifications Statistics:
necessary_wakeups ######## 35469 (88.2%)
unnecessary_wakeups # 4762 (11.8%)
SignalBackends Statistics:
signaled_needed # 34983 (9.5%)
avoided_wakeups ######## 325874 (88.9%)
already_advancing # 3 (0.0%)
signaled_uncertain # 5347 (1.5%)
already_ahead # 375 (0.1%)
Thoughts on how to interpret results:
- Is the notification latency distribution good enough for the given
workload? Naturally, if the load is too high, we cannot expect to
ever achieve sub-millisecond latency anyway, so it's a judgement call.
- Even if "unnecessary_wakeups" is high relative to
"necessary_wakeups", it's not necessarily a problem if the latency
distribution is still good enough. We should also think about the
ratio between "unnecessary_wakeups" and "avoided_wakeups", since even
if "unnecessary_wakeups" is high in absolute numbers, if
"avoided_wakeups" is orders of magnitude larger, that means the cost of
context switching has already been dramatically reduced. I think there
is always a risk when optimizing of forgetting what problem one was
trying to solve initially, usually a bottleneck. When the bottleneck is
gone and is somewhere else instead, then the efforts should IMO usually
be spent elsewhere, especially if more optimizations would need a
significant increase in code complexity.
- It's "signaled_uncertain" that primarily contributes to
"unnecessary_wakeups".
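The two ratios discussed above can be computed from the counters like this. This is only an illustrative sketch: WakeupRatios and wakeup_ratios are hypothetical names, not part of the patch or of the test tool.

```c
#include <assert.h>
#include <stdint.h>

typedef struct WakeupRatios
{
	double		unnecessary_share;			/* unnecessary / (necessary + unnecessary) */
	double		avoided_per_unnecessary;	/* avoided / unnecessary */
} WakeupRatios;

/* Both ratios are reported as 0.0 when their denominator is zero */
static WakeupRatios
wakeup_ratios(uint64_t necessary, uint64_t unnecessary, uint64_t avoided)
{
	WakeupRatios r = {0.0, 0.0};
	uint64_t	total = necessary + unnecessary;

	if (total > 0)
		r.unnecessary_share = (double) unnecessary / (double) total;
	if (unnecessary > 0)
		r.avoided_per_unnecessary = (double) avoided / (double) unnecessary;
	return r;
}
```

With the figures from the run above (35469 necessary, 4762 unnecessary, 325874 avoided), the unnecessary share comes out at about 11.8%, matching the histogram, and there are roughly 68 avoided wakeups for every unnecessary one.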
/Joel
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 8dac12f8124..7e8e0b14f42 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -137,6 +137,7 @@
#include <signal.h>
#include <string.h>
+#include "access/htup_details.h"
#include "access/parallel.h"
#include "access/slru.h"
#include "access/transam.h"
@@ -332,6 +333,13 @@ typedef struct AsyncQueueControl
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
dsa_handle channelHashDSA;
dshash_table_handle channelHashDSH;
+ pg_atomic_uint32 signaledNeeded; /* listening to some of the channels; signal needed */
+ pg_atomic_uint32 avoidedWakeups; /* directly advanced */
+ pg_atomic_uint32 alreadyAdvancing; /* already advancing its position */
+ pg_atomic_uint32 signaledUncertain; /* signaled due to uncertain need */
+ pg_atomic_uint32 alreadyAhead; /* already ahead, no action needed */
+ pg_atomic_uint32 necessaryWakeups; /* wakeups where at least one message was interesting */
+ pg_atomic_uint32 unnecessaryWakeups; /* wakeups where no messages were interesting */
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
@@ -517,7 +525,8 @@ static void asyncQueueReadAllNotifications(void);
static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
QueuePosition stop,
char *page_buffer,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool *interested);
static void asyncQueueAdvanceTail(void);
static void ProcessIncomingNotify(bool flush);
static bool AsyncExistsPendingNotify(Notification *n);
@@ -683,6 +692,13 @@ AsyncShmemInit(void)
asyncQueueControl->lastQueueFillWarn = 0;
asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+ pg_atomic_init_u32(&asyncQueueControl->signaledNeeded, 0);
+ pg_atomic_init_u32(&asyncQueueControl->avoidedWakeups, 0);
+ pg_atomic_init_u32(&asyncQueueControl->alreadyAdvancing, 0);
+ pg_atomic_init_u32(&asyncQueueControl->signaledUncertain, 0);
+ pg_atomic_init_u32(&asyncQueueControl->alreadyAhead, 0);
+ pg_atomic_init_u32(&asyncQueueControl->necessaryWakeups, 0);
+ pg_atomic_init_u32(&asyncQueueControl->unnecessaryWakeups, 0);
for (int i = 0; i < MaxBackends; i++)
{
@@ -997,6 +1013,81 @@ pg_listening_channels(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * SQL function: return statistics about NOTIFY wakeups
+ *
+ * This function returns a single row with:
+ * - necessary_wakeups: wakeups where at least one message was interesting
+ * - unnecessary_wakeups: wakeups where no messages were interesting
+ * - direct_advancements_success: directly advanced
+ * - already_advancing: already advancing its position
+ * - signaled_uncertain: signaled due to uncertain need
+ * - already_ahead: already ahead, no action needed
+ */
+Datum
+pg_get_async_wakeup_stats(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[7];
+ bool nulls[7];
+ HeapTuple tuple;
+ uint32 signaled_needed;
+ uint32 direct_advancements_success;
+ uint32 already_advancing;
+ uint32 signaled_uncertain;
+ uint32 already_ahead;
+ uint32 necessary_wakeups;
+ uint32 unnecessary_wakeups;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function returning record called in context that cannot accept type record")));
+
+ /* Read the atomic counters */
+ signaled_needed = pg_atomic_read_u32(&asyncQueueControl->signaledNeeded);
+ direct_advancements_success = pg_atomic_read_u32(&asyncQueueControl->avoidedWakeups);
+ already_advancing = pg_atomic_read_u32(&asyncQueueControl->alreadyAdvancing);
+ signaled_uncertain = pg_atomic_read_u32(&asyncQueueControl->signaledUncertain);
+ already_ahead = pg_atomic_read_u32(&asyncQueueControl->alreadyAhead);
+ necessary_wakeups = pg_atomic_read_u32(&asyncQueueControl->necessaryWakeups);
+ unnecessary_wakeups = pg_atomic_read_u32(&asyncQueueControl->unnecessaryWakeups);
+
+ /* Fill in the values */
+ memset(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((int64) signaled_needed);
+ values[1] = Int64GetDatum((int64) direct_advancements_success);
+ values[2] = Int64GetDatum((int64) already_advancing);
+ values[3] = Int64GetDatum((int64) signaled_uncertain);
+ values[4] = Int64GetDatum((int64) already_ahead);
+ values[5] = Int64GetDatum((int64) necessary_wakeups);
+ values[6] = Int64GetDatum((int64) unnecessary_wakeups);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
+}
+
+/*
+ * SQL function: reset NOTIFY wakeup statistics
+ *
+ * This function resets all the async wakeup counters to zero.
+ */
+Datum
+pg_reset_async_wakeup_stats(PG_FUNCTION_ARGS)
+{
+ /* Reset all the atomic counters to zero */
+ pg_atomic_write_u32(&asyncQueueControl->signaledNeeded, 0);
+ pg_atomic_write_u32(&asyncQueueControl->avoidedWakeups, 0);
+ pg_atomic_write_u32(&asyncQueueControl->alreadyAdvancing, 0);
+ pg_atomic_write_u32(&asyncQueueControl->signaledUncertain, 0);
+ pg_atomic_write_u32(&asyncQueueControl->alreadyAhead, 0);
+ pg_atomic_write_u32(&asyncQueueControl->necessaryWakeups, 0);
+ pg_atomic_write_u32(&asyncQueueControl->unnecessaryWakeups, 0);
+
+ PG_RETURN_VOID();
+}
+
/*
* Async_UnlistenOnExit
*
@@ -2014,6 +2105,7 @@ SignalBackends(void)
Assert(pid != InvalidPid);
+ pg_atomic_fetch_add_u32(&asyncQueueControl->signaledNeeded, 1);
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
@@ -2049,7 +2141,14 @@ SignalBackends(void)
* currently advancing its position.
*/
if (!QUEUE_BACKEND_ADVANCING_POS(i))
+ {
QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ pg_atomic_fetch_add_u32(&asyncQueueControl->avoidedWakeups, 1);
+ }
+ else
+ {
+ pg_atomic_fetch_add_u32(&asyncQueueControl->alreadyAdvancing, 1);
+ }
}
else if (QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
{
@@ -2060,6 +2159,7 @@ SignalBackends(void)
*/
Assert(pid != InvalidPid);
+ pg_atomic_fetch_add_u32(&asyncQueueControl->signaledUncertain, 1);
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
@@ -2071,6 +2171,7 @@ SignalBackends(void)
* The backend is already ahead of the notifications we wrote.
* No need to do anything.
*/
+ pg_atomic_fetch_add_u32(&asyncQueueControl->alreadyAhead, 1);
Assert(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos));
}
}
@@ -2301,6 +2402,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool interested = false;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2435,7 +2537,8 @@ asyncQueueReadAllNotifications(void)
*/
reachedStop = asyncQueueProcessPageEntries(&pos, head,
page_buffer.buf,
- snapshot);
+ snapshot,
+ &interested);
} while (!reachedStop);
}
PG_FINALLY();
@@ -2450,6 +2553,11 @@ asyncQueueReadAllNotifications(void)
}
PG_END_TRY();
+ if (interested)
+ pg_atomic_fetch_add_u32(&asyncQueueControl->necessaryWakeups, 1);
+ else
+ pg_atomic_fetch_add_u32(&asyncQueueControl->unnecessaryWakeups, 1);
+
/* Done with snapshot */
UnregisterSnapshot(snapshot);
}
@@ -2474,7 +2582,8 @@ static bool
asyncQueueProcessPageEntries(volatile QueuePosition *current,
QueuePosition stop,
char *page_buffer,
- Snapshot snapshot)
+ Snapshot snapshot,
+ bool *interested)
{
bool reachedStop = false;
bool reachedEndOfPage;
@@ -2535,6 +2644,9 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
char *payload = qe->data + strlen(channel) + 1;
NotifyMyFrontEnd(channel, payload, qe->srcPid);
+
+ /* Mark that we were interested in at least one message */
+ *interested = true;
}
}
else
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9121a382f76..0bbd7db39c7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8571,7 +8571,18 @@
proname => 'pg_notification_queue_usage', provolatile => 'v',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_notification_queue_usage' },
-
+{ oid => '9315',
+ descr => 'get statistics about NOTIFY wakeups',
+ proname => 'pg_get_async_wakeup_stats', provolatile => 'v',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,int8,int8,int8,int8,int8,int8}', proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{signaled_needed,avoided_wakeups,already_advancing,signaled_uncertain,already_ahead,necessary_wakeups,unnecessary_wakeups}',
+ prosrc => 'pg_get_async_wakeup_stats' },
+{ oid => '9316',
+ descr => 'reset statistics about NOTIFY wakeups',
+ proname => 'pg_reset_async_wakeup_stats', provolatile => 'v',
+ proparallel => 'r', prorettype => 'void', proargtypes => '',
+ prosrc => 'pg_reset_async_wakeup_stats' },
# shared memory usage
{ oid => '5052', descr => 'allocations from the main shared memory segment',
proname => 'pg_get_shmem_allocations', prorows => '50', proretset => 't',
Attachments:
[application/octet-stream] async-notify-test-5.c (24.9K, 2-async-notify-test-5.c)
[text/plain] pg_get_async_wakeup_stats-patch.txt (9.2K, 3-pg_get_async_wakeup_stats-patch.txt)
^ permalink raw reply [nested|flat] 120+ messages in thread
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-07 18:59 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-07 18:59 UTC (permalink / raw)
To: pgsql-hackers
On Thu, Nov 6, 2025, at 09:33, Joel Jacobson wrote:
> On Thu, Nov 6, 2025, at 00:21, Chao Li wrote:
>> That’s what we don’t know. We now lack a performance test for
>> evaluating how “direct advancement” efficiently helps if it only
>> handles sleeping listeners. So what I was suggesting is that we should
>> first create some tests, maybe also add a few more statistics, so that
>> we can evaluate different solutions. If a simple implementation that
>> only handles sleeping listeners would have performed good enough, of
>> course we can take it; otherwise we may need to either pursue a better
>> solution.
Changes since v23:
* The advancingPos flag has been split into two fields:
bool isAdvancing: indicates if a backend is currently advancing
QueuePosition advancingPos: the target position the backend will advance to
* The logic in SignalBackends has been reworked and simplified,
thanks to the new isAdvancing and advancingPos fields.
I now think it's finally easy to reason about why each branch
in SignalBackends must be correct.
I've also attached 0003-optimize_listen_notify-v24.txt that adds
instrumentation that can then be used together with the
benchmark/correctness tool pg_async_notify_test-v24.c.
This has been very helpful to me, to develop an intuition for its
concurrency behavior. I hope it can help others as well.
The 0003 patch is only for testing and not part of the patchset,
hence the .txt.
% gcc -Wall -Wextra -O2 -pthread \
-I$(pg_config --includedir-server) \
-I$(pg_config --includedir) \
-L$(pg_config --libdir) \
-o pg_async_notify_test-v24 pg_async_notify_test-v24.c \
-lpq -pthread -lm
% ./pg_async_notify_test-v24 --listeners 10 --notifiers 10 --channels 10 --duration 10
10 s: 301622 sent (29409/s), 3015940 received (294185/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 25 (0.0%) avg: 0.074ms
0.10-1.00ms # 289 (0.0%) avg: 0.461ms
1.00-10.00ms ######### 2824257 (93.6%) avg: 4.193ms
10.00-100.00ms # 189923 (6.3%) avg: 24.893ms
>100.00ms # 1453 (0.0%) avg: 109.662ms
SignalBackends Statistics:
signaled_targeted # 1251569 (9.4%)
advancing_behind # 108505 (0.8%)
advancing_ahead # 207494 (1.6%)
idle_behind # 408641 (3.1%)
avoided_wakeups ####### 10589740 (79.6%)
already_ahead # 744087 (5.6%)
asyncQueueReadAllNotifications Statistics:
necessary_wakeups ######## 1525695 (86.3%)
unnecessary_wakeups # 242106 (13.7%)
/Joel
From 9f2da84a9c58df155961481aa0802ffb95460811 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Fri, 7 Nov 2025 19:24:37 +0100
Subject: [PATCH 3/3] Add instrumentation for analyzing LISTEN/NOTIFY wakeup
behavior
This commit adds a set of atomic counters and SQL-accessible functions
to help understand how SignalBackends and asyncQueueReadAllNotifications
interact under various workloads. The instrumentation is intended only
for development and performance analysis and will not be included in the
final patch.
Specifically:
* Added several pg_atomic_uint32 counters in AsyncQueueControl tracking
wakeup categories such as signaled backends, advancing vs. idle
positions, direct advancements, and unnecessary wakeups.
* Incremented these counters in SignalBackends() and
asyncQueueReadAllNotifications() to classify wakeup decisions.
* Added SQL functions pg_get_async_wakeup_stats() and
pg_reset_async_wakeup_stats() for reading and resetting these counters
during test runs.
* Modified asyncQueueProcessPageEntries() to report whether any
notifications were of interest to the backend, allowing
differentiation between necessary and unnecessary wakeups.
This is purely diagnostic code to help reason about backend wakeup
patterns and validate assumptions during optimization. It introduces no
user-visible or behavioral changes and is not intended for commit to the
main tree.
---
src/backend/commands/async.c | 171 +++++++++++++++++++++++++++-----
src/include/catalog/pg_proc.dat | 13 ++-
2 files changed, 159 insertions(+), 25 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 9f7b8a3324a..2ec6b6b9e2b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -136,6 +136,7 @@
#include <unistd.h>
#include <signal.h>
+#include "access/htup_details.h"
#include "access/parallel.h"
#include "access/slru.h"
#include "access/transam.h"
@@ -332,6 +333,14 @@ typedef struct AsyncQueueControl
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
dsa_handle channelHashDSA;
dshash_table_handle channelHashDSH;
+ pg_atomic_uint32 signaledTargeted; /* listening to some of the channels; signal needed */
+ pg_atomic_uint32 advancingBehind; /* advancing, position behind queue head before write */
+ pg_atomic_uint32 advancingAhead; /* advancing, position ahead of queue head after write */
+ pg_atomic_uint32 idleBehind; /* stationary at a position behind queue head before write */
+ pg_atomic_uint32 avoidedWakeups; /* directly advanced */
+ pg_atomic_uint32 alreadyAhead; /* already caught up or ahead, no action needed */
+ pg_atomic_uint32 necessaryWakeups; /* wakeups where at least one message was interesting */
+ pg_atomic_uint32 unnecessaryWakeups; /* wakeups where we had no interest in any of the messages */
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
@@ -518,7 +527,8 @@ static void asyncQueueReadAllNotifications(void);
static bool asyncQueueProcessPageEntries(volatile QueuePosition *current,
QueuePosition stop,
char *page_buffer,
- Snapshot snapshot);
+ Snapshot snapshot,
+ bool *interested);
static void asyncQueueAdvanceTail(void);
static void ProcessIncomingNotify(bool flush);
static bool AsyncExistsPendingNotify(Notification *n);
@@ -684,6 +694,15 @@ AsyncShmemInit(void)
asyncQueueControl->lastQueueFillWarn = 0;
asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
+ pg_atomic_init_u32(&asyncQueueControl->signaledTargeted, 0);
+ pg_atomic_init_u32(&asyncQueueControl->advancingBehind, 0);
+ pg_atomic_init_u32(&asyncQueueControl->advancingAhead, 0);
+ pg_atomic_init_u32(&asyncQueueControl->idleBehind, 0);
+ pg_atomic_init_u32(&asyncQueueControl->avoidedWakeups, 0);
+ pg_atomic_init_u32(&asyncQueueControl->alreadyAhead, 0);
+ pg_atomic_init_u32(&asyncQueueControl->necessaryWakeups, 0);
+ pg_atomic_init_u32(&asyncQueueControl->unnecessaryWakeups, 0);
+
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
@@ -998,6 +1017,85 @@ pg_listening_channels(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funcctx);
}
+/*
+ * SQL function: return statistics about NOTIFY wakeups
+ *
+ * This function returns a single row with:
+ * - signaled_targeted: signaled because it listens on a notified channel
+ * - advancing_behind, advancing_ahead: advancing, by target position
+ * - idle_behind: idle behind the old queue head, signaled
+ * - avoided_wakeups: directly advanced, no signal sent
+ * - already_ahead: already caught up or ahead, no action needed
+ * - necessary_wakeups, unnecessary_wakeups: wakeups by outcome
+ */
+Datum
+pg_get_async_wakeup_stats(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[8];
+ bool nulls[8];
+ HeapTuple tuple;
+ uint32 signaled_targeted;
+ uint32 advancing_behind;
+ uint32 advancing_ahead;
+ uint32 idle_behind;
+ uint32 avoided_wakeups;
+ uint32 already_ahead;
+ uint32 necessary_wakeups;
+ uint32 unnecessary_wakeups;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("function returning record called in context that cannot accept type record")));
+
+ /* Read the atomic counters */
+ signaled_targeted = pg_atomic_read_u32(&asyncQueueControl->signaledTargeted);
+ advancing_behind = pg_atomic_read_u32(&asyncQueueControl->advancingBehind);
+ advancing_ahead = pg_atomic_read_u32(&asyncQueueControl->advancingAhead);
+ idle_behind = pg_atomic_read_u32(&asyncQueueControl->idleBehind);
+ avoided_wakeups = pg_atomic_read_u32(&asyncQueueControl->avoidedWakeups);
+ already_ahead = pg_atomic_read_u32(&asyncQueueControl->alreadyAhead);
+ necessary_wakeups = pg_atomic_read_u32(&asyncQueueControl->necessaryWakeups);
+ unnecessary_wakeups = pg_atomic_read_u32(&asyncQueueControl->unnecessaryWakeups);
+
+ /* Fill in the values */
+ memset(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((int64) signaled_targeted);
+ values[1] = Int64GetDatum((int64) advancing_behind);
+ values[2] = Int64GetDatum((int64) advancing_ahead);
+ values[3] = Int64GetDatum((int64) idle_behind);
+ values[4] = Int64GetDatum((int64) avoided_wakeups);
+ values[5] = Int64GetDatum((int64) already_ahead);
+ values[6] = Int64GetDatum((int64) necessary_wakeups);
+ values[7] = Int64GetDatum((int64) unnecessary_wakeups);
+
+ tuple = heap_form_tuple(tupdesc, values, nulls);
+ PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
+}
+
+/*
+ * SQL function: reset NOTIFY wakeup statistics
+ *
+ * This function resets all the async wakeup counters to zero.
+ */
+Datum
+pg_reset_async_wakeup_stats(PG_FUNCTION_ARGS)
+{
+ /* Reset all the atomic counters to zero */
+ pg_atomic_write_u32(&asyncQueueControl->signaledTargeted, 0);
+ pg_atomic_write_u32(&asyncQueueControl->advancingBehind, 0);
+ pg_atomic_write_u32(&asyncQueueControl->advancingAhead, 0);
+ pg_atomic_write_u32(&asyncQueueControl->idleBehind, 0);
+ pg_atomic_write_u32(&asyncQueueControl->avoidedWakeups, 0);
+ pg_atomic_write_u32(&asyncQueueControl->alreadyAhead, 0);
+ pg_atomic_write_u32(&asyncQueueControl->necessaryWakeups, 0);
+ pg_atomic_write_u32(&asyncQueueControl->unnecessaryWakeups, 0);
+
+ PG_RETURN_VOID();
+}
+
/*
* Async_UnlistenOnExit
*
@@ -2016,6 +2114,7 @@ SignalBackends(void)
Assert(pid != InvalidPid);
+ pg_atomic_fetch_add_u32(&asyncQueueControl->signaledTargeted, 1);
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
@@ -2037,6 +2136,7 @@ SignalBackends(void)
{
QueuePosition pos;
int32 pid;
+ bool need_signal = false;
if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
@@ -2044,21 +2144,39 @@ SignalBackends(void)
pos = QUEUE_BACKEND_POS(i);
pid = QUEUE_BACKEND_PID(i);
- /*
- * We need to signal advancing listening backends that would get
- * stuck at a position before the new queue head. We also need to
- * signal listening backends that are idle at a position before
- * the old queue head since they could be interested in the
- * messages in-between.
- *
- * Listening backends that are not advancing and are stationary at
- * a position somewhere in the range we just wrote, can safely be
- * direct advanced to the new queue head, since we know that they
- * are not interested in our messages.
- */
- if (QUEUE_BACKEND_IS_ADVANCING(i) ?
- QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
- QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ if (QUEUE_BACKEND_IS_ADVANCING(i))
+ {
+ if (QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite))
+ {
+ need_signal = true;
+ pg_atomic_fetch_add_u32(&asyncQueueControl->advancingBehind, 1);
+ }
+ else
+ {
+ pg_atomic_fetch_add_u32(&asyncQueueControl->advancingAhead, 1);
+ }
+ }
+ else
+ {
+ if (QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ need_signal = true;
+ pg_atomic_fetch_add_u32(&asyncQueueControl->idleBehind, 1);
+ }
+ else if (QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ pg_atomic_fetch_add_u32(&asyncQueueControl->avoidedWakeups, 1);
+ }
+ else
+ {
+ Assert(QUEUE_POS_EQUAL(pos, queueHeadAfterWrite) ||
+ QUEUE_POS_PRECEDES(queueHeadAfterWrite, pos));
+ pg_atomic_fetch_add_u32(&asyncQueueControl->alreadyAhead, 1);
+ }
+ }
+
+ if (need_signal)
{
Assert(pid != InvalidPid);
@@ -2067,13 +2185,7 @@ SignalBackends(void)
procnos[count] = i;
count++;
}
- else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
- QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
- {
- Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
- QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
- }
}
}
LWLockRelease(NotifyQueueLock);
@@ -2302,6 +2414,7 @@ asyncQueueReadAllNotifications(void)
volatile QueuePosition pos;
QueuePosition head;
Snapshot snapshot;
+ bool interested = false;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -2438,7 +2551,8 @@ asyncQueueReadAllNotifications(void)
*/
reachedStop = asyncQueueProcessPageEntries(&pos, head,
page_buffer.buf,
- snapshot);
+ snapshot,
+ &interested);
} while (!reachedStop);
}
PG_FINALLY();
@@ -2452,6 +2566,11 @@ asyncQueueReadAllNotifications(void)
}
PG_END_TRY();
+ if (interested)
+ pg_atomic_fetch_add_u32(&asyncQueueControl->necessaryWakeups, 1);
+ else
+ pg_atomic_fetch_add_u32(&asyncQueueControl->unnecessaryWakeups, 1);
+
/* Done with snapshot */
UnregisterSnapshot(snapshot);
}
@@ -2476,7 +2595,8 @@ static bool
asyncQueueProcessPageEntries(volatile QueuePosition *current,
QueuePosition stop,
char *page_buffer,
- Snapshot snapshot)
+ Snapshot snapshot,
+ bool *interested)
{
bool reachedStop = false;
bool reachedEndOfPage;
@@ -2537,6 +2657,9 @@ asyncQueueProcessPageEntries(volatile QueuePosition *current,
char *payload = qe->data + strlen(channel) + 1;
NotifyMyFrontEnd(channel, payload, qe->srcPid);
+
+ /* Mark that we were interested in at least one message */
+ *interested = true;
}
}
else
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9121a382f76..b259bccfa4b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8571,7 +8571,18 @@
proname => 'pg_notification_queue_usage', provolatile => 'v',
proparallel => 'r', prorettype => 'float8', proargtypes => '',
prosrc => 'pg_notification_queue_usage' },
-
+{ oid => '9315',
+ descr => 'get statistics about NOTIFY wakeups',
+ proname => 'pg_get_async_wakeup_stats', provolatile => 'v',
+ proparallel => 'r', prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,int8,int8,int8,int8,int8,int8,int8}', proargmodes => '{o,o,o,o,o,o,o,o}',
+ proargnames => '{signaled_targeted,advancing_behind,advancing_ahead,idle_behind,avoided_wakeups,already_ahead,necessary_wakeups,unnecessary_wakeups}',
+ prosrc => 'pg_get_async_wakeup_stats' },
+{ oid => '9316',
+ descr => 'reset statistics about NOTIFY wakeups',
+ proname => 'pg_reset_async_wakeup_stats', provolatile => 'v',
+ proparallel => 'r', prorettype => 'void', proargtypes => '',
+ prosrc => 'pg_reset_async_wakeup_stats' },
# shared memory usage
{ oid => '5052', descr => 'allocations from the main shared memory segment',
proname => 'pg_get_shmem_allocations', prorows => '50', proretset => 't',
--
2.50.1
Attachments:
[text/plain] 0003-optimize_listen_notify-v24.txt (12.5K, 2-0003-optimize_listen_notify-v24.txt)
[application/octet-stream] 0002-optimize_listen_notify-v24.patch (43.1K, 3-0002-optimize_listen_notify-v24.patch)
From 99c2bfe0c9d7c519c494bc1701750f5b60306ec4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Fri, 7 Nov 2025 19:08:39 +0100
Subject: [PATCH 2/3] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash table (dshash)
backed by a dynamic shared memory area (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
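As a rough illustration of the targeted lookup and the wakeupPending
check, consider the toy model below. It is a sketch under simplifying
assumptions, not the patch's code: a linear scan over a small array
stands in for the dshash lookup, and all Toy* / toy_* names are
hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_BACKENDS 8

/* Toy stand-in for one (dboid, channel) -> listeners mapping. */
typedef struct ToyChannelEntry
{
	unsigned	dboid;
	const char *channel;
	int			listeners[TOY_MAX_BACKENDS];	/* backend slot numbers */
	int			num_listeners;
} ToyChannelEntry;

/* One flag per backend slot: signal sent but not yet processed. */
static bool toy_wakeup_pending[TOY_MAX_BACKENDS];

/*
 * Collect the backends to signal for one notified channel.  Backends
 * that already have a wakeup pending are skipped, so each backend is
 * returned at most once until it clears its flag.  Returns the number
 * of entries written to targets[].
 */
static int
toy_collect_targets(const ToyChannelEntry *table, int ntable,
					unsigned dboid, const char *channel,
					int *targets)
{
	int			count = 0;

	for (int i = 0; i < ntable; i++)
	{
		if (table[i].dboid != dboid ||
			strcmp(table[i].channel, channel) != 0)
			continue;

		for (int j = 0; j < table[i].num_listeners; j++)
		{
			int			procno = table[i].listeners[j];

			if (toy_wakeup_pending[procno])
				continue;		/* duplicate signal avoided */
			toy_wakeup_pending[procno] = true;
			targets[count++] = procno;
		}
	}
	return count;
}
```

In this model a backend listening on two notified channels is collected
only once, which mirrors the purpose of the wakeupPending flag.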
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
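The decision made for a backend that is not listening on any of the
notified channels can be modelled as below. This is a simplified sketch:
ToyPos ignores SLRU page wraparound (the real comparison goes through
asyncQueuePagePrecedes), and the Toy* / toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified queue position; page wraparound is deliberately ignored. */
typedef struct ToyPos
{
	int64_t		page;
	int			offset;
} ToyPos;

typedef enum ToyAction
{
	TOY_SIGNAL,					/* must wake the backend */
	TOY_DIRECT_ADVANCE,			/* move its position to the new head */
	TOY_NONE					/* nothing to do */
} ToyAction;

/* Returns true if x comes before y in queue order. */
static bool
toy_pos_precedes(ToyPos x, ToyPos y)
{
	return x.page < y.page || (x.page == y.page && x.offset < y.offset);
}

/*
 * Decide what to do with a backend that is not listening on any of the
 * channels just notified.  old_head and new_head are the queue head
 * before and after our write.
 */
static ToyAction
toy_decide(bool is_advancing, ToyPos pos, ToyPos advancing_pos,
		   ToyPos old_head, ToyPos new_head)
{
	if (is_advancing)
	{
		/* It would stop short of the new head, so it needs a signal. */
		return toy_pos_precedes(advancing_pos, new_head) ? TOY_SIGNAL : TOY_NONE;
	}
	if (toy_pos_precedes(pos, old_head))
		return TOY_SIGNAL;		/* may have unread messages of interest */
	if (toy_pos_precedes(pos, new_head))
		return TOY_DIRECT_ADVANCE;	/* only our own entries lie ahead */
	return TOY_NONE;			/* already at or past the new head */
}
```

The direct-advance case is safe only because [old_head, new_head) is
known to contain nothing but the sender's own entries.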
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannelsHash
table for fast lock-free lookups during notification processing. This
avoids contention on the shared hash during the high-frequency
IsListeningOn checks that occur for every notification read from the
queue.
* Backends remain registered in the global listener list as long as
listenChannelsHash is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 716 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 616 insertions(+), 104 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..9f7b8a3324a 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * the shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +257,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +284,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +301,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * advancing their own positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +330,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +348,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +363,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +441,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +452,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +474,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,7 +498,6 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
@@ -457,16 +526,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -478,6 +540,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +682,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -657,6 +823,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -683,7 +850,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -783,30 +950,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -894,6 +1080,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list (compares pointers, not names) */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1138,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,12 +1171,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -957,7 +1197,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1242,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1135,50 +1376,145 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
+ * Add the new channel name to listenChannelsHash.
*
* XXX It is theoretically possible to get an out-of-memory failure here,
* which would be bad because we already committed. For the moment it
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,34 +1529,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1230,7 +1600,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1242,6 +1612,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1936,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in the notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1957,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1978,103 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking backends that are behind the queue head
+ * but aren't interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
/*
- * Listeners in other databases should be signaled only if they
- * are far behind.
+ * We need to signal advancing listening backends that would get
+ * stuck at a position before the new queue head. We also need to
+ * signal listening backends that are idle at a position before
+ * the old queue head since they could be interested in the
+ * messages in-between.
+ *
+ * Listening backends that are not advancing and are stationary at
+ * a position somewhere in the range we just wrote can safely be
+ * advanced directly to the new queue head, since we know that they
+ * are not interested in our messages.
*/
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2121,10 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1861,20 +2310,29 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we will now be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1987,6 +2445,8 @@ asyncQueueReadAllNotifications(void)
{
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
}
@@ -2186,7 +2646,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2290,13 +2750,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2771,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2794,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2867,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2879,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2890,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..b8443725398 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
[application/octet-stream] 0001-optimize_listen_notify-v24.patch (9.3K, 4-0001-optimize_listen_notify-v24.patch)
download | inline diff:
From fb822108149ea01fa25a46f1a4c0ba71f86e1a2b Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/3] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] pg_async_notify_test-v24.c (26.4K, 5-pg_async_notify_test-v24.c)
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-08 12:59 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-08 12:59 UTC (permalink / raw)
To: pgsql-hackers
On Fri, Nov 7, 2025, at 19:59, Joel Jacobson wrote:
> * The logic in SignalBackends has been reworked and simplified,
> thanks to the new isAdvancing and advancingPos fields.
> I now think it's finally easy to reason about why each branch
> in SignalBackends must be correct.
I was wrong. I wrongly assumed asyncQueueReadAllNotifications would read
up until head, which it might not actually do:
* Process messages up to the stop position, end of page, or an
* uncommitted message.
This in turn could cause a listening backend to remain behind if no
more notifies arrive, so it unfortunately seems like we will always
need to signal when a backend isAdvancing, and therefore have no use for
the advancingPos field.
I will do more correctness and benchmark testing before posting a new
version. Just wanted to give you a heads up on the bug, so you don't
waste time reviewing.
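
A minimal sketch of the failure mode described above, not code from the patch. It models a single read pass with the three stop conditions from the quoted comment; SketchEntry, SKETCH_PAGE_SIZE and sketch_read_notifications are hypothetical names, and the real reader loops over SLRU pages rather than a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified queue entry: only the commit state matters here. */
typedef struct SketchEntry
{
	bool		committed;
} SketchEntry;

/* Entries per page; an arbitrary value chosen for the illustration. */
#define SKETCH_PAGE_SIZE 4

/*
 * Advance from pos and stop at the first of: the stop position (head),
 * the end of the current page, or an uncommitted entry.  Returns the new
 * position, which may therefore be short of head.
 */
static size_t
sketch_read_notifications(const SketchEntry *queue, size_t pos, size_t head)
{
	size_t		page_end = (pos / SKETCH_PAGE_SIZE + 1) * SKETCH_PAGE_SIZE;

	assert(pos <= head);

	while (pos < head && pos < page_end && queue[pos].committed)
		pos++;

	return pos;
}
```

If the second of three entries is still uncommitted, a reader starting at 0 returns position 1 rather than 3. A notifier that assumed the reader would end up at head and skipped the signal would leave that listener behind until some later wakeup.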
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-08 15:04 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-08 15:04 UTC (permalink / raw)
To: pgsql-hackers
On Sat, Nov 8, 2025, at 13:59, Joel Jacobson wrote:
> On Fri, Nov 7, 2025, at 19:59, Joel Jacobson wrote:
>> * The logic in SignalBackends has been reworked and simplified,
>> thanks to the new isAdvancing and advancingPos fields.
>> I now think it's finally easy to reason about why each branch
>> in SignalBackends must be correct.
>
> I was wrong. I wrongly assumed asyncQueueReadAllNotifications would read
> up until head, which it might not actually do:
>
> * Process messages up to the stop position, end of page, or an
> * uncommitted message.
>
> This in turn could cause a listening backend to remain behind if no
> more notifies arrive, so it unfortunately seems like we will always
> need to signal when a backend isAdvancing, and therefore have no use for
> the advancingPos field.
>
> I will do more correctness and benchmark testing before posting a new
> version. Just wanted to give you a heads up on the bug, so you don't
> waste time reviewing.
Changes since v24:
* Removed the QueuePosition advancingPos QueueBackendStatus field.
* Always signal when isAdvancing is true.
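
A rough model of the per-listener decision after these changes, under stated assumptions; it is not the patch's SignalBackends code. The SketchListener struct, the SketchAction enum, the integer queue positions, and the order of the checks are all hypothetical. Only the field names wakeupPending and isAdvancing, and the rule "always signal when isAdvancing is true", come from the patch description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of the decision for one listener. */
typedef enum SketchAction
{
	SKETCH_NONE,
	SKETCH_SIGNAL,
	SKETCH_DIRECT_ADVANCE
} SketchAction;

/* Hypothetical, simplified per-backend state; positions are plain ints. */
typedef struct SketchListener
{
	int			pos;			/* queue position read up to */
	bool		listensOnNotifiedChannel;
	bool		wakeupPending;	/* signal sent but not yet processed */
	bool		isAdvancing;	/* backend is advancing its own position */
} SketchListener;

/*
 * Decide what to do with one listener after our commit wrote the region
 * [oldHead, newHead).  Lagging backends (pos before oldHead) are ignored
 * here; this sketch does not model how they are handled.
 */
static SketchAction
sketch_decide(SketchListener *l, int oldHead, int newHead)
{
	assert(oldHead <= newHead);

	/* Interested listeners and advancing backends are always signaled. */
	if (l->listensOnNotifiedChannel || l->isAdvancing)
	{
		if (l->wakeupPending)
			return SKETCH_NONE; /* assumed: avoid a duplicate signal */
		l->wakeupPending = true;
		return SKETCH_SIGNAL;
	}

	/* Idle and uninterested: skip our entries without waking it. */
	if (l->pos == oldHead)
	{
		l->pos = newHead;
		return SKETCH_DIRECT_ADVANCE;
	}

	return SKETCH_NONE;
}
```

In this model an idle listener at oldHead is moved to newHead without a signal, while a backend with isAdvancing set is signaled and its position is left for it to update itself.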
Some benchmarking:
master:
% ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.1 --sleep-exp 1.01
13 s: 8668 sent (658/s), 8676 received (653/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 3 (0.0%) avg: 0.783ms
1.00-10.00ms # 49 (0.6%) avg: 5.559ms
10.00-100.00ms # 168 (1.9%) avg: 56.360ms
>100.00ms ######### 8456 (97.5%) avg: 256.086ms
v25:
% ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.1 --sleep-exp 1.01
14 s: 27097 sent (1959/s), 27097 received (1960/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 962 (3.6%) avg: 0.081ms
0.10-1.00ms ######### 25066 (92.5%) avg: 0.321ms
1.00-10.00ms # 1069 (3.9%) avg: 2.104ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms
>100.00ms 0 (0.0%) avg: 0.000ms
On master, I see lots of "waiting for AccessExclusiveLock on object 0"
in the logs for this benchmark setup, but none at all for v25.
I wonder if there would be some way to solve the problem with v24,
without having to always signal when isAdvancing is true with an
advancingPos at or ahead of the notifier's queueHeadAfterWrite. One idea
I had, that I think is a bad idea, but just mentioning it in case you
think it could work, is to check if the new pos is at head, in
asyncQueueReadAllNotifications's PG_FINALLY block, and if not, to let
the listening backend signal itself. I guess this is a bad idea, since
then it could signal itself over and over again, and consume a lot of
resources, if some uncommitted message takes a long time to commit. The
current mechanism limits the rate of wake-ups to the same rate as the
notifies, which seems sensible. Thoughts on this?
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v25.patch (9.3K, 2-0001-optimize_listen_notify-v25.patch)
download | inline diff:
From def00569ec6d23b3652edf6a3727e007899fafe8 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v25.patch (42.7K, 3-0002-optimize_listen_notify-v25.patch)
download | inline diff:
From 2abed1a217642a9037fe03b270906c849662c3a9 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 8 Nov 2025 13:47:09 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks
whether it is currently advancing (isAdvancing). Because a backend that
is reading the queue may stop before reaching the head, SignalBackends
always signals a backend that is advancing, while safely direct-advancing
idle backends that would not be interested in the newly written
notifications.
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 708 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 608 insertions(+), 104 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..1f54fa8e41b 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +257,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +284,8 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
} QueueBackendStatus;
/*
@@ -260,9 +300,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * currently advancing their own positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +329,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +347,8 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +361,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +439,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +450,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +472,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,7 +496,6 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
@@ -457,16 +524,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -478,6 +538,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +680,16 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -657,6 +820,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -683,7 +847,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -783,30 +947,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -894,6 +1077,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1135,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,12 +1168,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -957,7 +1194,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1239,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1135,50 +1373,145 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
+ * Add the new channel name to listenChannelsHash.
*
* XXX It is theoretically possible to get an out-of-memory failure here,
* which would be bad because we already committed. For the moment it
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,34 +1526,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1230,7 +1597,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1242,6 +1609,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1933,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1954,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1975,99 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking backends that are not caught up but
+ * aren't interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
/*
- * Listeners in other databases should be signaled only if they
- * are far behind.
+ * We need to signal advancing listening backends since we
+ * don't know where they will stop advancing. We also need
+ * to signal listening backends that are idle at a position
+ * before the old queue head since they could be interested
+ * in the messages in-between.
+ *
+ * Listening backends that are not advancing and are stationary at
+ * a position somewhere in the range we just wrote can safely be
+ * directly advanced to the new queue head, since we know that they
+ * are not interested in our messages.
*/
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ if (QUEUE_BACKEND_IS_ADVANCING(i))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2114,10 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1861,20 +2303,28 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1987,6 +2437,8 @@ asyncQueueReadAllNotifications(void)
{
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
}
@@ -2186,7 +2638,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2290,13 +2742,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2763,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2786,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2859,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2871,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2882,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..7c2cf960093 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..4236965e72a 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -101,6 +101,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 432509277c9..ee3540643c8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-11 16:34 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-11 16:34 UTC (permalink / raw)
To: pgsql-hackers
On Sat, Nov 8, 2025, at 16:04, Joel Jacobson wrote:
> On Sat, Nov 8, 2025, at 13:59, Joel Jacobson wrote:
>> On Fri, Nov 7, 2025, at 19:59, Joel Jacobson wrote:
>> This in turn could cause a listening backend to remain behind, if there
>> would be no more notifies, so it unfortunately seems like we will always
>> need to signal when a backend isAdvancing, and therefore have no use of
>> the advancingPos field.
Having thought about this, I don't think this is actually a problem,
since this isn't any worse than what we currently have in master.
Listening backends can already end up stationary behind QUEUE_HEAD in
exactly this situation, when they don't read all the way up to QUEUE_HEAD
in asyncQueueReadAllNotifications. In that case, master relies on
another NOTIFY to wake them up, so v24 wouldn't be any worse.
My apologies for again mixing robustness improvements into this patch.
I must keep in mind that this is solely an optimization patch.
I'm therefore attaching v24 again, but renamed to v26.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v26.patch (9.3K, 2-0001-optimize_listen_notify-v26.patch)
From fb822108149ea01fa25a46f1a4c0ba71f86e1a2b Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/3] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v26.patch (43.1K, 3-0002-optimize_listen_notify-v26.patch)
From 99c2bfe0c9d7c519c494bc1701750f5b60306ec4 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Fri, 7 Nov 2025 19:08:39 +0100
Subject: [PATCH 2/3] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily created dynamic shared hash table (dshash)
backed by a dynamic shared memory area (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
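
As a rough illustration of the targeting step described above (this is not
code from the patch), the following toy sketch picks the backends to signal
for one notified channel. The names ToyChannelEntry, toy_wakeup_pending and
toy_collect_targets are hypothetical, and a linear scan over a small array
stands in for the dshash lookup; locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BACKENDS  8
#define MAX_LISTENERS 4

/* Hypothetical stand-in for one channel map entry: (dboid, channel) -> listeners. */
typedef struct ToyChannelEntry
{
    unsigned    dboid;
    const char *channel;
    int         listeners[MAX_LISTENERS];  /* backend slot numbers */
    int         numListeners;
} ToyChannelEntry;

/* Hypothetical stand-in for the per-backend wakeupPending flag. */
static bool toy_wakeup_pending[MAX_BACKENDS];

/*
 * Collect the backends to signal for one notified channel in one database.
 * Backends that already have a wakeup pending are skipped, so each backend
 * is selected at most once until it clears its flag.
 * Returns the number of backend slots written to out[].
 */
static int
toy_collect_targets(const ToyChannelEntry *map, int nentries,
                    unsigned dboid, const char *channel,
                    int *out, int outcap)
{
    int         n = 0;

    for (int i = 0; i < nentries; i++)
    {
        /* Assumption: a linear scan replaces the real hash lookup. */
        if (map[i].dboid != dboid || strcmp(map[i].channel, channel) != 0)
            continue;

        for (int j = 0; j < map[i].numListeners; j++)
        {
            int         b = map[i].listeners[j];

            assert(b >= 0 && b < MAX_BACKENDS);
            if (toy_wakeup_pending[b])
                continue;       /* already signaled, not yet processed */
            if (n < outcap)
            {
                toy_wakeup_pending[b] = true;
                out[n++] = b;
            }
        }
    }
    return n;
}
```

In this sketch a backend listening on two notified channels is selected only
once, because the flag set for the first channel suppresses the second
selection.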
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
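
The per-backend decision described above can be sketched roughly as follows.
This is a simplified illustration, not the patch's SignalBackends: queue
positions are modeled as a single integer instead of (page, offset), the
names ToyBackend, ToyAction and toy_decide are hypothetical, and the handling
of backends that lag far behind the head is left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyAction
{
    TOY_NONE,                   /* leave the backend alone */
    TOY_SIGNAL,                 /* send a wakeup signal */
    TOY_DIRECT_ADVANCE          /* move its position without waking it */
} ToyAction;

/* Hypothetical, flattened view of one listener's queue state. */
typedef struct ToyBackend
{
    long        pos;            /* queue position read up to */
    bool        interested;     /* listens on a channel notified by this commit */
    bool        isAdvancing;    /* currently reading the queue */
    long        advancingPos;   /* position it is advancing to */
} ToyBackend;

/*
 * Decide what to do with one backend after a commit wrote the region
 * [oldHead, newHead).  Assumes the caller holds whatever lock protects
 * these fields.
 */
static ToyAction
toy_decide(ToyBackend *b, long oldHead, long newHead)
{
    assert(oldHead <= newHead);

    /* A listener on a notified channel must read the new entries. */
    if (b->interested)
        return TOY_SIGNAL;

    /*
     * A backend that is already advancing is signaled only if its target
     * would still leave it behind the new head.
     */
    if (b->isAdvancing)
        return (b->advancingPos < newHead) ? TOY_SIGNAL : TOY_NONE;

    /* An idle backend sitting exactly at the old head can skip our entries. */
    if (b->pos == oldHead)
    {
        b->pos = newHead;
        return TOY_DIRECT_ADVANCE;
    }

    return TOY_NONE;
}
```

The point of the third branch is that the region [oldHead, newHead) is known
to contain only the committing backend's own entries, so skipping it cannot
hide a notification that an uninterested backend would have wanted.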
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannelsHash
table for fast lock-free lookups during notification processing. This
avoids contention on the shared hash during the high-frequency
IsListeningOn checks that occur for every notification read from the
queue.
* Backends remain registered in the global listener list as long as
listenChannelsHash is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 716 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 616 insertions(+), 104 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 4bd37d5beb5..9f7b8a3324a 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +257,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +284,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +301,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * already advancing themselves. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +330,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +348,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +363,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +441,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +452,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +474,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,7 +498,6 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
@@ -457,16 +526,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -478,6 +540,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -521,12 +682,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -657,6 +823,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -683,7 +850,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -783,30 +950,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -894,6 +1080,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -922,6 +1138,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -939,12 +1171,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -957,7 +1197,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1002,7 +1242,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1135,50 +1376,145 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
+ * Add the new channel name to listenChannelsHash.
*
* XXX It is theoretically possible to get an out-of-memory failure here,
* which would be bad because we already committed. For the moment it
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1193,34 +1529,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1230,7 +1600,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1242,6 +1612,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1936,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1957,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1978,103 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking non-caught up backends that aren't
+ * interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
/*
- * Listeners in other databases should be signaled only if they
- * are far behind.
+ * We need to signal advancing listening backends that would get
+ * stuck at a position before the new queue head. We also need to
+ * signal listening backends that are idle at a position before
+ * the old queue head since they could be interested in the
+ * messages in-between.
+ *
+ * Listening backends that are not advancing and are stationary at
+ * a position somewhere in the range we just wrote, can safely be
+ * direct advanced to the new queue head, since we know that they
+ * are not interested in our messages.
*/
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2121,10 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1861,20 +2310,29 @@ asyncQueueReadAllNotifications(void)
AsyncQueueEntry align;
} page_buffer;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1987,6 +2445,8 @@ asyncQueueReadAllNotifications(void)
{
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
}
@@ -2186,7 +2646,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2290,13 +2750,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2309,10 +2771,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2320,22 +2794,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2373,7 +2867,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2385,6 +2879,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2395,3 +2890,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..a4fadbd0767 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -366,6 +366,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..2768ddf4414 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -100,6 +100,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..b8443725398 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-12 16:57 Arseniy Mukhin <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Arseniy Mukhin @ 2025-11-12 16:57 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
Hi,
On Tue, Nov 11, 2025 at 7:35 PM Joel Jacobson <[email protected]> wrote:
>
> On Sat, Nov 8, 2025, at 16:04, Joel Jacobson wrote:
> > On Sat, Nov 8, 2025, at 13:59, Joel Jacobson wrote:
> >> On Fri, Nov 7, 2025, at 19:59, Joel Jacobson wrote:
> >> This in turn could cause a listening backend to remain behind, if there
> >> would be no more notifies, so it unfortunately seems like we will always
> >> need to signal when a backend isAdvancing, and therefore have no use of
> >> the advancingPos field.
>
> Having thought about this, I don't think this is actually a problem,
> since this isn't any worse than what we currently have in master.
> Listening backends can currently end up stationary behind QUEUE_HEAD, in
> exactly this situation, when they don't read up until QUEUE_HEAD in
> asyncQueueReadAllNotifications. In this case, we currently rely on
> another NOTIFY to wake them up, so v24 wouldn't be any worse.
>
> My apologies for again making the mistake of mixing in robustness
> improvements into this patch. I must keep in mind this is solely an
> optimization patch.
>
> I'm therefore attaching v24 again, but renamed to v26.
Thank you for the new version!
I read the direct advancement part of v26 and have one point about it:
The comment in SignalBackends says:
* Listening backends that are not advancing and are stationary at
* a position somewhere in the range we just wrote, can safely be
* direct advanced to the new queue head, since we know that they
* are not interested in our messages.
*/
IIUC it's impossible for the listener to stop somewhere in between
queueHeadBeforeWrite and queueHeadAfterWrite. If the listener has
managed to read the first notification from the notifier, it means the
notifier transaction is complete and the listener should stop only
after reading all notifications (so we should always see pos =
queueHeadAfterWrite or further).
So if I haven't missed anything, I think we can use QUEUE_POS_EQUAL as
the direct advancement condition:
if (!QUEUE_BACKEND_IS_ADVANCING(i) && QUEUE_POS_EQUAL(pos,
queueHeadBeforeWrite))
{
QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}
Best regards,
Arseniy Mukhin
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-12 20:37 Joel Jacobson <[email protected]>
parent: Arseniy Mukhin <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-12 20:37 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Wed, Nov 12, 2025, at 17:57, Arseniy Mukhin wrote:
> On Tue, Nov 11, 2025 at 7:35 PM Joel Jacobson <[email protected]> wrote:
>> I'm therefore attaching v24 again, but renamed to v26.
>
> Thank you for the new version!
Thanks for reviewing!
> I read direct advancement part of v26, one point about it:
>
> The comment in SignalBackends says:
>
> * Listening backends that are not advancing and are stationary at
> * a position somewhere in the range we just wrote, can safely be
> * direct advanced to the new queue head, since we know that they
> * are not interested in our messages.
> */
>
> IIUC it's impossible for the listener to stop somewhere in between
> queueHeadBeforeWrite and queueHeadAfterWrite. If the listener has
> managed to read the first notification from the notifier, it means the
> notifier transaction is complete and the listener should stop only
> after reading all notifications (so we should always see pos =
> queueHeadAfterWrite or further).
Here is what I think can happen:
If the notifications written by the notifier fill the current page,
the notifier updates QUEUE_HEAD. If a listening backend enters
asyncQueueReadAllNotifications at this point, it sets its local
`head` variable to the current QUEUE_HEAD. When the notifier then
continues filling the next page, it updates QUEUE_HEAD again, and
PreCommit_Notify overwrites queueHeadAfterWrite with the new
QUEUE_HEAD.
Sequence of events:
1. In the notifier, PreCommit_Notify calls asyncQueueAddEntries,
which updates QUEUE_HEAD when the page is full,
(and sets queueHeadAfterWrite to this value).
2. At this time, a listener wakes up and asyncQueueAddEntries
reads the current QUEUE_HEAD value and stores it
in its local `head` variable, and starts reading up to this pos.
3. In the notifier, PreCommit_Notify calls asyncQueueAddEntries
the second time, which updates QUEUE_HEAD,
and sets queueHeadAfterWrite to the final value
before returning.
For this reason, I think the listener could actually stop
in between queueHeadBeforeWrite and queueHeadAfterWrite,
since its local `head` variable could get the intermediary
QUEUE_HEAD value, when a page is full.
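
The sequence of events above can be replayed in a small standalone toy model. This is only a sketch under stated assumptions: `QueuePos`, `ENTRIES_PER_PAGE`, `add_entries_one_page` and `simulate_race` are hypothetical names that do not exist in async.c, the page capacity is counted in entries rather than bytes, and the transaction-visibility checks performed by the real reader are ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the queue position type used in async.c. */
typedef struct QueuePos
{
    int64_t page;
    int     offset;
} QueuePos;

/* Assumed page capacity in entries; the real queue counts bytes. */
#define ENTRIES_PER_PAGE 4

/* True if position a is strictly before position b. */
static bool
pos_precedes(QueuePos a, QueuePos b)
{
    return a.page < b.page || (a.page == b.page && a.offset < b.offset);
}

/*
 * Append up to n entries at *head, stopping at the page boundary, the way
 * one locked call that fills a page would.  Returns the entries left over.
 */
static int
add_entries_one_page(QueuePos *head, int n)
{
    while (n > 0)
    {
        head->offset++;
        n--;
        if (head->offset == ENTRIES_PER_PAGE)
        {
            head->page++;
            head->offset = 0;
            break;              /* page full: the lock is released here */
        }
    }
    return n;
}

/* Replays steps 1-3 and returns the position where the listener stops. */
static QueuePos
simulate_race(QueuePos *before, QueuePos *after)
{
    QueuePos queue_head = {0, 2};
    QueuePos listener_head;
    int      remaining = 5;

    *before = queue_head;

    /* Step 1: the notifier fills the current page and publishes the head. */
    remaining = add_entries_one_page(&queue_head, remaining);
    *after = queue_head;

    /* Step 2: the listener snapshots the intermediate head. */
    listener_head = queue_head;

    /* Step 3: the notifier writes the rest on the following page(s). */
    while (remaining > 0)
    {
        remaining = add_entries_one_page(&queue_head, remaining);
        *after = queue_head;
    }

    /* In this model the listener reads up to its snapshot and stops. */
    return listener_head;
}
```

With these assumed values the listener stops at page 1, offset 0, which lies after the pre-write head (0, 2) and before the final head (1, 3).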
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-12 20:53 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-11-12 20:53 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Wed, Nov 12, 2025, at 21:37, Joel Jacobson wrote:
> Sequence of events:
>
> 1. In the notifier, PreCommit_Notify calls asyncQueueAddEntries,
> which updates QUEUE_HEAD when the page is full,
> (and sets queueHeadAfterWrite to this value).
>
> 2. At this time, a listener wakes up and asyncQueueAddEntries
Correction:
I meant "asyncQueueReadAllNotifications" here, not "asyncQueueAddEntries".
> reads the current QUEUE_HEAD value and stores it
> in its local `head` variable, and starts reading up to this pos.
>
> 3. In the notifier, PreCommit_Notify calls asyncQueueAddEntries
> the second time, which updates QUEUE_HEAD,
> and sets queueHeadAfterWrite to the final value
> before returning.
>
> For this reason, I think the listener could actually stop
> in between queueHeadBeforeWrite and queueHeadAfterWrite,
> since its local `head` variable could get the intermediary
> QUEUE_HEAD value, when a page is full.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-13 06:28 Joel Jacobson <[email protected]>
parent: Arseniy Mukhin <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-13 06:28 UTC (permalink / raw)
To: Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Wed, Nov 12, 2025, at 17:57, Arseniy Mukhin wrote:
> IIUC it's impossible for the listener to stop somewhere in between
> queueHeadBeforeWrite and queueHeadAfterWrite. If the listener has
> managed to read the first notification from the notifier, it means the
> notifier transaction is complete and the listener should stop only
> after reading all notifications (so we should always see pos =
> queueHeadAfterWrite or further).
>
> So if I haven't missed anything, I think we can use QUEUE_POS_EQUAL as
> direct advancement condition:
>
> if (!QUEUE_BACKEND_IS_ADVANCING(i) && QUEUE_POS_EQUAL(pos,
> queueHeadBeforeWrite))
> {
> QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
> }
I added some logging just to test the hypothesis:
@@ -2072,6 +2082,12 @@ SignalBackends(void)
{
Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+ if (!QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
+ elog(LOG, "Direct advancement: PID %d from pos (%lld,%d) to queueHeadAfterWrite (%lld,%d)",
+ pid,
+ (long long) QUEUE_POS_PAGE(pos), QUEUE_POS_OFFSET(pos),
+ (long long) QUEUE_POS_PAGE(queueHeadAfterWrite), QUEUE_POS_OFFSET(queueHeadAfterWrite));
+
QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}
}
And I'm getting a lot of such log entries when benchmarking
`./pg_async_notify_test --listeners 1 --notifiers 1 --channels 50`
I think this confirms that listeners can actually stop somewhere in between
queueHeadBeforeWrite and queueHeadAfterWrite.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-13 06:36 Arseniy Mukhin <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Arseniy Mukhin @ 2025-11-13 06:36 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
On Thu, Nov 13, 2025 at 9:28 AM Joel Jacobson <[email protected]> wrote:
>
> On Wed, Nov 12, 2025, at 17:57, Arseniy Mukhin wrote:
> > IIUC it's impossible for the listener to stop somewhere in between
> > queueHeadBeforeWrite and queueHeadAfterWrite. If the listener has
> > managed to read the first notification from the notifier, it means the
> > notifier transaction is complete and the listener should stop only
> > after reading all notifications (so we should always see pos =
> > queueHeadAfterWrite or further).
> >
> > So if I haven't missed anything, I think we can use QUEUE_POS_EQUAL as
> > direct advancement condition:
> >
> > if (!QUEUE_BACKEND_IS_ADVANCING(i) && QUEUE_POS_EQUAL(pos,
> > queueHeadBeforeWrite))
> > {
> > QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
> > }
>
> I added some logging just to test the hypothesis:
>
> @@ -2072,6 +2082,12 @@ SignalBackends(void)
> {
> Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
>
> + if (!QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
> + elog(LOG, "Direct advancement: PID %d from pos (%lld,%d) to queueHeadAfterWrite (%lld,%d)",
> + pid,
> + (long long) QUEUE_POS_PAGE(pos), QUEUE_POS_OFFSET(pos),
> + (long long) QUEUE_POS_PAGE(queueHeadAfterWrite), QUEUE_POS_OFFSET(queueHeadAfterWrite));
> +
> QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
> }
> }
>
> And I'm getting a lot of such log entries when benchmarking
> `./pg_async_notify_test --listeners 1 --notifiers 1 --channels 50`
>
> I think this confirms that listeners can actually stop somewhere in between
> queueHeadBeforeWrite and queueHeadAfterWrite.
Ahh, yes, I think you are right. I missed that notifiers update the
head when they move to the next page. Thank you for the detailed
example and sorry for taking your time with it. I agree that
QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite) is correct and covers
more cases where we can do direct advancement.
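
To make the difference between the two conditions concrete, here is a minimal standalone sketch. It assumes a simplified QueuePosition with only a page and an offset, and the helper names (pos_equal, pos_precedes, can_advance_equal, can_advance_precedes) are hypothetical, not taken from async.c. It also leaves out the QUEUE_BACKEND_IS_ADVANCING check that the real condition includes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for async.c's QueuePosition (assumption: page and offset only). */
typedef struct QueuePosition
{
	int64_t		page;
	int			offset;
} QueuePosition;

static bool
pos_equal(QueuePosition x, QueuePosition y)
{
	return x.page == y.page && x.offset == y.offset;
}

/* Same ordering as QUEUE_POS_PRECEDES in the patch: true if x comes before y. */
static bool
pos_precedes(QueuePosition x, QueuePosition y)
{
	return x.page < y.page || (x.page == y.page && x.offset < y.offset);
}

/* Narrow condition: only a listener sitting exactly at the old head qualifies. */
static bool
can_advance_equal(QueuePosition pos, QueuePosition before)
{
	return pos_equal(pos, before);
}

/* Broader condition: any listener in [before, after) qualifies. */
static bool
can_advance_precedes(QueuePosition pos, QueuePosition before, QueuePosition after)
{
	/* Corresponds to the Assert shown in the quoted hunk. */
	assert(!pos_precedes(pos, before));
	return pos_precedes(pos, after);
}
```

A listener that stopped at a page boundary, for example at (4,0) when the write ran from (3,8000) to (4,120), is rejected by the equality test but accepted by the precedes test, which is the case the log entries above exposed.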
Best regards,
Arseniy Mukhin
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-13 07:13 Joel Jacobson <[email protected]>
parent: Arseniy Mukhin <[email protected]>
From: Joel Jacobson @ 2025-11-13 07:13 UTC
To: Arseniy Mukhin <[email protected]>; +Cc: pgsql-hackers
On Thu, Nov 13, 2025, at 07:36, Arseniy Mukhin wrote:
> On Thu, Nov 13, 2025 at 9:28 AM Joel Jacobson <[email protected]> wrote:
>> I think this confirms that listeners can actually stop somewhere in between
>> queueHeadBeforeWrite and queueHeadAfterWrite.
>
>
> Ahh, yes, I think you are right. I missed that notifiers update the
> head when they move to the next page. Thank you for the detailed
> example and sorry for taking your time with it. I agree that
> QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite) is correct and covers
> more cases where we can do direct advancement.
Thanks for investigating this; we now both have an even stronger mental
model of the code.
Attached, please find a new version rebased on top of the bug fix
patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and
1b46990.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v27.patch (9.3K, 2-0001-optimize_listen_notify-v27.patch)
From a7f75495d655dbaced4515f9f04a95b9da5905ad Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v27.patch (43.4K, 3-0002-optimize_listen_notify-v27.patch)
From bb5b0df0e307f88ef35a381b6881dc98bdd6ab05 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Fri, 7 Nov 2025 19:08:39 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
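
The targeted signaling described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: a fixed-size array stands in for the dshash table, ChannelEntrySketch, signal_listeners, MAX_BACKENDS and MAX_LISTENERS are hypothetical names, and a counter stands in for actually sending PROCSIG_NOTIFY_INTERRUPT.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BACKENDS 8
#define MAX_LISTENERS 4

/* Hypothetical fixed-size stand-in for one (dboid, channel) map entry. */
typedef struct ChannelEntrySketch
{
	unsigned int dboid;
	const char *channel;
	int			listeners[MAX_LISTENERS];	/* backend slot numbers */
	int			numListeners;
} ChannelEntrySketch;

/* One flag per backend slot: signal sent but not yet processed. */
static bool wakeupPending[MAX_BACKENDS];

/*
 * Signal only the backends listening on (dboid, channel).  Returns the
 * number of signals that would be sent; backends whose earlier wakeup is
 * still pending are skipped.
 */
static int
signal_listeners(const ChannelEntrySketch *map, int nentries,
				 unsigned int dboid, const char *channel)
{
	int			sent = 0;

	for (int i = 0; i < nentries; i++)
	{
		if (map[i].dboid != dboid || strcmp(map[i].channel, channel) != 0)
			continue;
		assert(map[i].numListeners <= MAX_LISTENERS);
		for (int j = 0; j < map[i].numListeners; j++)
		{
			int			backend = map[i].listeners[j];

			if (wakeupPending[backend])
				continue;		/* already signaled, not yet processed */
			wakeupPending[backend] = true;
			sent++;				/* real code would send the signal here */
		}
	}
	return sent;
}
```

In this model a second NOTIFY on the same channel sends nothing until the listeners clear their flags, and a listener on the same channel name in another database is never touched.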
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
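
As a rough illustration of the decision described in this section, the following standalone sketch classifies a backend that is not listening on any of the notified channels. The types and names (BackendSketch, Action, decide) are hypothetical, and the handling of a backend that lags behind oldHead without advancing is an assumption: the sketch simply leaves it alone, which is not necessarily what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for async.c's QueuePosition (assumption). */
typedef struct QueuePosition
{
	int64_t		page;
	int			offset;
} QueuePosition;

/* Hypothetical subset of the per-backend state described above. */
typedef struct BackendSketch
{
	QueuePosition pos;			/* position read so far */
	bool		isAdvancing;	/* currently reading the queue */
	QueuePosition advancingPos; /* target it is advancing to */
} BackendSketch;

typedef enum
{
	LEAVE_ALONE,
	SEND_SIGNAL,
	DIRECT_ADVANCE
} Action;

static bool
pos_precedes(QueuePosition x, QueuePosition y)
{
	return x.page < y.page || (x.page == y.page && x.offset < y.offset);
}

/* Decide what to do with a backend that has no interest in our channels. */
static Action
decide(const BackendSketch *b, QueuePosition oldHead, QueuePosition newHead)
{
	assert(!pos_precedes(newHead, oldHead));

	if (b->isAdvancing)
	{
		/* Signal only if its target would leave it behind the new head. */
		return pos_precedes(b->advancingPos, newHead) ? SEND_SIGNAL : LEAVE_ALONE;
	}

	/* Idle backend inside [oldHead, newHead): move it without a wakeup. */
	if (!pos_precedes(b->pos, oldHead) && pos_precedes(b->pos, newHead))
		return DIRECT_ADVANCE;

	/* Assumption: anything else is handled elsewhere. */
	return LEAVE_ALONE;
}
```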
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannelsHash
table for fast lock-free lookups during notification processing. This
avoids contention on the shared hash during the high-frequency
IsListeningOn checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannelsHash is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 718 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 617 insertions(+), 105 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index e1cf659485a..3f1fa7ac0cc 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,21 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, avoiding unnecessary
+ * wakeups for idle listeners that have nothing to read.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +144,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +172,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +257,16 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
* we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * also the distance by which a backend needs to be behind before we'll
+ * decide we need to wake it up to advance its pointer.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +284,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +301,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * already advancing on their own. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +330,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +348,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +363,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +441,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +452,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +474,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,7 +498,6 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
@@ -456,16 +525,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +539,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +681,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +822,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,7 +849,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -782,30 +949,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -893,6 +1079,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1137,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1170,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1196,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1001,7 +1241,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1134,50 +1375,145 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
+ * Add the new channel name to listenChannelsHash.
*
* XXX It is theoretically possible to get an out-of-memory failure here,
* which would be bad because we already committed. For the moment it
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1192,34 +1528,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1599,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1611,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1936,15 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are known to still be positioned at the queue head
+ * from before our commit can be safely advanced directly to the new
+ * head, since the queue region we wrote is known to contain only our
+ * own notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1957,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1978,103 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ /*
+ * Direct advancement: avoid waking non-caught up backends that aren't
+ * interested in our notifications.
+ */
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
/*
- * Listeners in other databases should be signaled only if they
- * are far behind.
+ * We need to signal advancing listening backends that would get
+ * stuck at a position before the new queue head. We also need to
+ * signal listening backends that are idle at a position before
+ * the old queue head since they could be interested in the
+ * messages in-between.
+ *
+ * Listening backends that are not advancing and are stationary at
+ * a position somewhere in the range we just wrote, can safely be
+ * direct advanced to the new queue head, since we know that they
+ * are not interested in our messages.
*/
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
- continue;
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2121,10 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1854,20 +2303,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2412,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2055,7 +2515,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2310,7 +2770,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2414,13 +2874,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2433,10 +2895,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2444,22 +2918,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2497,7 +2991,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2509,6 +3003,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2519,3 +3014,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..7c2cf960093 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..4236965e72a 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -101,6 +101,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..c9917e87d45 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-14 16:01 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-14 16:01 UTC (permalink / raw)
To: pgsql-hackers
On Thu, Nov 13, 2025, at 08:13, Joel Jacobson wrote:
> Attached, please find a new version rebased on top of the bug fix
> patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and
> 1b46990.
To help reviewers, here is a new write-up of the patch:
PROBLEM
=======
The current implementation has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, SignalBackends() iterates over all registered listeners
in the same database and sends each one a PROCSIG_NOTIFY_INTERRUPT
signal, regardless of whether they are listening on the notified
channel.
This behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers unnecessary wakeups and context switches. As the number of idle
listeners grows, this often becomes the bottleneck and throughput drops
sharply.
Performance degrades dramatically: benchmarks show throughput dropping
from ~9,000 TPS with few listeners to ~200 TPS with 1,000 idle listeners
on unrelated channels - a 45x slowdown purely from waking backends that
have no notifications to process.
SOLUTION OVERVIEW
=================
This patch introduces two optimizations:
1. Targeted Signaling
A lazily-created dynamic shared hash table (dshash), backed by a dynamic
shared memory area (DSA), maps (database OID, channel name) to arrays
of listening backends (ProcNumbers). This allows the notifier to signal
only those backends actually listening on the channels being
notified.
2. Direct Advancement
Even with targeted signaling, idle backends might still need to be
woken to advance their queue read positions past notifications they
don't care about. This patch avoids those unnecessary wakeups by
directly advancing the queue positions of idle backends that are not
listening on the channels being notified.
This is possible because all NOTIFY writers are serialized by a
heavyweight lock, allowing the notifier to identify precisely which
queue entries belong to the current transaction. The notifier can
then determine which idle backends are positioned within that range
and safely advance their positions without waking them, since we know
from the shared channel hash that they are not listening on any of
the notified channels.
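
To make the targeted signaling idea concrete, here is a minimal
standalone sketch. It is not the patch's code: the Channel struct,
MAX_BACKENDS and collect_targets() are hypothetical names, a linear
search stands in for the dshash lookup, and plain ints stand in for
ProcNumbers. It only shows how each notified channel is looked up and
how a per-backend wakeupPending flag keeps a backend from being
collected twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BACKENDS 8			/* hypothetical limit for this sketch */

/* Hypothetical stand-in for one (channel -> listeners) entry. */
typedef struct Channel
{
	const char *name;
	int			listeners[MAX_BACKENDS];	/* backend numbers */
	int			numListeners;
} Channel;

/*
 * For each notified channel, find its listeners and collect every backend
 * exactly once.  wakeupPending[] doubles as the "already collected" marker.
 * Returns the number of backend numbers written to out[].
 */
static int
collect_targets(const Channel *channels, int numChannels,
				const char *const *notified, int numNotified,
				bool *wakeupPending, int *out)
{
	int			count = 0;

	assert(channels != NULL && notified != NULL);

	for (int n = 0; n < numNotified; n++)
	{
		/* Linear search; the patch uses a shared hash lookup instead. */
		for (int c = 0; c < numChannels; c++)
		{
			if (strcmp(channels[c].name, notified[n]) != 0)
				continue;

			for (int j = 0; j < channels[c].numListeners; j++)
			{
				int			procno = channels[c].listeners[j];

				if (wakeupPending[procno])
					continue;	/* already scheduled for a signal */
				wakeupPending[procno] = true;
				out[count++] = procno;
			}
		}
	}
	return count;
}
```

With channels a = {0, 1}, b = {1, 2} and c = {3}, notifying a and b
collects backends 0, 1 and 2 once each and leaves backend 3 alone.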
IMPLEMENTATION DETAILS
=======================
Shared Channel Hash
-------------------
The patch adds a dshash table that maps (dboid, channel) keys to
ChannelEntry structures.
The listenersArray starts with capacity for 4 listeners and doubles when
full. Memory is allocated from a DSA area and freed when a channel has
zero listeners.
The table is created lazily on the first LISTEN command. The DSA handle
and dshash handle are stored in AsyncQueueControl for other backends to
attach.
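
The growth and cleanup behavior of the listeners array can be sketched
roughly as follows. This is a simplified model, not the patch itself:
malloc/free stand in for dsa_allocate/dsa_free, int stands in for
ProcNumber, the ListenerArray type and both function names are
hypothetical, and the dshash entry locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define INITIAL_LISTENERS_ARRAY_SIZE 4

/* Hypothetical stand-in for the per-channel entry. */
typedef struct ListenerArray
{
	int		   *listeners;		/* stand-in for the DSA-allocated array */
	int			numListeners;
	int			allocatedListeners;
} ListenerArray;

/* Add procno unless already present; returns false on allocation failure. */
static bool
listener_array_add(ListenerArray *a, int procno)
{
	assert(a != NULL);

	if (a->listeners == NULL)
	{
		/* First listener for this channel */
		a->listeners = malloc(sizeof(int) * INITIAL_LISTENERS_ARRAY_SIZE);
		if (a->listeners == NULL)
			return false;
		a->numListeners = 0;
		a->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
	}

	for (int i = 0; i < a->numListeners; i++)
		if (a->listeners[i] == procno)
			return true;		/* already registered */

	if (a->numListeners >= a->allocatedListeners)
	{
		/* Array is full: allocate double the size and copy over */
		int			new_size = a->allocatedListeners * 2;
		int		   *new_array = malloc(sizeof(int) * new_size);

		if (new_array == NULL)
			return false;
		memcpy(new_array, a->listeners, sizeof(int) * a->numListeners);
		free(a->listeners);
		a->listeners = new_array;
		a->allocatedListeners = new_size;
	}

	a->listeners[a->numListeners++] = procno;
	return true;
}

/* Remove procno and compact; free the array when the last listener leaves. */
static void
listener_array_remove(ListenerArray *a, int procno)
{
	for (int i = 0; i < a->numListeners; i++)
	{
		if (a->listeners[i] != procno)
			continue;

		a->numListeners--;
		if (i < a->numListeners)
			memmove(&a->listeners[i], &a->listeners[i + 1],
					sizeof(int) * (a->numListeners - i));

		if (a->numListeners == 0)
		{
			free(a->listeners);
			a->listeners = NULL;
			a->allocatedListeners = 0;
		}
		return;
	}
}
```

Adding a fifth listener to a fresh array triggers one doubling from 4
to 8 slots; removing the last listener releases the memory, mirroring
the "freed when a channel has zero listeners" rule above.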
Dual Data Structures
--------------------
The implementation maintains two complementary data structures:
1. Shared channelHash: Used during commit to determine which backends
need to be signaled. Updated during Exec_ListenCommit/UnlistenCommit/
UnlistenAllCommit.
2. Local listenChannelsHash: Changed from a List to an HTAB for fast
lookups, used by IsListeningOn().
This separation avoids contention on the shared hash during the frequent
IsListeningOn() checks that occur for every notification read from the
queue.
Direct Advancement Algorithm
-----------------------------
In PreCommit_Notify(), while holding the heavyweight lock on "database
0" that serializes all NOTIFY writers:
1. Before writing the first notification, capture queueHeadBeforeWrite
2. Write all notifications for the transaction to the queue
3. After writing the last notification, capture queueHeadAfterWrite
The heavyweight lock guarantee means the range [queueHeadBeforeWrite,
queueHeadAfterWrite) contains only notifications written by this commit,
and no other backend could have inserted entries in this range.
SignalBackends() then processes each backend:
- If the backend has wakeupPending: skip (already signaled)
- If the backend is advancing (reading the queue):
    If advancingPos < queueHeadAfterWrite: signal it
    (it will get stuck before our new entries without a signal)
- If the backend is idle:
    If pos < queueHeadBeforeWrite: signal it
    (it might be interested in older messages)
    If pos >= queueHeadBeforeWrite AND pos < queueHeadAfterWrite:
      Direct advance pos to queueHeadAfterWrite
      (skip our messages entirely, no signal needed)
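
The per-backend decision above can be modeled in a few lines of
standalone C. This is only a sketch under simplifying assumptions: a
queue position is a single increasing integer rather than a (page,
offset) pair, the BackendState struct, SignalAction enum and
decide_action() are hypothetical names, and the function is meant for
backends that were not already selected through the channel hash.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplification of QueuePosition: one increasing integer. */
typedef long QueuePos;

typedef struct BackendState
{
	QueuePos	pos;			/* position the backend has read up to */
	bool		wakeupPending;	/* signal sent but not yet processed */
	bool		isAdvancing;	/* backend is currently reading the queue */
	QueuePos	advancingPos;	/* head position it is reading towards */
} BackendState;

typedef enum
{
	ACTION_SKIP,				/* nothing to do */
	ACTION_SIGNAL,				/* backend must be woken up */
	ACTION_DIRECT_ADVANCE		/* pos moved to headAfter, no signal sent */
} SignalAction;

static SignalAction
decide_action(BackendState *b, QueuePos headBefore, QueuePos headAfter)
{
	assert(headBefore <= headAfter);

	if (b->wakeupPending)
		return ACTION_SKIP;		/* already signaled */

	if (b->isAdvancing ? b->advancingPos < headAfter : b->pos < headBefore)
	{
		/* Would stop short of the new head, or may have older work to do */
		b->wakeupPending = true;
		return ACTION_SIGNAL;
	}

	if (!b->isAdvancing && b->pos < headAfter)
	{
		/* Idle inside [headBefore, headAfter): only our entries lie ahead */
		b->pos = headAfter;
		return ACTION_DIRECT_ADVANCE;
	}

	return ACTION_SKIP;			/* already at or beyond the new head */
}
```

For example, with headBefore = 100 and headAfter = 110, an idle backend
at 100 is advanced to 110 without a signal, while an idle backend at 90
is signaled.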
New QueueBackendStatus Fields
-----------------------------
Each backend's entry in AsyncQueueControl now includes:
wakeupPending: signal sent but not yet processed
isAdvancing: backend is advancing its position
advancingPos: target position backend is advancing to
These flags ensure correct interaction between direct advancement and
backends that are concurrently processing their queue.
Transaction-Local State
------------------------
PreCommit_Notify() builds a list of unique channels
(pendingNotifyChannels) from the transaction's notifications. This list
is used by SignalBackends() to look up listeners in the shared hash
efficiently, avoiding duplicate lookups when multiple notifications are
sent to the same channel.
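
Building such a list of unique channels could look roughly like the
sketch below. The function collect_unique_channels() is a hypothetical
name, plain C arrays stand in for PostgreSQL's List, and the linear
scan corresponds to the small-transaction case. Note that names are
compared by content with strcmp, which is an assumption of this sketch:
two notifications on the same channel will generally hold separate
copies of the name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy each distinct channel name from channels[0..n) into unique[],
 * preserving first-seen order.  unique[] must have room for n pointers.
 * Returns the number of distinct names.
 */
static size_t
collect_unique_channels(const char *const *channels, size_t n,
						const char **unique)
{
	size_t		count = 0;

	assert(channels != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		size_t		j;

		/* Compare by string content, not by pointer identity */
		for (j = 0; j < count; j++)
			if (strcmp(unique[j], channels[i]) == 0)
				break;

		if (j == count)
			unique[count++] = channels[i];	/* first time we see it */
	}
	return count;
}
```

This is quadratic in the number of distinct channels, which is
presumably why a hash table takes over for larger transactions.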
Functions Modified
------------------
AsyncShmemInit
Initialize channelHashDSA/DSH handles (InvalidHandle) and new
per-backend fields: wakeupPending, isAdvancing, advancingPos.
Async_Notify
Initialize channelHashtab.
pg_listening_channels
Rewritten to iterate over listenChannelsHash using HASH_SEQ_STATUS
instead of traversing the old listenChannels list.
PreCommit_Notify
Build pendingNotifyChannels list of unique channels from transaction's
notifications. Capture queueHeadBeforeWrite before writing first
notification and queueHeadAfterWrite after each write to enable direct
advancement optimization.
AtCommit_Notify
Check hash table entry count instead of list emptiness when deciding
whether to unregister from listener array.
Exec_ListenCommit
Complete rewrite to maintain both local listenChannelsHash and shared
channelHash. Insert backend's ProcNumber into DSA-allocated listeners
array, growing array (doubling strategy) when full.
Exec_UnlistenCommit
Remove from both local and shared hashes. Compact listeners array with
memmove, free DSA memory and delete hash entry when last listener
removed.
Exec_UnlistenAllCommit
Iterate shared channelHash with dshash_seq_*, remove this backend from
all channel entries in current database, clean up DSA memory and
delete entries when empty.
IsListeningOn
Simplified to single hash_search() call on listenChannelsHash.
asyncQueueUnregister
Clear QUEUE_BACKEND_WAKEUP_PENDING flag and update assertion to check
hash table instead of list.
SignalBackends
Rewrite to use targeted signaling instead of broadcast. Iterate
pendingNotifyChannels, look up listeners per channel in shared
channelHash. Implement direct advancement: advance idle backends
positioned in [queueHeadBeforeWrite, queueHeadAfterWrite) without
signaling. Use wakeupPending flag to prevent duplicate signals and
respect isAdvancing flag to avoid interfering with concurrent position
updates.
AtAbort_Notify
Use listenChannelsHash instead of listenChannels.
asyncQueueReadAllNotifications
Set isAdvancing flag and advancingPos before reading, clear
isAdvancing after advancing position.
asyncQueueProcessPageEntries
Use listenChannelsHash instead of listenChannels.
ProcessIncomingNotify
Use listenChannelsHash instead of listenChannels.
AddEventToPendingNotifies
Build channelHashtab when notification count exceeds
MIN_HASHABLE_NOTIFIES, enabling efficient extraction of unique channel
names in PreCommit_Notify.
ClearPendingActionsAndNotifies
Also free pendingNotifyChannels.
Functions Added
---------------
asyncQueuePagePrecedes
Inline function returning true if page p precedes page q (p < q).
channelHashFunc
Hash function for ChannelHashKey, combining hash of dboid and channel
name using XOR. Required callback for dshash operations.
initChannelHash
Lazy initialization of shared dshash table mapping (dboid, channel) to
listener arrays. First caller creates DSA area and dshash, stores
handles in asyncQueueControl; subsequent callers attach using stored
handles.
initListenChannelsHash
Lazy initialization of backend-local hash table (listenChannelsHash)
for faster IsListeningOn() checks.
ChannelHashPrepareKey
Inline helper to construct ChannelHashKey.
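
To illustrate the key preparation and XOR-combined hashing described
for ChannelHashPrepareKey and channelHashFunc, here is a standalone
sketch. It makes several assumptions: NAMEDATALEN is taken to be 64,
uint32_t stands in for Oid, FNV-1a stands in for PostgreSQL's own hash
functions, and prepare_key(), fnv1a() and channel_key_hash() are
hypothetical names rather than the patch's functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAMEDATALEN 64			/* assumed default value */

typedef struct ChannelHashKey
{
	uint32_t	dboid;			/* stand-in for Oid */
	char		channel[NAMEDATALEN];
} ChannelHashKey;

/*
 * Zero the whole key first, so that padding and the unused tail of
 * channel[] are identical for equal keys and memcmp() can compare them.
 */
static void
prepare_key(ChannelHashKey *key, uint32_t dboid, const char *channel)
{
	assert(key != NULL && channel != NULL);

	memset(key, 0, sizeof(*key));
	key->dboid = dboid;
	/* Leave the final byte as the terminating NUL set by memset */
	strncpy(key->channel, channel, NAMEDATALEN - 1);
}

/* FNV-1a, used here only as a simple stand-in hash. */
static uint32_t
fnv1a(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Combine the hash of dboid and the hash of the channel name with XOR. */
static uint32_t
channel_key_hash(const ChannelHashKey *key)
{
	return fnv1a(&key->dboid, sizeof(key->dboid)) ^
		fnv1a(key->channel, strlen(key->channel));
}
```

Two keys prepared from the same (dboid, channel) pair are byte-for-byte
identical and hash to the same value, which is what a hash table with a
fixed-size binary key relies on.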
TESTING
=======
The patch adds comprehensive isolation tests covering:
1. Subtransaction handling:
- LISTEN in subtransaction with SAVEPOINT/RELEASE
- LISTEN merge path (both outer and inner transactions)
- ROLLBACK TO SAVEPOINT discarding pending actions
2. Notification deduplication:
- Hash table duplicate detection with 17 notifications + 1 duplicate
3. Listener array growth:
- Multiple listeners triggering ChannelEntry array expansion
4. Cross-session delivery:
- Notifications from non-listening backend to listener in another
session
Total test sessions expanded from 3 to 7 to cover these scenarios.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-15 21:53 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 2 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-11-15 21:53 UTC (permalink / raw)
To: pgsql-hackers
On Fri, Nov 14, 2025, at 17:01, Joel Jacobson wrote:
> On Thu, Nov 13, 2025, at 08:13, Joel Jacobson wrote:
>> Attached, please find a new version rebased on top of the bug fix
>> patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and
>> 1b46990.
>
> To help reviewers, here is a new write-up of the patch:
> [...write-up...]
While reviewing all the comments in async.c to make sure they match the
patch's code, I noticed a few discrepancies. One of them was the comment
above QUEUE_CLEANUP_DELAY, which again made me think about how master
currently uses that value as the threshold for when to "wake laggers".
In this patch, QUEUE_CLEANUP_DELAY is no longer used for that
purpose; it now only determines how often we try to advance the tail
pointer.
I realize that someone familiar with the current code in master
might ask the following question:
Why not do direct advancement, but just use the old "wake laggers"
logic for listeners that lag behind more than QUEUE_CLEANUP_DELAY?
On the surface it might look like a plausible alternative,
since that's what master currently does (but for other databases).
I was curious to see how such an alternative approach would affect
performance, so I changed SignalBackends and ran some benchmarks.
Below is the v27 logic:
```
if (QUEUE_BACKEND_IS_ADVANCING(i) ?
QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
{
Assert(pid != InvalidPid);
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
count++;
}
else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
{
Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}
```
And here is the modified "wake laggers" version I tested:
```
if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite) &&
QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
{
QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
}
else if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos)) >= QUEUE_CLEANUP_DELAY)
{
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
count++;
}
```
This preserves direct advancement under the same conditions as the patch,
but only sends signals to backends that are "laggers" by
the QUEUE_CLEANUP_DELAY threshold, similar to master's behavior.
It turns out the "wake laggers" approach is significantly slower:
"wake laggers":
./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01
20 s: 80871 sent (4650/s), 80871 received (4650/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 380 (0.5%) avg: 0.085ms
0.10-1.00ms # 2765 (3.4%) avg: 0.289ms
1.00-10.00ms # 4947 (6.1%) avg: 5.870ms
10.00-100.00ms ####### 57467 (71.1%) avg: 52.101ms
>100.00ms # 15312 (18.9%) avg: 119.847ms
v27:
./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01
20 s: 229866 sent (13985/s), 229865 received (13984/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 18351 (8.0%) avg: 0.072ms
0.10-1.00ms # 32899 (14.3%) avg: 0.315ms
1.00-10.00ms #### 103273 (44.9%) avg: 5.055ms
10.00-100.00ms ### 75342 (32.8%) avg: 18.197ms
>100.00ms 0 (0.0%) avg: 0.000ms
I reran the experiments three times with similar results.
I also tested other test permutations, which also showed "wake laggers" was worse.
At first glance, both approaches ultimately signal all backends
interested in our notifies, so it may seem surprising that latency
differs this much. The key point is what happens to non-interested
backends:
A backend that is *not* listening to our channels may have stopped at a
position before the old queue head, because it woke up and read the head
before all notifies for a previous commit were written. Such a backend
might actually be interested in the notifies that lie between its
current position and the old queue head, and it could therefore be
urgent to wake it up so that it processes the queue and delivers the
notifies.
In v27, such a backend gets signaled on the next NOTIFY, when the notifier
notices that it is stationary at a position behind queueHeadBeforeWrite.
With "wake laggers", however, it receives no signal until it becomes a
"lagger" by QUEUE_CLEANUP_DELAY pages, which can be far later. This
risks delaying its processing and the delivery of notifications it is
interested in.
In master this is fine because "wake laggers" is only applied to
backends in other databases that we know are not interested in our
notifications.
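To make the difference concrete, here is a small standalone model of the two decision rules quoted above. It is only a sketch under simplifying assumptions: `Listener`, `Action`, `decide_v27` and `decide_wake_laggers` are hypothetical names that do not exist in async.c, plain integer comparison stands in for asyncQueuePagePrecedes, and the QUEUE_CLEANUP_DELAY value of 4 is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_CLEANUP_DELAY 4   /* assumed value, for illustration only */

typedef struct { int64_t page; int offset; } QueuePosition;

/* Hypothetical per-listener state, mirroring pos/isAdvancing/advancingPos */
typedef struct
{
    QueuePosition pos;
    bool          isAdvancing;
    QueuePosition advancingPos;
} Listener;

typedef enum { ACT_NONE, ACT_SIGNAL, ACT_ADVANCE } Action;

/* true if x comes before y in queue order (no wraparound in this model) */
static bool
pos_precedes(QueuePosition x, QueuePosition y)
{
    return x.page < y.page || (x.page == y.page && x.offset < y.offset);
}

/* Model of the v27 rule: signal anything that would be left behind */
static Action
decide_v27(Listener l, QueuePosition before, QueuePosition after)
{
    if (l.isAdvancing ? pos_precedes(l.advancingPos, after)
                      : pos_precedes(l.pos, before))
        return ACT_SIGNAL;
    if (!l.isAdvancing && pos_precedes(l.pos, after))
        return ACT_ADVANCE;     /* sits in [before, after): skip our entries */
    return ACT_NONE;
}

/* Model of the "wake laggers" variant: signal only beyond the page threshold */
static Action
decide_wake_laggers(Listener l, QueuePosition before, QueuePosition after,
                    QueuePosition head)
{
    if (!l.isAdvancing &&
        !pos_precedes(l.pos, before) &&
        pos_precedes(l.pos, after))
        return ACT_ADVANCE;
    if (head.page - l.pos.page >= QUEUE_CLEANUP_DELAY)
        return ACT_SIGNAL;
    return ACT_NONE;
}
```

For a non-advancing listener stopped at (page 0, offset 100) with the old head at (0, 200) and the new head at (0, 300), the v27 model returns ACT_SIGNAL, while the "wake laggers" model returns ACT_NONE because the listener is fewer than QUEUE_CLEANUP_DELAY pages behind the head. That is the delayed-delivery case described above.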
In conclusion, the "wake laggers" mechanism seems inherently
incompatible with the direct advancement mechanism, and I therefore
think the current approach in v27 is sound.
I just wanted to share this reasoning, not because anyone has raised any
concerns about this, but because I hadn't fully internalized this myself,
and thought the reasoning could possibly be helpful to others.
The attached v28 is the same as v27, except some comments have been
fixed to accurately reflect the code.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v28.patch (9.3K, 2-0001-optimize_listen_notify-v28.patch)
From a7f75495d655dbaced4515f9f04a95b9da5905ad Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v28.patch (43.5K, 3-0002-optimize_listen_notify-v28.patch)
From 4c9a83b319e1ceb8ca63274d6729ffbf2a774c4d Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 15 Nov 2025 22:18:50 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
At commit time:
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
Idle backends that are stationary at a position before the old queue
head are signaled, since they might be interested in the notifications
in between their current position and the old queue head.
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 713 +++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 610 insertions(+), 107 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index e1cf659485a..dbc90b887b2 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,24 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, or within the range
+ * written, avoiding unnecessary wakeups for idle listeners that have
+ * nothing to read. Backends that cannot be direct advanced are signaled
+ * if they are stuck behind the old queue head, or advancing to a position
+ * before the new queue head, since otherwise notifications could be delayed.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +147,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +175,29 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ProcNumber array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +260,14 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
- * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +285,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +302,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * in the process of doing that themselves. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +331,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +349,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +364,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +442,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +453,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +475,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,7 +499,6 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
@@ -456,16 +526,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +540,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +682,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +823,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,7 +850,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -782,30 +950,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -893,6 +1080,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1138,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1171,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1197,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1001,7 +1242,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1134,50 +1376,145 @@ Exec_ListenPreCommit(void)
static void
Exec_ListenCommit(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ProcNumber *listeners;
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
+ * Add the new channel name to listenChannelsHash.
*
* XXX It is theoretically possible to get an out-of-memory failure here,
* which would be bad because we already committed. For the moment it
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Now update the shared channelHash for SignalBackends() to use */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * For new entries, we initialize listenersArray to InvalidDsaPointer as a
+ * marker. This handles both the initial creation and potential retry
+ * after OOM.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ dshash_release_lock(channelHash, entry);
+ return; /* Already registered */
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners] = MyProcNumber;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i] == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1192,34 +1529,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ProcNumber *listeners;
+ int i;
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i] == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ProcNumber) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1600,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1612,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1937,21 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are not interested in our notifies, that are known
+ * to still be positioned at the old queue head, or anywhere in the
+ * queue region we just wrote, can be safely advanced directly to the
+ * new head, since that region is known to contain only our own
+ * notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
+ *
+ * Backends that are not interested in our notifies, that are advancing
+ * to a target position before the new queue head, or that are not
+ * advancing and are stationary at a position before the old queue head
+ * need to be signaled since notifications could otherwise be delayed.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +1964,13 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ /*
+ * Attach to the channel hash if needed. We might not have one if this
+ * backend hasn't done LISTEN, but we need it to find listeners.
+ */
+ initChannelHash();
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +1985,87 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ProcNumber *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,9 +2112,10 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1854,20 +2294,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2403,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2055,7 +2506,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2310,7 +2761,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2414,13 +2865,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2433,10 +2886,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2444,22 +2909,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2497,7 +2982,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2509,6 +2994,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2519,3 +3005,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..7c2cf960093 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..4236965e72a 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -101,6 +101,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..c9917e87d45 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-17 07:04 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-11-17 07:04 UTC (permalink / raw)
To: pgsql-hackers
On Sat, Nov 15, 2025, at 22:53, Joel Jacobson wrote:
> On Fri, Nov 14, 2025, at 17:01, Joel Jacobson wrote:
>> On Thu, Nov 13, 2025, at 08:13, Joel Jacobson wrote:
>>> Attached, please find a new version rebased on top of the bug fix
>>> patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and
>>> 1b46990.
>>
>> To help reviewers, here is a new write-up of the patch:
>> [...write-up...]
>
...
> The attached v28 is the same as v27, except some comments have been
> fixed to accurately reflect the code.
I note that LISTEN/NOTIFY yet again made it to the front page of Hacker
News due to complaints about it being a bottleneck:
https://peterullrich.com/listen-to-database-changes-through-the-postgres-wal
https://news.ycombinator.com/item?id=45885768
Unfortunately, the article doesn't say if the workload in the example
is made up, or if it's based on actual numbers, and it doesn't say
if they listened to a single channel or multiple channels.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-18 08:15 Chao Li <[email protected]>
parent: Joel Jacobson <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Chao Li @ 2025-11-18 08:15 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
Hi Joel,
> On Nov 16, 2025, at 05:53, Joel Jacobson <[email protected]> wrote:
>
> The attached v28 is the same as v27, except some comments have been
> fixed to accurately reflect the code.
>
> /Joel
> <0001-optimize_listen_notify-v28.patch> <0002-optimize_listen_notify-v28.patch>
Thanks for the continued effort on this patch. I finally got some time, and after revisiting v28 thoroughly, I think it’s much better now. I have just two more comments:
1
```
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
```
The DSA is created and pinned by the first backend, and every backend calls dsa_pin_mapping(), but I don’t see any unpin. Is that a problem? If unpinning is not needed, why are the unpin functions provided?
2
```
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ProcNumber *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i = listeners[j];
+ int32 pid;
+ QueuePosition pos;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
```
SignalBackends() now holds the dshash entry lock for a long time, while every other backend’s LISTEN/UNLISTEN needs to acquire that lock. So my suggestion is to copy the listeners array to local memory and then release the lock quickly.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-19 03:14 Joel Jacobson <[email protected]>
parent: Chao Li <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-11-19 03:14 UTC (permalink / raw)
To: Chao Li <[email protected]>; +Cc: pgsql-hackers
On Tue, Nov 18, 2025, at 09:15, Chao Li wrote:
> Thanks for the continued effort on this patch. I finally got some
> time, and after revisiting v28 thoroughly, I think it’s much better now.
Thanks for reviewing.
> Just got 2 more comments:
>
...
> The DSA is created and pinned by the first backend, and every backend
> calls dsa_pin_mapping(), but I don’t see any unpin. Is that a problem?
> If unpinning is not needed, why are the unpin functions provided?
No, this is not a problem.
The channel hash area is pinned "so that it will continue to exist even
if all backends detach from it", via dsa_pin(). Each backend's mapping
lives for session lifetime via dsa_pin_mapping(). We never need to unpin
either. This follows the same pattern as e.g.
logicalrep_launcher_attach_dshmem() in launcher.c.
dsm_unpin_mapping() was added in f7102b0 (2014), but I cannot find any
use of it in the sources; I think it's there mostly for API
completeness.
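As a toy model of the two kinds of pinning described above (standalone C, not the real dsa.c): the area's pin decides whether the area survives when nobody is attached, and the mapping's pin decides whether one backend's attachment survives the end of the scope that would otherwise own it. Every type and function name here is made up for illustration, and "scope end" is only an assumed stand-in for the resource owner releasing an unpinned mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a shared area; not the real dsa_area. */
typedef struct AreaSketch
{
    int         refcnt;         /* number of currently attached backends */
    bool        pinned;         /* analogue of having called dsa_pin() */
    bool        destroyed;
} AreaSketch;

/* Hypothetical model of one backend's mapping of that area. */
typedef struct MappingSketch
{
    AreaSketch *area;
    bool        attached;
    bool        pinned;         /* analogue of dsa_pin_mapping() */
} MappingSketch;

static void
sketch_attach(MappingSketch *m, AreaSketch *a)
{
    m->area = a;
    m->attached = true;
    m->pinned = false;
    a->refcnt++;
}

static void
sketch_detach(MappingSketch *m)
{
    if (!m->attached)
        return;
    m->attached = false;
    /* An unpinned area goes away once the last backend detaches. */
    if (--m->area->refcnt == 0 && !m->area->pinned)
        m->area->destroyed = true;
}

/* End of the scope that owns the mapping: unpinned mappings are dropped. */
static void
sketch_scope_end(MappingSketch *m)
{
    if (!m->pinned)
        sketch_detach(m);
}

/* End of the backend's session: every mapping goes away, pinned or not. */
static void
sketch_session_end(MappingSketch *m)
{
    sketch_detach(m);
}
```

In this model nothing ever needs to clear either flag: the pinned mapping ends with the session, and the pinned area is meant to outlive every session.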
> SignalBackends() now holds the dshash entry lock for a long time, while
> every other backend’s LISTEN/UNLISTEN needs to acquire that lock. So my
> suggestion is to copy the listeners array to local memory and then
> release the lock quickly.
Trying to optimize this further would mean increased code complexity,
since we would then have to worry and reason about stale reads.
I think we should consider this only if it turns out to be an actual
bottleneck in the design, and my guess is that it usually isn't,
because:
1. dshash_find(..., false) in SignalBackends takes a shared lock, so
multiple concurrent SignalBackends() calls can read simultaneously.
2. The loop in SignalBackends is already I/O free. The region where we
do dshash_find(..., false) lies within the region where we already hold
NotifyQueueLock exclusively, and we do the expensive signaling only
after all locks have been released.
3. We're already looping over numListeners while holding exclusive lock
on the channel in both Exec_ListenCommit and Exec_UnlistenCommit, so
what we're doing in SignalBackends isn't any worse.
4. We're not locking the entire channel hash, only the partition for one
channel at a time.
Just to be sure, I will do some LISTEN/UNLISTEN benchmarking to
investigate how the locking affects performance, and then we can
evaluate.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-20 20:26 Tom Lane <[email protected]>
parent: Arseniy Mukhin <[email protected]>
5 siblings, 2 replies; 120+ messages in thread
From: Tom Lane @ 2025-11-20 20:26 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
I took a brief look through the v28 patch, and I'm fairly distressed
at how much new logic has been stuffed into what's effectively a
critical section. It's totally not okay for AtCommit_Notify or
anything it calls to throw an error; if it does, something
like this will happen:
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index e1cf659..ece820d 100644
*** a/src/backend/commands/async.c
--- b/src/backend/commands/async.c
*************** Exec_ListenCommit(const char *channel)
*** 1148,1153 ****
--- 1148,1154 ----
* doesn't seem worth trying to guard against that, but maybe improve this
* later.
*/
+ elog(ERROR, "phony OOM in Exec_ListenCommit");
oldcontext = MemoryContextSwitchTo(TopMemoryContext);
listenChannels = lappend(listenChannels, pstrdup(channel));
MemoryContextSwitchTo(oldcontext);
regression=# begin;
BEGIN
regression=*# listen foo;
LISTEN
regression=*# notify foo;
NOTIFY
regression=*# commit;
ERROR: phony OOM in Exec_ListenCommit
WARNING: AbortTransaction while in COMMIT state
PANIC: cannot abort transaction 21558, it was already committed
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
(The NOTIFY in my example could be replaced by anything that
causes the transaction to obtain an XID.)
Now, we had skated past this issue in a few places already,
such as the above-quoted fragment in Exec_ListenCommit, arguing
that the probability of failure there was small enough to tolerate.
But I see no such arguments being made in this patch, and I doubt
I'd believe it anyway for things like DSA segment creation.
So I think there needs to be a serious effort made to move as
much as we possibly can of the potentially-risky stuff into
PreCommit_Notify. In particular I think we ought to create
the shared channel hash entry then, and even insert our PID
into it. We could expand the listenersArray entries to include
both a PID and a boolean "is it REALLY listening?", and then
during Exec_ListenCommit we'd only be required to find an
entry we already added and set its boolean, so there's no OOM
hazard. Possibly do something similar with the local
listenChannelsHash, so as to remove that possibility of OOM
failure as well.
(An alternative design could be to go ahead and insert our
PID during PreCommit_Notify, and just tolerate the small
risk of getting signaled when we didn't need to be. But
then we'd need some mechanism for cleaning out the bogus
entry during AtAbort_Notify.)
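As a rough standalone illustration of the flag-based design (a sketch
under simplifying assumptions, not code from any patch version): the
names ListenerSlot, ListenerArray, listen_prepare, listen_commit and
count_active are all made up here, and malloc-style allocation stands in
for DSA. The point is only that every step that can fail happens in the
prepare phase, while the commit phase just finds an existing slot and
sets its boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ListenerSlot
{
	int			pid;
	bool		active;			/* "is it REALLY listening?" */
} ListenerSlot;

typedef struct ListenerArray
{
	ListenerSlot *slots;
	int			nused;
	int			nalloc;
} ListenerArray;

/*
 * Pre-commit step: may allocate, so it may fail.  Returns false on OOM,
 * which the caller can still turn into an ordinary transaction abort.
 */
static bool
listen_prepare(ListenerArray *arr, int pid)
{
	for (int i = 0; i < arr->nused; i++)
		if (arr->slots[i].pid == pid)
			return true;		/* already present, active or pending */

	if (arr->nused == arr->nalloc)
	{
		int			newalloc = arr->nalloc > 0 ? arr->nalloc * 2 : 4;
		ListenerSlot *p = realloc(arr->slots,
								  (size_t) newalloc * sizeof(ListenerSlot));

		if (p == NULL)
			return false;
		arr->slots = p;
		arr->nalloc = newalloc;
	}

	arr->slots[arr->nused].pid = pid;
	arr->slots[arr->nused].active = false;
	arr->nused++;
	return true;
}

/* Post-commit step: no allocation, only find the slot and set its flag. */
static void
listen_commit(ListenerArray *arr, int pid)
{
	for (int i = 0; i < arr->nused; i++)
	{
		if (arr->slots[i].pid == pid)
		{
			arr->slots[i].active = true;
			return;
		}
	}
	assert(false);				/* listen_prepare must have run first */
}

/* A notifier would signal only the slots whose flag is set. */
static int
count_active(const ListenerArray *arr)
{
	int			n = 0;

	for (int i = 0; i < arr->nused; i++)
		if (arr->slots[i].active)
			n++;
	return n;
}
```

With this split, a failure in listen_prepare happens while the
transaction can still abort normally, and listen_commit has nothing left
to do that could plausibly throw.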
I'm not sure what I think about the new logic in SignalBackends
from this standpoint. Making it very-low-probability-of-error
definitely needs some consideration though.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-22 21:30 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
1 sibling, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-11-22 21:30 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Thu, Nov 20, 2025, at 21:26, Tom Lane wrote:
> I took a brief look through the v28 patch, and I'm fairly distressed
> at how much new logic has been stuffed into what's effectively a
> critical section. It's totally not okay for AtCommit_Notify or
> anything it calls to throw an error
Right, I agree. Thanks for the guidance.
> So I think there needs to be a serious effort made to move as
> much as we possibly can of the potentially-risky stuff into
> PreCommit_Notify. In particular I think we ought to create
> the shared channel hash entry then, and even insert our PID
> into it. We could expand the listenersArray entries to include
> both a PID and a boolean "is it REALLY listening?", and then
> during Exec_ListenCommit we'd only be required to find an
> entry we already added and set its boolean, so there's no OOM
> hazard. Possibly do something similar with the local
> listenChannelsHash, so as to remove that possibility of OOM
> failure as well.
Thanks for the idea, I like this approach. I've expanded the
listenersArray as suggested, and moved all the risky stuff from
Exec_ListenCommit to PreCommit_Notify.
> (An alternative design could be to go ahead and insert our
> PID during PreCommit_Notify, and just tolerate the small
> risk of getting signaled when we didn't need to be. But
> then we'd need some mechanism for cleaning out the bogus
> entry during AtAbort_Notify.)
We seem to need a cleanup mechanism with the boolean field design as
well, since a channel could end up being added only to
listenChannelsHash, but not to channelHash, which would trick
IsListeningOn() into falsely thinking we're listening on that channel
when we're not. This could happen if we successfully add the channel to
listenChannelsHash, but then hit OOM when trying to add it to
channelHash.
AtAbort_Notify now handles such a half-state by reconciling all channels
that had LISTEN_LISTEN actions, using channelHash as the single source
of truth: it removes the channel from both listenChannelsHash and
channelHash, unless the active field is true (which means we were
already listening on the channel due to a previous transaction).
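Roughly, the per-channel reconciliation works like the following
standalone sketch. This is a simplification for illustration, not the
patch code: ListenerSlot, ChannelListeners and listen_abort are
hypothetical names, and a fixed-size array stands in for the DSA-backed
listenersArray. The return value tells the caller whether the local
hash entry should be kept.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 8

typedef struct ListenerSlot
{
	int			pid;
	bool		active;			/* true once a LISTEN has committed */
} ListenerSlot;

/* Hypothetical stand-in for one channel's shared listener array. */
typedef struct ChannelListeners
{
	int			nused;
	ListenerSlot slots[MAX_SLOTS];
} ChannelListeners;

/*
 * Abort-time cleanup for one channel that had a pending LISTEN.
 *
 * Returns true if the backend is still listening afterwards (its shared
 * entry was already active from an earlier transaction), so the caller
 * keeps the local entry; returns false if the local entry must go too.
 */
static bool
listen_abort(ChannelListeners *ch, int pid)
{
	assert(ch->nused >= 0 && ch->nused <= MAX_SLOTS);

	for (int i = 0; i < ch->nused; i++)
	{
		if (ch->slots[i].pid != pid)
			continue;

		if (ch->slots[i].active)
			return true;		/* committed earlier: keep it */

		/* Uncommitted entry: remove it by moving the last slot here. */
		ch->slots[i] = ch->slots[ch->nused - 1];
		ch->nused--;
		return false;
	}

	/* Never made it into the shared table (e.g. OOM before insert). */
	return false;
}
```

The "not found" case at the end is the half-state described above: the
channel exists only locally, so the caller drops the local entry.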
I've tested the cleanup logic by adding an elog(ERROR) that triggered
right after the listenChannelsHash insert, and in another test one that
triggered right after the channelHash insert; both were cleaned up
correctly. I haven't created an in-tree test for this though; would we
want that?
> I'm not sure what I think about the new logic in SignalBackends
> from this standpoint. Making it very-low-probability-of-error
> definitely needs some consideration though.
The initChannelHash call in SignalBackends is now gone, moved to
PreCommit_Notify (called if there are any pendingNotifies).
I also took the liberty of fixing the XXX comment, by lazily
preallocating the signals arrays during PreCommit_Notify. It felt too
inconsistent to just leave that unfixed, but maybe it should be a
separate commit?
I wonder how risky the remaining new logic in SignalBackends is. For
instance, I looked at dshash_find(..., false), and noted that it calls
LWLockAcquire, which in turn could elog ERROR if too many LWLocks are
already held, but in master we're already calling LWLockAcquire in
SignalBackends, so it should be fine I guess?
Apart from that, I don't see any new logic in SignalBackends that could
potentially be risky.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v29.patch (9.3K, 2-0001-optimize_listen_notify-v29.patch)
From 81ab1ad3f6fe8a11a8877838303c88bc33c872fd Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v29.patch (58.3K, 3-0002-optimize_listen_notify-v29.patch)
From c8ae6295084aa2e95333ac7449b0984e6d819e44 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 15 Nov 2025 22:18:50 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of ListenerEntry structures (containing ProcNumber and active
flag) representing the backends listening on each channel. This allows
the sender to target only those backends actually listening on the
channels for which it has queued notifications.
Critical section safety
-----------------------
To avoid ERROR→PANIC after transaction commit, all risky operations
(DSA allocations, dshash operations, memory allocations) are moved to
PreCommit_Notify where failures can still safely abort.
Each listener entry includes an 'active' flag. During PreCommit, LISTEN
operations insert entries with active=false to pre-allocate all required
space. After commit to clog, AtCommit_Notify simply flips active flags
to true—a boolean assignment that has acceptably low failure risk.
Similarly, signal arrays (notifySignalPids/notifySignalProcs) use a
hybrid allocation strategy: allocated once in TopMemoryContext during
the first NOTIFY (in PreCommit where OOM can still abort), then reused
permanently across all transactions.
At commit time:
* Exec_ListenPreCommit performs all risky allocations with active=false
* Exec_ListenCommit flips active flags to true (post-commit)
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those with active=true.
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
Idle backends that are stationary at a position before the old queue
head are signaled, since they might be interested in the notifications
in between their current position and the old queue head.
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannels list
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannels is non-empty.
* AtAbort_Notify cleanly handles both local and shared hash cleanup,
removing inactive (uncommitted) entries while preserving active
(committed) entries from previous transactions.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 1084 +++++++++++++----
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 3 +
4 files changed, 876 insertions(+), 213 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index e1cf659485a..2a82f4a0130 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,24 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, or within the range
+ * written, avoiding unnecessary wakeups for idle listeners that have
+ * nothing to read. Backends that cannot be direct advanced are signaled
+ * if they are stuck behind the old queue head, or advancing to a position
+ * before the new queue head, since otherwise notifications could be delayed.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +147,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +175,43 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ListenerEntry structs representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Individual listener entry in the channel's listeners array.
+ *
+ * We store both the ProcNumber and an active flag. During PreCommit,
+ * we insert entries with active=false to pre-allocate space and avoid
+ * OOM failures after transaction commit. Then in AtCommit, we just set
+ * the active flag to true, which has acceptably low risk of failure.
+ */
+typedef struct ListenerEntry
+{
+ ProcNumber procno; /* backend's ProcNumber */
+ bool active; /* true if actually listening */
+} ListenerEntry;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ListenerEntry array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +274,14 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
- * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +299,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +316,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * in the process of doing that themselves. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +345,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +363,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +378,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +456,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +467,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +489,26 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
+/*
+ * Arrays for SignalBackends.
+ */
+static int32 *notifySignalPids = NULL;
+static ProcNumber *notifySignalProcs = NULL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,11 +519,10 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
-static void Exec_ListenPreCommit(void);
+static void Exec_ListenPreCommit(const char *channel);
static void Exec_ListenCommit(const char *channel);
static void Exec_UnlistenCommit(const char *channel);
static void Exec_UnlistenAllCommit(void);
@@ -456,16 +546,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +560,126 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
+/*
+ * initSignalArrays
+ * Lazy initialization of the signal arrays.
+ */
+static void
+initSignalArrays(void)
+{
+ MemoryContext oldcontext;
+
+ if (notifySignalProcs != NULL)
+ return;
+
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (notifySignalPids == NULL)
+ notifySignalPids = (int32 *) palloc(MaxBackends * sizeof(int32));
+ notifySignalProcs = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +723,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +864,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,7 +891,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -782,30 +991,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -877,7 +1105,7 @@ PreCommit_Notify(void)
switch (actrec->action)
{
case LISTEN_LISTEN:
- Exec_ListenPreCommit();
+ Exec_ListenPreCommit(actrec->channel);
break;
case LISTEN_UNLISTEN:
/* there is no Exec_UnlistenPreCommit() */
@@ -893,6 +1121,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for a small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -902,6 +1160,14 @@ PreCommit_Notify(void)
*/
(void) GetCurrentTransactionId();
+ /*
+ * We will be calling SignalBackends() at AtCommit_Notify time, so
+ * make sure its auxiliary data structures exist now, where an ERROR
+ * will still abort the transaction cleanly.
+ */
+ initSignalArrays();
+ initChannelHash();
+
/*
* Serialize writers by acquiring a special lock that we hold till
* after commit. This ensures that queue entries appear in commit
@@ -921,6 +1187,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1220,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1246,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1001,7 +1291,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1037,147 +1328,270 @@ AtCommit_Notify(void)
* This function must make sure we are ready to catch any incoming messages.
*/
static void
-Exec_ListenPreCommit(void)
+Exec_ListenPreCommit(const char *channel)
{
- QueuePosition head;
- QueuePosition max;
- ProcNumber prevListener;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ListenerEntry *listeners;
/*
- * Nothing to do if we are already listening to something, nor if we
- * already ran this routine in this transaction.
+ * If we are not yet registered as a listener, this is our first LISTEN,
+ * so register now.
*/
- if (amRegisteredListener)
- return;
-
- if (Trace_notify)
- elog(DEBUG1, "Exec_ListenPreCommit(%d)", MyProcPid);
-
- /*
- * Before registering, make sure we will unlisten before dying. (Note:
- * this action does not get undone if we abort later.)
- */
- if (!unlistenExitRegistered)
+ if (!amRegisteredListener)
{
- before_shmem_exit(Async_UnlistenOnExit, 0);
- unlistenExitRegistered = true;
- }
+ QueuePosition head;
+ QueuePosition max;
+ ProcNumber prevListener;
- /*
- * This is our first LISTEN, so establish our pointer.
- *
- * We set our pointer to the global tail pointer and then move it forward
- * over already-committed notifications. This ensures we cannot miss any
- * not-yet-committed notifications. We might get a few more but that
- * doesn't hurt.
- *
- * In some scenarios there might be a lot of committed notifications that
- * have not yet been pruned away (because some backend is being lazy about
- * reading them). To reduce our startup time, we can look at other
- * backends and adopt the maximum "pos" pointer of any backend that's in
- * our database; any notifications it's already advanced over are surely
- * committed and need not be re-examined by us. (We must consider only
- * backends connected to our DB, because others will not have bothered to
- * check committed-ness of notifications in our DB.)
- *
- * We need exclusive lock here so we can look at other backends' entries
- * and manipulate the list links.
- */
- LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- head = QUEUE_HEAD;
- max = QUEUE_TAIL;
- prevListener = INVALID_PROC_NUMBER;
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
- {
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
- max = QUEUE_POS_MAX(max, QUEUE_BACKEND_POS(i));
- /* Also find last listening backend before this one */
- if (i < MyProcNumber)
- prevListener = i;
- }
- QUEUE_BACKEND_POS(MyProcNumber) = max;
- QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
- QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
- /* Insert backend into list of listeners at correct position */
- if (prevListener != INVALID_PROC_NUMBER)
- {
- QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_NEXT_LISTENER(prevListener);
- QUEUE_NEXT_LISTENER(prevListener) = MyProcNumber;
- }
- else
- {
- QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
- QUEUE_FIRST_LISTENER = MyProcNumber;
- }
- LWLockRelease(NotifyQueueLock);
+ if (Trace_notify)
+ elog(DEBUG1, "Exec_ListenPreCommit(%s,%d)", channel, MyProcPid);
- /* Now we are listed in the global array, so remember we're listening */
- amRegisteredListener = true;
+ /*
+ * Before registering, make sure we will unlisten before dying. (Note:
+ * this action does not get undone if we abort later.)
+ */
+ if (!unlistenExitRegistered)
+ {
+ before_shmem_exit(Async_UnlistenOnExit, 0);
+ unlistenExitRegistered = true;
+ }
- /*
- * Try to move our pointer forward as far as possible. This will skip
- * over already-committed notifications, which we want to do because they
- * might be quite stale. Note that we are not yet listening on anything,
- * so we won't deliver such notifications to our frontend. Also, although
- * our transaction might have executed NOTIFY, those message(s) aren't
- * queued yet so we won't skip them here.
- */
- if (!QUEUE_POS_EQUAL(max, head))
- asyncQueueReadAllNotifications();
-}
+ /*
+ * This is our first LISTEN, so establish our pointer.
+ *
+ * We set our pointer to the global tail pointer and then move it
+ * forward over already-committed notifications. This ensures we
+ * cannot miss any not-yet-committed notifications. We might get a
+ * few more but that doesn't hurt.
+ *
+ * In some scenarios there might be a lot of committed notifications
+ * that have not yet been pruned away (because some backend is being
+ * lazy about reading them). To reduce our startup time, we can look
+ * at other backends and adopt the maximum "pos" pointer of any
+ * backend that's in our database; any notifications it's already
+ * advanced over are surely committed and need not be re-examined by
+ * us. (We must consider only backends connected to our DB, because
+ * others will not have bothered to check committed-ness of
+ * notifications in our DB.)
+ *
+ * We need exclusive lock here so we can look at other backends'
+ * entries and manipulate the list links.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ head = QUEUE_HEAD;
+ max = QUEUE_TAIL;
+ prevListener = INVALID_PROC_NUMBER;
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ max = QUEUE_POS_MAX(max, QUEUE_BACKEND_POS(i));
+ /* Also find last listening backend before this one */
+ if (i < MyProcNumber)
+ prevListener = i;
+ }
+ QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
+ QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
+ /* Insert backend into list of listeners at correct position */
+ if (prevListener != INVALID_PROC_NUMBER)
+ {
+ QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_NEXT_LISTENER(prevListener);
+ QUEUE_NEXT_LISTENER(prevListener) = MyProcNumber;
+ }
+ else
+ {
+ QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
+ QUEUE_FIRST_LISTENER = MyProcNumber;
+ }
+ LWLockRelease(NotifyQueueLock);
-/*
- * Exec_ListenCommit --- subroutine for AtCommit_Notify
- *
- * Add the channel to the list of channels we are listening on.
- */
-static void
-Exec_ListenCommit(const char *channel)
-{
- MemoryContext oldcontext;
+ /* Now we are listed in the global array, so remember we're listening */
+ amRegisteredListener = true;
+
+ /*
+ * Try to move our pointer forward as far as possible. This will skip
+ * over already-committed notifications, which we want to do because
+ * they might be quite stale. Note that we are not yet listening on
+ * anything, so we won't deliver such notifications to our frontend.
+ * Also, although our transaction might have executed NOTIFY, those
+ * message(s) aren't queued yet so we won't skip them here.
+ */
+ if (!QUEUE_POS_EQUAL(max, head))
+ asyncQueueReadAllNotifications();
+ }
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
- */
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ * Add the channel to listenChannelsHash. This can OOM, but we're still
+ * in PreCommit so the transaction can abort safely.
+ */
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /*
+ * Now update the shared channelHash. We insert an entry with
+ * active=false, which will be flipped to true in Exec_ListenCommit.
+ */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Find or create the channel entry. For new entries, we initialize
+ * listenersArray to InvalidDsaPointer as a marker.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * new_size);
+ ListenerEntry *new_listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ListenerEntry) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners].procno = MyProcNumber;
+ listeners[entry->numListeners].active = false;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
+}
+
+/*
+ * Exec_ListenCommit --- subroutine for AtCommit_Notify
+ *
+ * Activate the channel entry that was pre-allocated in Exec_ListenPreCommit.
+ * This is called after commit to clog, so it's important to have very low
+ * probability of failure. By design, all we do here is set the active
+ * flag.
+ */
+static void
+Exec_ListenCommit(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
+ int i;
+
+ if (Trace_notify)
+ elog(DEBUG1, "Exec_ListenCommit(%s,%d)", channel, MyProcPid);
+
+ /*
+ * The entry has been created in Exec_ListenPreCommit. If we get
+ * here, channelHash and the entry must exist.
+ */
+ Assert(channelHash != NULL);
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ Assert(entry != NULL);
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procno == MyProcNumber)
+ {
+ listeners[i].active = true;
+ dshash_release_lock(channelHash, entry);
+ return;
+ }
+ }
+
+ /* If the entry is not found, it's a bug. */
+ Assert(false);
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i].procno == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1192,34 +1606,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ListenerEntry *listeners;
+ int i;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procno == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1677,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1689,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +2014,21 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database that are
+ * listening on channels with pending notifies, since only those
+ * backends are interested in the notifies we send.
+ *
+ * Backends that are not interested in our notifies, that are known
+ * to still be positioned at the old queue head, or anywhere in the
+ * queue region we just wrote, can be safely advanced directly to the
+ * new head, since that region is known to contain only our own
+ * notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
+ *
+ * Backends that are not interested in our notifies, that are advancing
+ * to a target position before the new queue head, or that are not
+ * advancing and are stationary at a position before the old queue head
+ * need to be signaled, since notifications could otherwise be delayed.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1580,60 +2038,118 @@ asyncQueueFillWarning(void)
static void
SignalBackends(void)
{
- int32 *pids;
- ProcNumber *procnos;
int count;
+ ListCell *lc;
+ Assert(channelHash != NULL || pendingNotifyChannels == NIL);
+ Assert(notifySignalPids != NULL);
+ Assert(notifySignalProcs != NULL);
/*
* Identify backends that we need to signal. We don't want to send
* signals while holding the NotifyQueueLock, so this loop just builds a
* list of target PIDs.
- *
- * XXX in principle these pallocs could fail, which would be bad. Maybe
- * preallocate the arrays? They're not that large, though.
*/
- pids = (int32 *) palloc(MaxBackends * sizeof(int32));
- procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ListenerEntry *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i;
+ int32 pid;
+ QueuePosition pos;
+
/*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * Only signal backends that have active=true. Backends with
+ * active=false have done LISTEN in PreCommit but not yet
+ * committed, so they're not really listening yet.
*/
+ if (!listeners[j].active)
+ continue;
+
+ i = listeners[j].procno;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ notifySignalPids[count] = pid;
+ notifySignalProcs[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ notifySignalPids[count] = pid;
+ notifySignalProcs[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
/* Now send signals */
for (int i = 0; i < count; i++)
{
- int32 pid = pids[i];
+ int32 pid = notifySignalPids[i];
/*
* If we are signaling our own process, no need to involve the kernel;
@@ -1651,12 +2167,10 @@ SignalBackends(void)
* NotifyQueueLock; which is unlikely but certainly possible. So we
* just log a low-level debug message if it happens.
*/
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
+ if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, notifySignalProcs[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
}
- pfree(pids);
- pfree(procnos);
}
/*
@@ -1673,12 +2187,97 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
- /* And clean up */
+ /*
+ * Remove any channels we added during Exec_ListenPreCommit. We need to
+ * clean up both the local listenChannelsHash and any inactive entries in
+ * the shared channelHash to avoid accumulating stale data.
+ */
+ if (pendingActions != NULL)
+ {
+ ListCell *p;
+
+ foreach(p, pendingActions->actions)
+ {
+ ListenAction *actrec = (ListenAction *) lfirst(p);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
+ bool removeFromLocal;
+ bool found;
+ int i;
+
+ if (actrec->action != LISTEN_LISTEN)
+ continue;
+
+ /*
+ * For each LISTEN action, determine if we should clean up the
+ * local and/or shared hash entries. If we have an active=true
+ * entry in the shared hash, we were already listening from a
+ * previous transaction, so leave everything alone. Otherwise,
+ * clean up what this transaction added.
+ */
+ removeFromLocal = true;
+ found = false;
+
+ if (channelHash != NULL)
+ {
+ ChannelHashPrepareKey(&key, MyDatabaseId, actrec->channel);
+ entry = dshash_find(channelHash, &key, true);
+
+ if (entry != NULL)
+ {
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procno == MyProcNumber)
+ {
+ found = true;
+
+ if (listeners[i].active)
+ {
+ /* Already committed - leave both hashes alone */
+ removeFromLocal = false;
+ }
+ else
+ {
+ /* Inactive - remove from shared hash */
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ dshash_release_lock(channelHash, entry);
+ }
+ break;
+ }
+ }
+
+ if (!found)
+ dshash_release_lock(channelHash, entry);
+ }
+ }
+
+ /* Remove from local hash if appropriate */
+ if (removeFromLocal && listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, actrec->channel,
+ HASH_REMOVE, NULL);
+ }
+ }
+
ClearPendingActionsAndNotifies();
}
@@ -1854,20 +2453,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2562,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2055,7 +2665,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2310,7 +2920,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2414,13 +3024,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2433,10 +3045,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2444,22 +3068,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2497,7 +3141,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2509,6 +3153,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2519,3 +3164,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..7c2cf960093 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..4236965e72a 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -101,6 +101,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c751c25a04d..3d371c2808d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
@@ -1564,6 +1566,7 @@ ListDictionary
ListParsedLex
ListenAction
ListenActionKind
+ListenerEntry
ListenStmt
LoInfo
LoadStmt
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-23 15:49 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Joel Jacobson @ 2025-11-23 15:49 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Sat, Nov 22, 2025, at 22:30, Joel Jacobson wrote:
> On Thu, Nov 20, 2025, at 21:26, Tom Lane wrote:
>> I took a brief look through the v28 patch, and I'm fairly distressed
>> at how much new logic has been stuffed into what's effectively a
>> critical section. It's totally not okay for AtCommit_Notify or
>> anything it calls to throw an error
>
> Right, I agree. Thanks for guidance.
>
>> So I think there needs to be a serious effort made to move as
>> much as we possibly can of the potentially-risky stuff into
>> PreCommit_Notify. In particular I think we ought to create
>> the shared channel hash entry then, and even insert our PID
>> into it. We could expand the listenersArray entries to include
>> both a PID and a boolean "is it REALLY listening?", and then
>> during Exec_ListenCommit we'd only be required to find an
>> entry we already added and set its boolean, so there's no OOM
>> hazard. Possibly do something similar with the local
>> listenChannelsHash, so as to remove that possibility of OOM
>> failure as well.
>
> Thanks for the idea, I like this approach. I've expanded the
> listenersArray like suggested, and moved all risky stuff from
> Exec_ListenCommit to PreCommit_Notify.
>
>> (An alternative design could be to go ahead and insert our
>> PID during PreCommit_Notify, and just tolerate the small
>> risk of getting signaled when we didn't need to be. But
>> then we'd need some mechanism for cleaning out the bogus
>> entry during AtAbort_Notify.)
>
> We seem to need a cleanup mechanism with the boolean field design too,
> since a channel could end up being added only to listenChannelsHash, but
> not channelHash, which would trick IsListeningOn() into falsely thinking
> we're listening on that channel when we're not. This could happen if
> adding the channel to listenChannelsHash succeeds, but we then hit OOM
> when trying to add it to channelHash.
>
> AtAbort_Notify now handles such half-state, by reconciling all channels
> that had LISTEN_LISTEN actions, using the channelHash as the single
> source of truth, removing channels from both listenChannelsHash and
> channelHash, unless the active field is true (which means we were
> already listening to the channel due to a previous transaction).
>
> I've tested the cleanup logic by temporarily adding an elog(ERROR) that
> fires right after the listenChannelsHash insert, and in another test one
> that fires right after the channelHash insert; both cases were cleaned up
> correctly. I haven't added an in-tree test for this though; would we
> want that?
>
>> I'm not sure what I think about the new logic in SignalBackends
>> from this standpoint. Making it very-low-probability-of-error
>> definitely needs some consideration though.
>
> The initChannelHash call in SignalBackends is now gone, moved to
> PreCommit_Notify (called if there are any pendingNotifies).
>
> I also took the liberty of fixing the XXX comment, to lazily preallocate
> the signals arrays during PreCommit_Notify. It felt too inconsistent to
> just leave that unfixed, but maybe should be a separate commit?
I've extracted the preallocation of signals arrays into a separate patch:
https://commitfest.postgresql.org/patch/6248/
> I wonder how risky the remaining new logic in SignalBackends is. For
> instance, I looked at dshash_find(..., false), and note it calls
> LWLockAcquire which in turn could elog ERROR if num locks is exceeded,
> but in master we're already calling LWLockAcquire in SignalBackends, so
> should be fine I guess?
>
> Apart from that, I don't see any new logic in SignalBackends, that could
> potentially be risky.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-23 20:43 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
From: Joel Jacobson @ 2025-11-23 20:43 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Sun, Nov 23, 2025, at 16:49, Joel Jacobson wrote:
> On Sat, Nov 22, 2025, at 22:30, Joel Jacobson wrote:
>> On Thu, Nov 20, 2025, at 21:26, Tom Lane wrote:
>>> I took a brief look through the v28 patch, and I'm fairly distressed
>>> at how much new logic has been stuffed into what's effectively a
>>> critical section. It's totally not okay for AtCommit_Notify or
>>> anything it calls to throw an error
>>
>> Right, I agree. Thanks for guidance.
>>
>>> So I think there needs to be a serious effort made to move as
>>> much as we possibly can of the potentially-risky stuff into
>>> PreCommit_Notify. In particular I think we ought to create
>>> the shared channel hash entry then, and even insert our PID
>>> into it. We could expand the listenersArray entries to include
>>> both a PID and a boolean "is it REALLY listening?", and then
>>> during Exec_ListenCommit we'd only be required to find an
>>> entry we already added and set its boolean, so there's no OOM
>>> hazard. Possibly do something similar with the local
>>> listenChannelsHash, so as to remove that possibility of OOM
>>> failure as well.
>>
>> Thanks for the idea, I like this approach. I've expanded the
>> listenersArray like suggested, and moved all risky stuff from
>> Exec_ListenCommit to PreCommit_Notify.
>>
>>> (An alternative design could be to go ahead and insert our
>>> PID during PreCommit_Notify, and just tolerate the small
>>> risk of getting signaled when we didn't need to be. But
>>> then we'd need some mechanism for cleaning out the bogus
>>> entry during AtAbort_Notify.)
>>
>> We seem to need a cleanup mechanism with the boolean field design too,
>> since a channel could end up being added only to listenChannelsHash, but
>> not channelHash, which would trick IsListeningOn() into falsely thinking
>> we're listening on that channel when we're not. This could happen if
>> adding the channel to listenChannelsHash succeeds, but we then hit OOM
>> when trying to add it to channelHash.
>>
>> AtAbort_Notify now handles such half-state, by reconciling all channels
>> that had LISTEN_LISTEN actions, using the channelHash as the single
>> source of truth, removing channels from both listenChannelsHash and
>> channelHash, unless the active field is true (which means we were
>> already listening to the channel due to a previous transaction).
>>
>> I've tested the cleanup logic by temporarily adding an elog(ERROR) that
>> fires right after the listenChannelsHash insert, and in another test one
>> that fires right after the channelHash insert; both cases were cleaned up
>> correctly. I haven't added an in-tree test for this though; would we
>> want that?
>>
>>> I'm not sure what I think about the new logic in SignalBackends
>>> from this standpoint. Making it very-low-probability-of-error
>>> definitely needs some consideration though.
>>
>> The initChannelHash call in SignalBackends is now gone, moved to
>> PreCommit_Notify (called if there are any pendingNotifies).
>>
>> I also took the liberty of fixing the XXX comment, to lazily preallocate
>> the signals arrays during PreCommit_Notify. It felt too inconsistent to
>> just leave that unfixed, but maybe should be a separate commit?
>
> I've extracted the preallocation of signals arrays into a separate patch:
> https://commitfest.postgresql.org/patch/6248/
Attached is a new version of the optimization patch, without the
signal-array preallocation part (which is now submitted as a separate
patch).
>> I wonder how risky the remaining new logic in SignalBackends is. For
>> instance, I looked at dshash_find(..., false), and note it calls
>> LWLockAcquire which in turn could elog ERROR if num locks is exceeded,
>> but in master we're already calling LWLockAcquire in SignalBackends, so
>> should be fine I guess?
>>
>> Apart from that, I don't see any new logic in SignalBackends, that could
>> potentially be risky.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v30.patch (9.3K, 2-0001-optimize_listen_notify-v30.patch)
From 81ab1ad3f6fe8a11a8877838303c88bc33c872fd Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v30.patch (56.6K, 3-0002-optimize_listen_notify-v30.patch)
From ca52e21a54f6219c2d3ab539cf38eb3a7da311ec Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 15 Nov 2025 22:18:50 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily created dynamic shared hash table (dshash),
backed by a dynamic shared memory area (DSA), that maps (dboid, channel) to
arrays of ListenerEntry structures (containing ProcNumber and active
flag) representing the backends listening on each channel. This allows
the sender to target only those backends actually listening on the
channels for which it has queued notifications.
Critical section safety
-----------------------
To avoid ERROR→PANIC after transaction commit, all risky operations
(DSA allocations, dshash operations, memory allocations) are moved to
PreCommit_Notify where failures can still safely abort.
Each listener entry includes an 'active' flag. During PreCommit, LISTEN
operations insert entries with active=false to pre-allocate all required
space. After commit to clog, AtCommit_Notify simply flips active flags
to true—a boolean assignment that has acceptably low failure risk.
Similarly, signal arrays (notifySignalPids/notifySignalProcs) use a
hybrid allocation strategy: allocated once in TopMemoryContext during
the first NOTIFY (in PreCommit where OOM can still abort), then reused
permanently across all transactions.
At commit time:
* Exec_ListenPreCommit performs all risky allocations with active=false
* Exec_ListenCommit flips active flags to true (post-commit)
* AtCommit_Notify updates the shared channelHash to reflect any LISTEN
or UNLISTEN actions performed in the transaction.
* SignalBackends consults this hash to find the backends that are
listening on the channels being notified in the current database, and
signals only those with active=true.
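
The prepare/commit/abort split described above can be illustrated with a
small standalone sketch. It is not code from the patch: ListenerSketch,
ChannelSketch and the sketch_* functions are hypothetical names, and plain
realloc() stands in for the DSA allocation. The only point is that every
step that can fail happens before commit, while the post-commit step only
flips a flag on an entry that already exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the patch's ListenerEntry. */
typedef struct
{
    int  procno;   /* backend slot number */
    bool active;   /* false until the listening transaction has committed */
} ListenerSketch;

/* Hypothetical per-channel listener array (the patch keeps this in DSA). */
typedef struct
{
    ListenerSketch *items;
    size_t          count;
    size_t          capacity;
} ChannelSketch;

/* Pre-commit: may allocate and therefore may fail; returns false on OOM. */
bool
sketch_prepare_listen(ChannelSketch *ch, int procno)
{
    assert(ch != NULL);

    for (size_t i = 0; i < ch->count; i++)
        if (ch->items[i].procno == procno)
            return true;        /* already present; keep its current flag */

    if (ch->count == ch->capacity)
    {
        size_t          newcap = ch->capacity ? ch->capacity * 2 : 4;
        ListenerSketch *p = realloc(ch->items, newcap * sizeof(*p));

        if (p == NULL)
            return false;       /* caller can still abort the transaction */
        ch->items = p;
        ch->capacity = newcap;
    }
    ch->items[ch->count].procno = procno;
    ch->items[ch->count].active = false;
    ch->count++;
    return true;
}

/* Post-commit: no allocation, only a flag flip on an existing entry. */
void
sketch_commit_listen(ChannelSketch *ch, int procno)
{
    for (size_t i = 0; i < ch->count; i++)
        if (ch->items[i].procno == procno)
            ch->items[i].active = true;
}

/* Abort: drop our entry only if it never became active. */
void
sketch_abort_listen(ChannelSketch *ch, int procno)
{
    for (size_t i = 0; i < ch->count; i++)
    {
        if (ch->items[i].procno != procno)
            continue;
        if (!ch->items[i].active)
        {
            ch->count--;
            if (i < ch->count)
                memmove(&ch->items[i], &ch->items[i + 1],
                        sizeof(ListenerSketch) * (ch->count - i));
        }
        break;
    }
}
```

An entry left active by an earlier transaction survives a later abort,
which mirrors the "leave both hashes alone" branch in AtAbort_Notify.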
Each backend's entry in AsyncQueueControl now includes a wakeupPending
flag to prevent duplicate signals while a previous wakeup is still being
processed.
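
As a rough illustration of the wakeupPending idea, the sketch below
models one backend's flag. BackendWakeSketch and both functions are
hypothetical names, the signal itself is reduced to a counter, and the
locking is left out (in the patch the flag is read and written while
holding NotifyQueueLock).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-backend state. */
typedef struct
{
    bool wakeup_pending;   /* a signal was sent and not yet acted upon */
    int  signals_sent;     /* counter, for illustration only */
} BackendWakeSketch;

/*
 * Sender side: signal only if no wakeup is already outstanding.
 * Returns true if a signal was (notionally) sent.
 */
bool
sketch_request_wakeup(BackendWakeSketch *b)
{
    assert(b != NULL);

    if (b->wakeup_pending)
        return false;           /* previous wakeup not yet processed */
    b->wakeup_pending = true;
    b->signals_sent++;
    return true;
}

/* Listener side: clear the flag when it starts reading the queue. */
void
sketch_begin_read(BackendWakeSketch *b)
{
    assert(b != NULL);
    b->wakeup_pending = false;
}
```

Two back-to-back notifiers thus produce one signal rather than two, as
long as the listener has not yet started reading.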
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
Idle backends that are stationary at a position before the old queue
head are signaled, since they might be interested in the notifications
in between their current position and the old queue head.
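
The cases above can be summarized in a small decision function. This is
a simplified model, not the patch's SignalBackends: the names are
hypothetical, queue positions are plain integers rather than
(page, offset) pairs, and it only covers backends that are not
listening on any channel notified by the committing transaction.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    SKETCH_NO_ACTION,
    SKETCH_SIGNAL,
    SKETCH_DIRECT_ADVANCE
} SketchAction;

/* Hypothetical, simplified view of one backend's queue state. */
typedef struct
{
    long pos;            /* position this backend has read up to */
    bool is_advancing;   /* currently reading the queue? */
    long advancing_pos;  /* target of that read, valid if is_advancing */
} BackendPosSketch;

/*
 * Decide how to treat a backend that is NOT interested in the entries
 * written in [old_head, new_head).
 */
SketchAction
sketch_decide(const BackendPosSketch *b, long old_head, long new_head)
{
    assert(old_head <= new_head);

    if (b->is_advancing)
    {
        /* Cannot move it ourselves; signal only if it would end up behind. */
        return (b->advancing_pos < new_head) ? SKETCH_SIGNAL
                                             : SKETCH_NO_ACTION;
    }
    if (b->pos < old_head)
        return SKETCH_SIGNAL;           /* may have unread entries of interest */
    if (b->pos < new_head)
        return SKETCH_DIRECT_ADVANCE;   /* sitting at old_head: skip our entries */
    return SKETCH_NO_ACTION;            /* already at or past new_head */
}
```

The direct-advance case relies on the serialization argument given
earlier: since only our own entries lie between the two heads, skipping
them cannot hide a notification the backend cares about.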
Other notes
-----------
* Maintains dual data structures: a shared channelHash for determining
which backends to signal, and a local per-backend listenChannelsHash
for fast lock-free lookups during notification processing. This avoids
contention on the shared hash during the high-frequency IsListeningOn
checks that occur for every notification read from the queue.
* Backends remain registered in the global listener list as long as
listenChannelsHash is non-empty.
* AtAbort_Notify cleanly handles both local and shared hash cleanup,
removing inactive (uncommitted) entries while preserving active
(committed) entries from previous transactions.
* Adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
* No user-visible behavioral changes; this is an internal optimization
only.
---
src/backend/commands/async.c | 1042 +++++++++++++----
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 3 +
4 files changed, 845 insertions(+), 202 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index e1cf659485a..e76b2c55ca2 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on first use (the first LISTEN or NOTIFY) and grows
+ * dynamically as needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -68,16 +70,24 @@
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * make any actual updates to the local listen state (listenChannelsHash) and
+ * shared channel hash table (channelHash). Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, or within the range
+ * written, avoiding unnecessary wakeups for idle listeners that have
+ * nothing to read. Backends that cannot be directly advanced are signaled
+ * if they are stuck behind the old queue head, or advancing to a position
+ * before the new queue head, since otherwise notifications could be delayed.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +147,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +175,43 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ListenerEntry structs representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+/*
+ * Individual listener entry in the channel's listeners array.
+ *
+ * We store both the ProcNumber and an active flag. During PreCommit,
+ * we insert entries with active=false to pre-allocate space and avoid
+ * OOM failures after transaction commit. Then in AtCommit, we just set
+ * the active flag to true, which has acceptably low risk of failure.
+ */
+typedef struct ListenerEntry
+{
+ ProcNumber procno; /* backend's ProcNumber */
+ bool active; /* true if actually listening */
+} ListenerEntry;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ListenerEntry array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +274,14 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
- * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +299,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +316,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * advancing their own positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +345,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +363,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +378,16 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
+ * listenChannelsHash identifies the channels we are actually listening to
+ * (ie, have committed a LISTEN on). It is a hash table of channel names,
* allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
* all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * we don't actually change listenChannelsHash until we reach transaction commit.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +456,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +467,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +489,20 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,11 +513,10 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
-static void Exec_ListenPreCommit(void);
+static void Exec_ListenPreCommit(const char *channel);
static void Exec_ListenCommit(const char *channel);
static void Exec_UnlistenCommit(const char *channel);
static void Exec_UnlistenAllCommit(void);
@@ -456,16 +540,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +554,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +696,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +837,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,7 +864,7 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
+ * Actual update of the listenChannelsHash happens during transaction
* commit.
*/
static void
@@ -782,30 +964,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -877,7 +1078,7 @@ PreCommit_Notify(void)
switch (actrec->action)
{
case LISTEN_LISTEN:
- Exec_ListenPreCommit();
+ Exec_ListenPreCommit(actrec->channel);
break;
case LISTEN_UNLISTEN:
/* there is no Exec_UnlistenPreCommit() */
@@ -893,6 +1094,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list (list_member_ptr compares pointers, not strings) */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -902,6 +1133,13 @@ PreCommit_Notify(void)
*/
(void) GetCurrentTransactionId();
+ /*
+ * We will be calling SignalBackends() at AtCommit_Notify time, so
+ * make sure its auxiliary data structures exist now, where an ERROR
+ * will still abort the transaction cleanly.
+ */
+ initChannelHash();
+
/*
* Serialize writers by acquiring a special lock that we hold till
* after commit. This ensures that queue entries appear in commit
@@ -921,6 +1159,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1192,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1218,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -1001,7 +1263,8 @@ AtCommit_Notify(void)
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1037,147 +1300,270 @@ AtCommit_Notify(void)
* This function must make sure we are ready to catch any incoming messages.
*/
static void
-Exec_ListenPreCommit(void)
+Exec_ListenPreCommit(const char *channel)
{
- QueuePosition head;
- QueuePosition max;
- ProcNumber prevListener;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ListenerEntry *listeners;
/*
- * Nothing to do if we are already listening to something, nor if we
- * already ran this routine in this transaction.
+ * If we are not yet registered as a listener (this is our first LISTEN),
+ * register now.
*/
- if (amRegisteredListener)
- return;
-
- if (Trace_notify)
- elog(DEBUG1, "Exec_ListenPreCommit(%d)", MyProcPid);
-
- /*
- * Before registering, make sure we will unlisten before dying. (Note:
- * this action does not get undone if we abort later.)
- */
- if (!unlistenExitRegistered)
+ if (!amRegisteredListener)
{
- before_shmem_exit(Async_UnlistenOnExit, 0);
- unlistenExitRegistered = true;
- }
+ QueuePosition head;
+ QueuePosition max;
+ ProcNumber prevListener;
- /*
- * This is our first LISTEN, so establish our pointer.
- *
- * We set our pointer to the global tail pointer and then move it forward
- * over already-committed notifications. This ensures we cannot miss any
- * not-yet-committed notifications. We might get a few more but that
- * doesn't hurt.
- *
- * In some scenarios there might be a lot of committed notifications that
- * have not yet been pruned away (because some backend is being lazy about
- * reading them). To reduce our startup time, we can look at other
- * backends and adopt the maximum "pos" pointer of any backend that's in
- * our database; any notifications it's already advanced over are surely
- * committed and need not be re-examined by us. (We must consider only
- * backends connected to our DB, because others will not have bothered to
- * check committed-ness of notifications in our DB.)
- *
- * We need exclusive lock here so we can look at other backends' entries
- * and manipulate the list links.
- */
- LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- head = QUEUE_HEAD;
- max = QUEUE_TAIL;
- prevListener = INVALID_PROC_NUMBER;
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
- {
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
- max = QUEUE_POS_MAX(max, QUEUE_BACKEND_POS(i));
- /* Also find last listening backend before this one */
- if (i < MyProcNumber)
- prevListener = i;
- }
- QUEUE_BACKEND_POS(MyProcNumber) = max;
- QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
- QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
- /* Insert backend into list of listeners at correct position */
- if (prevListener != INVALID_PROC_NUMBER)
- {
- QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_NEXT_LISTENER(prevListener);
- QUEUE_NEXT_LISTENER(prevListener) = MyProcNumber;
- }
- else
- {
- QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
- QUEUE_FIRST_LISTENER = MyProcNumber;
- }
- LWLockRelease(NotifyQueueLock);
+ if (Trace_notify)
+ elog(DEBUG1, "Exec_ListenPreCommit(%s,%d)", channel, MyProcPid);
- /* Now we are listed in the global array, so remember we're listening */
- amRegisteredListener = true;
+ /*
+ * Before registering, make sure we will unlisten before dying. (Note:
+ * this action does not get undone if we abort later.)
+ */
+ if (!unlistenExitRegistered)
+ {
+ before_shmem_exit(Async_UnlistenOnExit, 0);
+ unlistenExitRegistered = true;
+ }
- /*
- * Try to move our pointer forward as far as possible. This will skip
- * over already-committed notifications, which we want to do because they
- * might be quite stale. Note that we are not yet listening on anything,
- * so we won't deliver such notifications to our frontend. Also, although
- * our transaction might have executed NOTIFY, those message(s) aren't
- * queued yet so we won't skip them here.
- */
- if (!QUEUE_POS_EQUAL(max, head))
- asyncQueueReadAllNotifications();
-}
+ /*
+ * This is our first LISTEN, so establish our pointer.
+ *
+ * We set our pointer to the global tail pointer and then move it
+ * forward over already-committed notifications. This ensures we
+ * cannot miss any not-yet-committed notifications. We might get a
+ * few more but that doesn't hurt.
+ *
+ * In some scenarios there might be a lot of committed notifications
+ * that have not yet been pruned away (because some backend is being
+ * lazy about reading them). To reduce our startup time, we can look
+ * at other backends and adopt the maximum "pos" pointer of any
+ * backend that's in our database; any notifications it's already
+ * advanced over are surely committed and need not be re-examined by
+ * us. (We must consider only backends connected to our DB, because
+ * others will not have bothered to check committed-ness of
+ * notifications in our DB.)
+ *
+ * We need exclusive lock here so we can look at other backends'
+ * entries and manipulate the list links.
+ */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ head = QUEUE_HEAD;
+ max = QUEUE_TAIL;
+ prevListener = INVALID_PROC_NUMBER;
+ for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ {
+ if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ max = QUEUE_POS_MAX(max, QUEUE_BACKEND_POS(i));
+ /* Also find last listening backend before this one */
+ if (i < MyProcNumber)
+ prevListener = i;
+ }
+ QUEUE_BACKEND_POS(MyProcNumber) = max;
+ QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
+ QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
+ /* Insert backend into list of listeners at correct position */
+ if (prevListener != INVALID_PROC_NUMBER)
+ {
+ QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_NEXT_LISTENER(prevListener);
+ QUEUE_NEXT_LISTENER(prevListener) = MyProcNumber;
+ }
+ else
+ {
+ QUEUE_NEXT_LISTENER(MyProcNumber) = QUEUE_FIRST_LISTENER;
+ QUEUE_FIRST_LISTENER = MyProcNumber;
+ }
+ LWLockRelease(NotifyQueueLock);
-/*
- * Exec_ListenCommit --- subroutine for AtCommit_Notify
- *
- * Add the channel to the list of channels we are listening on.
- */
-static void
-Exec_ListenCommit(const char *channel)
-{
- MemoryContext oldcontext;
+ /* Now we are listed in the global array, so remember we're listening */
+ amRegisteredListener = true;
+
+ /*
+ * Try to move our pointer forward as far as possible. This will skip
+ * over already-committed notifications, which we want to do because
+ * they might be quite stale. Note that we are not yet listening on
+ * anything, so we won't deliver such notifications to our frontend.
+ * Also, although our transaction might have executed NOTIFY, those
+ * message(s) aren't queued yet so we won't skip them here.
+ */
+ if (!QUEUE_POS_EQUAL(max, head))
+ asyncQueueReadAllNotifications();
+ }
/* Do nothing if we are already listening on this channel */
if (IsListeningOn(channel))
return;
/*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
- */
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ * Add the channel to listenChannelsHash. This can OOM, but we're still
+ * in PreCommit so the transaction can abort safely.
+ */
+ initListenChannelsHash();
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /*
+ * Now update the shared channelHash. We insert an entry with
+ * active=false, which will be flipped to true in Exec_ListenCommit.
+ */
+ initChannelHash();
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /*
+ * Find or create the channel entry. For new entries, we initialize
+ * listenersArray to InvalidDsaPointer as a marker.
+ */
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ /* First listener for this channel */
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * new_size);
+ ListenerEntry *new_listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ new_array);
+
+ memcpy(new_listeners, listeners,
+ sizeof(ListenerEntry) * entry->numListeners);
+
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners].procno = MyProcNumber;
+ listeners[entry->numListeners].active = false;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
+}
+
+/*
+ * Exec_ListenCommit --- subroutine for AtCommit_Notify
+ *
+ * Activate the channel entry that was pre-allocated in Exec_ListenPreCommit.
+ * This is called after commit to clog, so it's important to have very low
+ * probability of failure. By design, all we do here is set the active
+ * flag.
+ */
+static void
+Exec_ListenCommit(const char *channel)
+{
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
+ int i;
+
+ if (Trace_notify)
+ elog(DEBUG1, "Exec_ListenCommit(%s,%d)", channel, MyProcPid);
+
+ /*
+ * The entry has been created in Exec_ListenPreCommit. If we get here,
+ * channelHash and the entry must exist.
+ */
+ Assert(channelHash != NULL);
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ Assert(entry != NULL);
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procno == MyProcNumber)
+ {
+ listeners[i].active = true;
+ dshash_release_lock(channelHash, entry);
+ return;
+ }
+ }
+
+ /* If the entry is not found, it's a bug. */
+ Assert(false);
+ dshash_release_lock(channelHash, entry);
}
/*
* Exec_UnlistenCommit --- subroutine for AtCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Unlisten the specified channel for this backend.
*/
static void
Exec_UnlistenCommit(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
+ int i;
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
- foreach(q, listenChannels)
+ /* Remove from our local cache */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel, HASH_REMOVE, NULL);
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+
+ /* Look up the channel with exclusive lock so we can modify it */
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i].procno == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ /* Last listener for this channel */
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ {
+ dshash_release_lock(channelHash, entry);
+ }
+
+ return;
}
}
+ dshash_release_lock(channelHash, entry);
+
/*
* We do not complain about unlistening something not being listened;
* should we?
@@ -1192,34 +1578,68 @@ Exec_UnlistenCommit(const char *channel)
static void
Exec_UnlistenAllCommit(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ListenerEntry *listeners;
+ int i;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procno == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1649,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1661,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1986,21 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are not interested in our notifies, that are known
+ * to still be positioned at the old queue head, or anywhere in the
+ * queue region we just wrote, can be safely advanced directly to the
+ * new head, since that region is known to contain only our own
+ * notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
+ *
+ * Backends that are not interested in our notifies, that are advancing
+ * to a target position before the new queue head, or that are not
+ * advancing and are stationary at a position before the old queue head
+ * need to be signaled since notifications could otherwise be delayed.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1583,6 +2013,9 @@ SignalBackends(void)
int32 *pids;
ProcNumber *procnos;
int count;
+ ListCell *lc;
+
+ Assert(channelHash != NULL || pendingNotifyChannels == NIL);
/*
* Identify backends that we need to signal. We don't want to send
@@ -1597,36 +2030,97 @@ SignalBackends(void)
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ListenerEntry *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue; /* No listeners registered for this channel */
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i;
+ int32 pid;
+ QueuePosition pos;
+
/*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
+ * Only signal backends that have active=true. Backends with
+ * active=false have done LISTEN in PreCommit but not yet
+ * committed, so they're not really listening yet.
*/
+ if (!listeners[j].active)
+ continue;
+
+ i = listeners[j].procno;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ pids[count] = pid;
+ procnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
@@ -1673,12 +2167,97 @@ AtAbort_Notify(void)
/*
* If we LISTEN but then roll back the transaction after PreCommit_Notify,
* we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * listenChannelsHash. In that case, deregister again.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
- /* And clean up */
+ /*
+ * Remove any channels we added during Exec_ListenPreCommit. We need to
+ * clean up both the local listenChannelsHash and any inactive entries in
+ * the shared channelHash to avoid accumulating stale data.
+ */
+ if (pendingActions != NULL)
+ {
+ ListCell *p;
+
+ foreach(p, pendingActions->actions)
+ {
+ ListenAction *actrec = (ListenAction *) lfirst(p);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
+ bool removeFromLocal;
+ bool found;
+ int i;
+
+ if (actrec->action != LISTEN_LISTEN)
+ continue;
+
+ /*
+ * For each LISTEN action, determine if we should clean up the
+ * local and/or shared hash entries. If we have an active=true
+ * entry in the shared hash, we were already listening from a
+ * previous transaction, so leave everything alone. Otherwise,
+ * clean up what this transaction added.
+ */
+ removeFromLocal = true;
+ found = false;
+
+ if (channelHash != NULL)
+ {
+ ChannelHashPrepareKey(&key, MyDatabaseId, actrec->channel);
+ entry = dshash_find(channelHash, &key, true);
+
+ if (entry != NULL)
+ {
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procno == MyProcNumber)
+ {
+ found = true;
+
+ if (listeners[i].active)
+ {
+ /* Already committed - leave both hashes alone */
+ removeFromLocal = false;
+ }
+ else
+ {
+ /* Inactive - remove from shared hash */
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ dshash_release_lock(channelHash, entry);
+ }
+ break;
+ }
+ }
+
+ if (!found || !removeFromLocal)
+ dshash_release_lock(channelHash, entry);
+ }
+ }
+
+ /* Remove from local hash if appropriate */
+ if (removeFromLocal && listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, actrec->channel,
+ HASH_REMOVE, NULL);
+ }
+ }
+
ClearPendingActionsAndNotifies();
}
@@ -1854,20 +2433,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2542,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2055,7 +2645,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2310,7 +2900,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2414,13 +3004,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2433,10 +3025,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2444,22 +3048,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2497,7 +3121,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2509,6 +3133,7 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
}
/*
@@ -2519,3 +3144,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..7c2cf960093 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..4236965e72a 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -101,6 +101,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c751c25a04d..3d371c2808d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -412,6 +412,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
@@ -1564,6 +1566,7 @@ ListDictionary
ListParsedLex
ListenAction
ListenActionKind
+ListenerEntry
ListenStmt
LoInfo
LoadStmt
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-25 20:14 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
1 sibling, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-11-25 20:14 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Thu, Nov 20, 2025, at 21:26, Tom Lane wrote:
> So I think there needs to be a serious effort made to move as
> much as we possibly can of the potentially-risky stuff into
> PreCommit_Notify. In particular I think we ought to create
> the shared channel hash entry then, and even insert our PID
> into it. We could expand the listenersArray entries to include
> both a PID and a boolean "is it REALLY listening?", and then
> during Exec_ListenCommit we'd only be required to find an
> entry we already added and set its boolean, so there's no OOM
> hazard. Possibly do something similar with the local
> listenChannelsHash, so as to remove that possibility of OOM
> failure as well.
>
> (An alternative design could be to go ahead and insert our
> PID during PreCommit_Notify, and just tolerate the small
> risk of getting signaled when we didn't need to be. But
> then we'd need some mechanism for cleaning out the bogus
> entry during AtAbort_Notify.)
[...back from a little detour with new insights...]
It looks to me like it would be best with two boolean fields; one
boolean to stage the updates during PreCommit_Notify, that each
pendingActions could flip back and forth, and another boolean that
represents the current value, which we would overwrite with the staged
value during AtCommit_Notify.
This way, cleanup for the rare edge case where PreCommit_Notify is
followed by AtAbort_Notify seems simple: we just need to go through all
entries and delete those where current=false, since those entries were
newly added by PreCommit_Notify, i.e. we were not listening on those
channels before. We should probably also set a flag in
PreCommit_Notify, so that we only need to do cleanup in AtAbort_Notify
if we actually reached PreCommit_Notify.
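For illustration only, the staging scheme described above might be
sketched roughly as follows. This is not code from any patch version:
the fixed-size array, the ListenerSlot/Channel types, and every function
name are hypothetical stand-ins for the shared-memory structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SLOTS 8				/* hypothetical fixed capacity */

/* Hypothetical listener slot; the real layout may well differ. */
typedef struct
{
	int			procno;			/* backend owning this slot */
	bool		staged;			/* value wanted after commit */
	bool		current;		/* committed value, read by notifiers */
} ListenerSlot;

typedef struct
{
	ListenerSlot slots[MAX_SLOTS];
	int			nslots;
} Channel;

static ListenerSlot *
find_slot(Channel *ch, int procno)
{
	for (int i = 0; i < ch->nslots; i++)
		if (ch->slots[i].procno == procno)
			return &ch->slots[i];
	return NULL;
}

static void
remove_slot(Channel *ch, ListenerSlot *s)
{
	int			idx = (int) (s - ch->slots);

	assert(idx >= 0 && idx < ch->nslots);
	ch->nslots--;
	memmove(&ch->slots[idx], &ch->slots[idx + 1],
			sizeof(ListenerSlot) * (size_t) (ch->nslots - idx));
}

/* Pre-commit step: the only one that can fail (no room for a new slot). */
static bool
stage_listen(Channel *ch, int procno, bool want)
{
	ListenerSlot *s = find_slot(ch, procno);

	if (s == NULL)
	{
		if (!want)
			return true;		/* UNLISTEN without a slot: nothing to do */
		if (ch->nslots >= MAX_SLOTS)
			return false;
		s = &ch->slots[ch->nslots++];
		s->procno = procno;
		s->current = false;		/* not listening until commit */
	}
	s->staged = want;			/* may be flipped repeatedly before commit */
	return true;
}

/* Post-commit step: copies a flag and possibly shrinks; never allocates. */
static void
commit_listen(Channel *ch, int procno)
{
	ListenerSlot *s = find_slot(ch, procno);

	if (s == NULL)
		return;
	s->current = s->staged;
	if (!s->current)
		remove_slot(ch, s);
}

/* Abort step: drop slots that were never committed, revert the rest. */
static void
abort_listen(Channel *ch, int procno)
{
	ListenerSlot *s = find_slot(ch, procno);

	if (s == NULL)
		return;
	if (!s->current)
		remove_slot(ch, s);
	else
		s->staged = s->current;
}
```

In this sketch, a LISTEN that is staged and then aborted leaves no slot
behind, while an aborted UNLISTEN on a committed slot simply resets
staged back to true.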
I haven't implemented this yet, but I have a good feeling about this
approach. Just wanted to share the plan before I start working, in case
anyone sees any flaw with it, or sees a better approach.
/Joel
* Re: Optimize LISTEN/NOTIFY
@ 2025-11-25 20:17 Tom Lane <[email protected]>
parent: Arseniy Mukhin <[email protected]>
5 siblings, 1 reply; 120+ messages in thread
From: Tom Lane @ 2025-11-25 20:17 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> It looks to me like it would be best with two boolean fields; one
> boolean to stage the updates during PreCommit_Notify, that each
> pendingActions could flip back and forth, and another boolean that
> represents the current value, which we would overwrite with the staged
> value during AtCommit_Notify.
+1, I had a feeling that a single boolean wouldn't quite do it.
(There are various ways we could define the states, but what
you say above seems pretty reasonable.)
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2025-12-26 20:12 Joel Jacobson <[email protected]>
parent: Tom Lane <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-12-26 20:12 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Tue, Nov 25, 2025, at 21:17, Tom Lane wrote:
> "Joel Jacobson" <[email protected]> writes:
>> It looks to me like it would be best with two boolean fields; one
>> boolean to stage the updates during PreCommit_Notify, that each
>> pendingActions could flip back and forth, and another boolean that
>> represents the current value, which we would overwrite with the staged
>> value during AtCommit_Notify.
>
> +1, I had a feeling that a single boolean wouldn't quite do it.
> (There are various ways we could define the states, but what
> you say above seems pretty reasonable.)
I've implemented the two-boolean approach and think it's good.
The signal arrays are now preallocated during PreCommit_Notify.
More details in the patch message under "Two-phase staging pattern".
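As a rough illustration of the preallocation idea (a sketch only, using
plain malloc and hypothetical names rather than the backend's memory
contexts or the patch's actual variables): everything that can fail
happens in the reserve step, and the later step only writes into space
that already exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical container for the backends we intend to signal. */
typedef struct
{
	int		   *pids;
	int		   *procnos;
	int			capacity;
	int			count;
} SignalTargets;

/*
 * Pre-commit step: the only place that allocates.  On failure nothing is
 * left half-initialized, so the caller can still abort the transaction.
 */
static bool
targets_reserve(SignalTargets *t, int max_backends)
{
	assert(max_backends > 0);
	t->pids = malloc(sizeof(int) * (size_t) max_backends);
	t->procnos = malloc(sizeof(int) * (size_t) max_backends);
	if (t->pids == NULL || t->procnos == NULL)
	{
		free(t->pids);
		free(t->procnos);
		t->pids = NULL;
		t->procnos = NULL;
		t->capacity = 0;
		t->count = 0;
		return false;
	}
	t->capacity = max_backends;
	t->count = 0;
	return true;
}

/* Post-commit step: fills reserved space; cannot run out of memory. */
static void
targets_add(SignalTargets *t, int pid, int procno)
{
	assert(t->count < t->capacity);
	t->pids[t->count] = pid;
	t->procnos[t->count] = procno;
	t->count++;
}

static void
targets_release(SignalTargets *t)
{
	free(t->pids);
	free(t->procnos);
	t->pids = NULL;
	t->procnos = NULL;
	t->capacity = 0;
	t->count = 0;
}
```

Sizing the reservation for the maximum number of backends means the
post-commit code never has to grow the arrays, whatever the listener
count turns out to be.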
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v31.patch (9.3K, 2-0001-optimize_listen_notify-v31.patch)
From cdaf9bbb5f1f734884e0204a4c9e3944431b6d81 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Wed, 8 Oct 2025 09:30:54 +0200
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 114 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 68 +++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..443a6eb669f 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,105 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +194,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +205,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..0a01e777b98 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -53,6 +67,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +106,24 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +137,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v31.patch (53.3K, 3-0002-optimize_listen_notify-v31.patch)
From fd5b0349d1d826d56cb887bb60c7f34d08811267 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 15 Nov 2025 22:18:50 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
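As a rough illustration of the (dboid, channel) key: the dshash
parameters in this patch use dshash_memcmp, so two keys for the same
channel have to be byte-identical, including the unused tail of the
name buffer. The sketch below is standalone and only assumes that much;
sketch_key, sketch_prepare_key, sketch_key_equal and SKETCH_NAMEDATALEN
are hypothetical names, not the patch's ChannelHashPrepareKey, whose
body is not shown here.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NAMEDATALEN 64   /* stand-in for NAMEDATALEN (assumption) */

/* Hypothetical stand-in for the patch's ChannelHashKey. */
typedef struct sketch_key
{
    unsigned int dboid;
    char         channel[SKETCH_NAMEDATALEN];
} sketch_key;

/*
 * Zero the whole key first so that struct padding and the bytes after
 * the terminating NUL are deterministic, then fill in the fields.
 */
static void
sketch_prepare_key(sketch_key *key, unsigned int dboid, const char *channel)
{
    assert(strlen(channel) < SKETCH_NAMEDATALEN);
    memset(key, 0, sizeof(*key));
    key->dboid = dboid;
    strncpy(key->channel, channel, SKETCH_NAMEDATALEN - 1);
}

/* Bytewise comparison over the full key, in the spirit of dshash_memcmp. */
static int
sketch_key_equal(const sketch_key *a, const sketch_key *b)
{
    return memcmp(a, b, sizeof(sketch_key)) == 0;
}
```

Without the memset, two keys built for the same channel in differently
initialized stack slots could compare as unequal, and the listener
lookup would miss.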
Two-phase staging pattern
-------------------------
To ensure transaction safety, LISTEN/UNLISTEN operations use a two-phase
staging pattern. Memory allocation and hash table modifications happen
in PreCommit_Notify (before committing to clog), where failures can
safely abort the transaction. After committing to clog, AtCommit_Notify
only looks up entries that were already added during PreCommit_Notify
and sets their boolean flags, so there is no OOM hazard.
Each listener entry in the shared hash uses a ListenerEntry struct
containing the backend's ProcNumber and two boolean flags: "staged" is
set during PreCommit_Notify, while "current" is copied from staged
during AtCommit_Notify and is what other backends read.
For LISTEN, PreCommit_Notify allocates memory and adds an entry with
staged=true and current=false, then AtCommit_Notify copies staged to
current. For UNLISTEN, PreCommit_Notify sets staged=false on the
existing entry, then AtCommit_Notify copies staged to current and
removes the entry if current is now false.
On abort, staged changes are reverted to match current, and entries
where current=false (never committed) are removed.
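The staged/current life cycle described above can be modeled in a few
lines. This is a minimal sketch for a single listener slot, not the
patch's code: sketch_listener and the sketch_* functions are
hypothetical, and the in_use flag stands in for "the entry exists in
the shared array".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the patch's ListenerEntry. */
typedef struct sketch_listener
{
    int  procno;
    bool staged;    /* desired state, set before commit */
    bool current;   /* committed state, the one notifiers read */
    bool in_use;    /* slot holds an entry (sketch-only field) */
} sketch_listener;

/* Pre-commit LISTEN: create the entry if needed, then stage it. */
static void
sketch_stage_listen(sketch_listener *e, int procno)
{
    assert(procno >= 0);
    if (!e->in_use)
    {
        e->procno = procno;
        e->current = false;
        e->in_use = true;
    }
    e->staged = true;
}

/* Pre-commit UNLISTEN: only flips a flag on an existing entry. */
static void
sketch_stage_unlisten(sketch_listener *e)
{
    if (e->in_use)
        e->staged = false;
}

/* Post-commit: no allocation, publish staged and drop dead entries. */
static void
sketch_commit(sketch_listener *e)
{
    if (!e->in_use)
        return;
    e->current = e->staged;
    if (!e->current)
        e->in_use = false;
}

/* Abort: revert staged to current, drop entries that never committed. */
static void
sketch_abort(sketch_listener *e)
{
    if (!e->in_use)
        return;
    e->staged = e->current;
    if (!e->current)
        e->in_use = false;
}
```

In this model everything that can fail (creating the entry) happens in
the staging functions, while sketch_commit only copies a flag and
releases a slot.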
Signal arrays for sending notifications are also preallocated in
PreCommit_Notify to avoid allocation failures after committing to clog.
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
Idle backends that are stationary at a position before the old queue
head are signaled, since they might be interested in the notifications
in between their current position and the old queue head.
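The per-backend decision can be summarized as a small function. This is
a sketch under stated assumptions, not the logic of SignalBackends
itself: queue positions are modeled as plain increasing integers, the
sketch_* names are hypothetical, a backend listening on a notified
channel is always signaled, and a non-advancing backend anywhere in
[oldHead, newHead) is treated as directly advanceable.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum sketch_action
{
    SKETCH_NOTHING,     /* leave the backend alone */
    SKETCH_ADVANCE,     /* move its position to the new head */
    SKETCH_SIGNAL       /* wake it up so it reads the queue */
} sketch_action;

/* Hypothetical per-backend state, loosely after QueueBackendStatus. */
typedef struct sketch_backend
{
    long pos;                           /* queue read up to here */
    bool is_advancing;                  /* currently reading the queue */
    long advancing_pos;                 /* target of that read */
    bool listens_on_notified_channel;   /* interested in our notifies */
} sketch_backend;

static sketch_action
sketch_decide(sketch_backend *b, long old_head, long new_head)
{
    assert(old_head <= new_head);

    /* Interested listeners must read the new entries themselves. */
    if (b->listens_on_notified_channel)
        return SKETCH_SIGNAL;

    /* An advancing backend only needs a signal if it will stop short. */
    if (b->is_advancing)
        return (b->advancing_pos < new_head) ? SKETCH_SIGNAL : SKETCH_NOTHING;

    /* Stationary before the old head: it may have unread entries. */
    if (b->pos < old_head)
        return SKETCH_SIGNAL;

    /* Stationary inside our own range: skip it past our entries. */
    if (b->pos < new_head)
    {
        b->pos = new_head;
        return SKETCH_ADVANCE;
    }

    return SKETCH_NOTHING;
}
```

For example, with oldHead = 10 and newHead = 13, an uninterested idle
backend at position 10 is moved to 13 without a signal, while one
sitting at position 4 is signaled.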
Other notes
-----------
The patch maintains dual data structures: a shared channelHash for
determining which backends to signal, and a local per-backend
listenChannelsHash for fast lock-free lookups during notification
processing. This avoids contention on the shared hash during the
high-frequency IsListeningOn checks that occur for every notification
read from the queue. Backends remain registered in the global listener
list as long as listenChannelsHash is non-empty.
This patch adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
There are no user-visible behavioral changes; this is an internal
optimization only.
---
src/backend/commands/async.c | 951 ++++++++++++++----
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 778 insertions(+), 177 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index eb86402cae4..430fa2f3f00 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -64,20 +66,33 @@
* notifications, we can still call elog(ERROR, ...) and the transaction
* will roll back.
*
+ * PreCommit_Notify() also stages any pending LISTEN/UNLISTEN actions: LISTEN
+ * adds entries to listenChannelsHash and the shared channelHash with
+ * staged=true; UNLISTEN sets staged=false on the existing channelHash entry.
+ * This is done before committing to clog so that failures can safely abort.
+ *
* Once we have put all of the notifications into the queue, we return to
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * commit the staged listen/unlisten changes by copying staged to current,
+ * removing entries where current becomes false. Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, or within the range
+ * written, avoiding unnecessary wakeups for idle listeners that have
+ * nothing to read. Backends that cannot be directly advanced are signaled
+ * if they are stuck behind the old queue head, or advancing to a position
+ * before the new queue head, since otherwise notifications could be delayed.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +152,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +180,37 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+
+typedef struct ListenerEntry
+{
+ ProcNumber procNo;
+ bool staged;
+ bool current;
+} ListenerEntry;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ListenerEntry array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +273,14 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
- * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +298,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +315,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * already advancing themselves. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +344,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +362,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +377,18 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
+ * listenChannelsHash identifies the channels we are listening to.
+ * Entries are added during PreCommit_Notify (before committing to clog) and
+ * removed on abort if the LISTEN was never committed. It is a hash table
+ * of channel names, allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
- * all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * all actions requested in the current transaction. During PreCommit_Notify,
+ * we stage these changes in listenChannelsHash and the shared channelHash.
+ * On abort, AtAbort_Notify cleans up any staged-but-uncommitted entries.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +457,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +468,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +490,36 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
+/*
+ * List of channels with staged listen/unlisten changes in the current
+ * transaction. Populated during PreCommit_Notify and used by AtCommit_Notify
+ * to copy staged values to current.
+ */
+static List *pendingListenChannels = NIL;
+
+/*
+ * Preallocated arrays for SignalBackends to avoid memory allocation after
+ * committing to clog. Allocated in PreCommit_Notify when there are pending
+ * notifications.
+ */
+static int32 *signalPids = NULL;
+static ProcNumber *signalProcnos = NULL;
+
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,14 +530,14 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
static void Exec_ListenPreCommit(void);
-static void Exec_ListenCommit(const char *channel);
-static void Exec_UnlistenCommit(const char *channel);
-static void Exec_UnlistenAllCommit(void);
+static void Exec_ListenPreCommitStage(const char *channel);
+static void Exec_UnlistenPreCommitStage(const char *channel);
+static void Exec_UnlistenAllPreCommitStage(void);
+static void CleanupListenersOnExit(void);
static bool IsListeningOn(const char *channel);
static void asyncQueueUnregister(void);
static bool asyncQueueIsFull(void);
@@ -456,16 +558,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +572,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +714,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +855,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,8 +882,8 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
- * commit.
+ * Actual update of listenChannelsHash and channelHash happens during
+ * PreCommit_Notify, with staged changes committed in AtCommit_Notify.
*/
static void
queue_listen(ListenActionKind action, const char *channel)
@@ -782,30 +982,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -821,7 +1040,7 @@ pg_listening_channels(PG_FUNCTION_ARGS)
static void
Async_UnlistenOnExit(int code, Datum arg)
{
- Exec_UnlistenAllCommit();
+ CleanupListenersOnExit();
asyncQueueUnregister();
}
@@ -868,8 +1087,24 @@ PreCommit_Notify(void)
elog(DEBUG1, "PreCommit_Notify");
/* Preflight for any pending listen/unlisten actions */
+ if (pendingNotifies != NULL || pendingActions != NULL)
+ initChannelHash();
+
+ if (pendingNotifies != NULL)
+ {
+ if (signalPids == NULL)
+ signalPids = MemoryContextAlloc(TopMemoryContext,
+ MaxBackends * sizeof(int32));
+
+ if (signalProcnos == NULL)
+ signalProcnos = MemoryContextAlloc(TopMemoryContext,
+ MaxBackends * sizeof(ProcNumber));
+ }
+
if (pendingActions != NULL)
{
+ initListenChannelsHash();
+
foreach(p, pendingActions->actions)
{
ListenAction *actrec = (ListenAction *) lfirst(p);
@@ -878,12 +1113,13 @@ PreCommit_Notify(void)
{
case LISTEN_LISTEN:
Exec_ListenPreCommit();
+ Exec_ListenPreCommitStage(actrec->channel);
break;
case LISTEN_UNLISTEN:
- /* there is no Exec_UnlistenPreCommit() */
+ Exec_UnlistenPreCommitStage(actrec->channel);
break;
case LISTEN_UNLISTEN_ALL:
- /* there is no Exec_UnlistenAllPreCommit() */
+ Exec_UnlistenAllPreCommitStage();
break;
}
}
@@ -893,6 +1129,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1187,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1220,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1246,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -966,7 +1256,6 @@ PreCommit_Notify(void)
void
AtCommit_Notify(void)
{
- ListCell *p;
/*
* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
@@ -978,30 +1267,60 @@ AtCommit_Notify(void)
if (Trace_notify)
elog(DEBUG1, "AtCommit_Notify");
- /* Perform any pending listen/unlisten actions */
- if (pendingActions != NULL)
+ /* Commit staged listen/unlisten changes by copying staged to current */
+ if (pendingListenChannels != NIL)
{
- foreach(p, pendingActions->actions)
+ ListCell *lc;
+
+ foreach(lc, pendingListenChannels)
{
- ListenAction *actrec = (ListenAction *) lfirst(p);
+ char *channel = (char *) lfirst(lc);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
- switch (actrec->action)
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- case LISTEN_LISTEN:
- Exec_ListenCommit(actrec->channel);
- break;
- case LISTEN_UNLISTEN:
- Exec_UnlistenCommit(actrec->channel);
- break;
- case LISTEN_UNLISTEN_ALL:
- Exec_UnlistenAllCommit();
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ listeners[i].current = listeners[i].staged;
+
+ if (!listeners[i].current)
+ {
+ (void) hash_search(listenChannelsHash, channel,
+ HASH_REMOVE, NULL);
+
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ entry = NULL;
+ }
+ }
break;
+ }
}
+
+ if (entry != NULL)
+ dshash_release_lock(channelHash, entry);
}
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1127,99 +1446,210 @@ Exec_ListenPreCommit(void)
}
/*
- * Exec_ListenCommit --- subroutine for AtCommit_Notify
+ * Exec_ListenPreCommitStage --- subroutine for PreCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Stage a LISTEN by adding entries to listenChannelsHash and the shared
+ * channelHash with staged=true, current=false. The staged value is copied
+ * to current in AtCommit_Notify.
*/
static void
-Exec_ListenCommit(const char *channel)
+Exec_ListenPreCommitStage(const char *channel)
{
- MemoryContext oldcontext;
-
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
- return;
-
- /*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
- */
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ListenerEntry *listeners;
+
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ pendingListenChannels = lappend(pendingListenChannels, pstrdup(channel));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ entry->listenersArray = InvalidDsaPointer;
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->numListeners = 0;
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ listeners[i].staged = true;
+ dshash_release_lock(channelHash, entry);
+ return;
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * new_size);
+ ListenerEntry *new_listeners = (ListenerEntry *) dsa_get_address(channelDSA, new_array);
+
+ memcpy(new_listeners, listeners, sizeof(ListenerEntry) * entry->numListeners);
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners].procNo = MyProcNumber;
+ listeners[entry->numListeners].staged = true;
+ listeners[entry->numListeners].current = false;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
- * Exec_UnlistenCommit --- subroutine for AtCommit_Notify
+ * Exec_UnlistenPreCommitStage --- subroutine for PreCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Stage an UNLISTEN by setting staged=false on our entry in channelHash.
+ * The staged value is copied to current in AtCommit_Notify, and the entry
+ * is removed if current becomes false.
*/
static void
-Exec_UnlistenCommit(const char *channel)
+Exec_UnlistenPreCommitStage(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
- if (Trace_notify)
- elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
- foreach(q, listenChannels)
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i].procNo == MyProcNumber)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
+ listeners[i].staged = false;
+
+ pendingListenChannels = lappend(pendingListenChannels, pstrdup(channel));
break;
}
}
- /*
- * We do not complain about unlistening something not being listened;
- * should we?
- */
+ dshash_release_lock(channelHash, entry);
}
/*
- * Exec_UnlistenAllCommit --- subroutine for AtCommit_Notify
+ * Exec_UnlistenAllPreCommitStage --- subroutine for PreCommit_Notify
*
- * Unlisten on all channels for this backend.
+ * Stage UNLISTEN * by setting staged=false on all our entries in channelHash.
*/
static void
-Exec_UnlistenAllCommit(void)
+Exec_UnlistenAllPreCommitStage(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ListenerEntry *listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber && listeners[i].current)
+ {
+ listeners[i].staged = false;
+ pendingListenChannels = lappend(pendingListenChannels,
+ pstrdup(entry->key.channel));
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+}
+
+/*
+ * CleanupListenersOnExit --- called from Async_UnlistenOnExit
+ *
+ * Remove this backend from all channels in the shared hash.
+ */
+static void
+CleanupListenersOnExit(void)
+{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
- elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ elog(DEBUG1, "CleanupListenersOnExit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ListenerEntry *listeners;
+ int i;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1659,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1671,7 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +1996,21 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are not interested in our notifies, that are known
+ * to still be positioned at the old queue head, or anywhere in the
+ * queue region we just wrote, can be safely advanced directly to the
+ * new head, since that region is known to contain only our own
+ * notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
+ *
+ * Backends that are not interested in our notifies, that are advancing
+ * to a target position before the new queue head, or that are not
+ * advancing and are stationary at a position before the old queue head
+ * need to be signaled, since notifications could otherwise be delayed.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1580,60 +2020,106 @@ asyncQueueFillWarning(void)
static void
SignalBackends(void)
{
- int32 *pids;
- ProcNumber *procnos;
int count;
+ ListCell *lc;
- /*
- * Identify backends that we need to signal. We don't want to send
- * signals while holding the NotifyQueueLock, so this loop just builds a
- * list of target PIDs.
- *
- * XXX in principle these pallocs could fail, which would be bad. Maybe
- * preallocate the arrays? They're not that large, though.
- */
- pids = (int32 *) palloc(MaxBackends * sizeof(int32));
- procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ Assert(signalPids != NULL && signalProcnos != NULL);
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ListenerEntry *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i;
+ int32 pid;
+ QueuePosition pos;
+
+ if (!listeners[j].current)
+ continue;
+
+ i = listeners[j].procNo;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ signalPids[count] = pid;
+ signalProcnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ signalPids[count] = pid;
+ signalProcnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
/* Now send signals */
for (int i = 0; i < count; i++)
{
- int32 pid = pids[i];
+ int32 pid = signalPids[i];
/*
* If we are signaling our own process, no need to involve the kernel;
@@ -1651,12 +2137,9 @@ SignalBackends(void)
* NotifyQueueLock; which is unlikely but certainly possible. So we
* just log a low-level debug message if it happens.
*/
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
+ if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, signalProcnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
}
-
- pfree(pids);
- pfree(procnos);
}
/*
@@ -1664,18 +2147,72 @@ SignalBackends(void)
*
* This is called at transaction abort.
*
- * Gets rid of pending actions and outbound notifies that we would have
- * executed if the transaction got committed.
+ * Revert any staged listen/unlisten changes and clean up transaction state.
*/
void
AtAbort_Notify(void)
{
/*
- * If we LISTEN but then roll back the transaction after PreCommit_Notify,
- * we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * Revert staged listen/unlisten changes. For new LISTENs (current=false),
+ * remove from both local and shared hash. For UNLISTENs (current=true),
+ * just revert staged back to current.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (pendingListenChannels != NIL && channelHash != NULL)
+ {
+ ListCell *lc;
+
+ foreach(lc, pendingListenChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry != NULL)
+ {
+ ListenerEntry *listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ if (!listeners[i].current)
+ {
+ /* New LISTEN being aborted: remove from local and shared */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel,
+ HASH_REMOVE, NULL);
+
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+ }
+ else
+ {
+ /* UNLISTEN being aborted: revert staged, keep local entry */
+ listeners[i].staged = listeners[i].current;
+ }
+ break;
+ }
+ }
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ dshash_release_lock(channelHash, entry);
+ }
+ }
+ }
+
+
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1854,20 +2391,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2500,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2051,7 +2599,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2306,7 +2854,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2410,13 +2958,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2429,10 +2979,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2440,22 +3002,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2493,7 +3075,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2505,6 +3087,8 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
+ pendingListenChannels = NIL;
}
/*
@@ -2515,3 +3099,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..32b0b21f184 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -371,6 +371,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 533344509e9..277a78e7954 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -102,6 +102,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..973d4a449fd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -421,6 +421,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-12-27 12:40 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 1 reply; 120+ messages in thread
From: Joel Jacobson @ 2025-12-27 12:40 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Fri, Dec 26, 2025, at 21:12, Joel Jacobson wrote:
> On Tue, Nov 25, 2025, at 21:17, Tom Lane wrote:
>> "Joel Jacobson" <[email protected]> writes:
>>> It looks to me like it would be best with two boolean fields; one
>>> boolean to stage the updates during PreCommit_Notify, that each
>>> pendingActions could flip back and forth, and another boolean that
>>> represents the current value, which we would overwrite with the staged
>>> value during AtCommit_Notify.
>>
>> +1, I had a feeling that a single boolean wouldn't quite do it.
>> (There are various ways we could define the states, but what
>> you say above seems pretty reasonable.)
>
> I've implemented the two boolean approach and think it's good.
>
> The signals arrays are now preallocated during PreCommit_Notify.
>
> More details in the patch message under "Two-phase staging pattern".
New version with some fixes.
I should have mentioned that v31 is based on v28 (v29 and v30 were discarded).
Here is also a write-up of changes from v28 to v31:
0001: No changes.
0002:
* To avoid post-commit OOM hazards, we now allocate hash table entries
during PreCommit_Notify. Each listener entry has two boolean flags:
staged and current. Each LISTEN/UNLISTEN action sets or clears the
staged flag during PreCommit_Notify. For each channel, the staged
value left by the last action is then copied to current during
AtCommit_Notify.
* On abort, AtAbort_Notify reverts staged changes.
* The signal arrays are now preallocated during PreCommit_Notify.
* Renamed Exec_UnlistenAllCommit to CleanupListenersOnExit for the
exit-handler path, since it has different semantics (unconditional
removal rather than staged/current handling).
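
To illustrate the staged/current handling described above, here is a rough standalone sketch. It is not code from the patch: all Sketch*/sketch_* names and the fixed-size array are hypothetical stand-ins for the dshash entry and its DSA-allocated listeners array. The point is only that the step that can fail (adding a slot) happens before commit, while the commit and abort steps touch nothing but existing slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_LISTENERS 8	/* hypothetical fixed capacity */

/* Hypothetical simplified listener slot; not the patch's ListenerEntry. */
typedef struct SketchListener
{
	int			procno;		/* stand-in for ProcNumber */
	bool		staged;		/* what the pending transaction wants */
	bool		current;	/* what notifiers are allowed to see */
} SketchListener;

typedef struct SketchChannel
{
	SketchListener slots[SKETCH_MAX_LISTENERS];
	int			nslots;
} SketchChannel;

SketchListener *
sketch_find(SketchChannel *ch, int procno)
{
	for (int i = 0; i < ch->nslots; i++)
		if (ch->slots[i].procno == procno)
			return &ch->slots[i];
	return NULL;
}

/* Remove a slot by moving the last slot into its place. */
void
sketch_remove(SketchChannel *ch, SketchListener *l)
{
	assert(ch->nslots > 0);
	*l = ch->slots[--ch->nslots];
}

/*
 * Pre-commit step: the only place where a slot is added, so the only place
 * that can fail.  Returning false here is harmless because the transaction
 * can still abort.
 */
bool
sketch_stage(SketchChannel *ch, int procno, bool want_listen)
{
	SketchListener *l = sketch_find(ch, procno);

	if (l == NULL)
	{
		if (!want_listen)
			return true;	/* UNLISTEN without an entry: nothing to do */
		if (ch->nslots >= SKETCH_MAX_LISTENERS)
			return false;	/* "out of memory" in this sketch */
		l = &ch->slots[ch->nslots++];
		l->procno = procno;
		l->current = false;
	}
	l->staged = want_listen;
	return true;
}

/* Post-commit step: no allocation, just copy staged to current. */
void
sketch_commit(SketchChannel *ch, int procno)
{
	SketchListener *l = sketch_find(ch, procno);

	if (l == NULL)
		return;
	l->current = l->staged;
	if (!l->current)
		sketch_remove(ch, l);
}

/* Abort step: drop slots that never became current, else revert staged. */
void
sketch_abort(SketchChannel *ch, int procno)
{
	SketchListener *l = sketch_find(ch, procno);

	if (l == NULL)
		return;
	if (!l->current)
		sketch_remove(ch, l);
	else
		l->staged = l->current;
}

/* A notifier would only consider slots whose current flag is set. */
bool
sketch_is_listening(SketchChannel *ch, int procno)
{
	SketchListener *l = sketch_find(ch, procno);

	return l != NULL && l->current;
}
```

With this shape, LISTEN followed by UNLISTEN in the same transaction just flips staged twice, and the commit step removes the slot again because current ends up false.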
In case someone has already started reviewing v31,
these are the changes I made in v32:
0001:
* Added test: Check UNLISTEN * cancels a LISTEN in the same transaction
0002:
* Fixed initialization of QueueBackendStatus fields, corrected the
LISTEN + UNLISTEN same-transaction case, restructured AtAbort_Notify
to mirror AtCommit_Notify, and added a guard for OOM during staging.
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v32.patch (9.9K, 2-0001-optimize_listen_notify-v32.patch)
From 9550c98af2f24fb7653e9f18e451cf0131224a72 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 27 Dec 2025 08:06:21 +0100
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
* Check UNLISTEN * cancels a LISTEN in the same transaction
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 124 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 72 +++++++++++
2 files changed, 195 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..5d6bcce2b02 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,115 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: lunlisten_all notify1 lcheck
+step lunlisten_all: BEGIN; LISTEN c1; UNLISTEN *; COMMIT;
+step notify1: NOTIFY c1;
+step lcheck: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +204,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +215,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..d09c2297f09 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -43,6 +57,7 @@ step lcheck { SELECT 1 AS x; }
step lbegin { BEGIN; }
step lbegins { BEGIN ISOLATION LEVEL SERIALIZABLE; }
step lcommit { COMMIT; }
+step lunlisten_all { BEGIN; LISTEN c1; UNLISTEN *; COMMIT; }
teardown { UNLISTEN *; }
# In some tests we need a second listener, just to block the queue.
@@ -53,6 +68,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +107,27 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check UNLISTEN * cancels a LISTEN in the same transaction.
+permutation lunlisten_all notify1 lcheck
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +141,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v32.patch (54.0K, 3-0002-optimize_listen_notify-v32.patch)
From 67ea5434e40b88e996c7ce1c8f417801afababc1 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 27 Dec 2025 08:07:03 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
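
As a rough illustration of the mapping described above, here is a minimal
standalone sketch in plain C. It is not the patch's code: it uses a small
fixed-size table instead of dshash/DSA, does no locking, and every
identifier prefixed with "sketch_" or "Sketch" is hypothetical. The key is
zero-filled before use because the sketch, like a memcmp-based key
comparison, compares the whole key bytewise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NAMELEN 64        /* stand-in for NAMEDATALEN (assumption) */
#define SKETCH_MAX_CHANNELS 16   /* toy capacity; a real table would grow */
#define SKETCH_MAX_LISTENERS 8   /* toy capacity per channel */

typedef struct SketchKey
{
    unsigned int dboid;
    char         channel[SKETCH_NAMELEN];
} SketchKey;

typedef struct SketchEntry
{
    int       used;
    SketchKey key;
    int       listeners[SKETCH_MAX_LISTENERS];  /* backend numbers */
    int       nlisteners;
} SketchEntry;

static SketchEntry sketch_table[SKETCH_MAX_CHANNELS];

/* Zero-fill the key so that bytewise comparison is well defined. */
static void
sketch_prepare_key(SketchKey *key, unsigned int dboid, const char *channel)
{
    memset(key, 0, sizeof(*key));
    key->dboid = dboid;
    strncpy(key->channel, channel, SKETCH_NAMELEN - 1);
}

/* Find the entry for (dboid, channel); optionally create it. */
static SketchEntry *
sketch_find(unsigned int dboid, const char *channel, int create)
{
    SketchKey    key;
    SketchEntry *free_slot = NULL;

    sketch_prepare_key(&key, dboid, channel);
    for (int i = 0; i < SKETCH_MAX_CHANNELS; i++)
    {
        SketchEntry *e = &sketch_table[i];

        if (e->used && memcmp(&e->key, &key, sizeof(key)) == 0)
            return e;
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }
    if (!create || free_slot == NULL)
        return NULL;
    free_slot->used = 1;
    free_slot->key = key;
    free_slot->nlisteners = 0;
    return free_slot;
}

/* Register a listener; returns 1 on success, 0 if the toy table is full. */
static int
sketch_listen(unsigned int dboid, const char *channel, int procno)
{
    SketchEntry *e = sketch_find(dboid, channel, 1);

    if (e == NULL)
        return 0;
    for (int i = 0; i < e->nlisteners; i++)
        if (e->listeners[i] == procno)
            return 1;           /* already registered */
    if (e->nlisteners >= SKETCH_MAX_LISTENERS)
        return 0;
    e->listeners[e->nlisteners++] = procno;
    return 1;
}

/* Copy the listeners of one channel into out[]; returns how many. */
static int
sketch_targets(unsigned int dboid, const char *channel, int *out, int outlen)
{
    SketchEntry *e = sketch_find(dboid, channel, 0);
    int          n = 0;

    if (e == NULL)
        return 0;
    for (int i = 0; i < e->nlisteners && n < outlen; i++)
        out[n++] = e->listeners[i];
    return n;
}
```

A sender would call sketch_targets() once per notified channel and signal
only the returned backends; a channel with no entry yields no signals.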
Two-phase staging pattern
-------------------------
To ensure transaction safety, LISTEN/UNLISTEN operations use a two-phase
staging pattern. Memory allocation and hash table modifications happen
in PreCommit_Notify (before committing to clog), where failures can
safely abort the transaction. After committing to clog, AtCommit_Notify
only looks up entries that were already added during PreCommit_Notify
and sets their boolean flags, so there is no OOM hazard.
Each listener entry in the shared hash uses a ListenerEntry struct
containing the backend's ProcNumber and two boolean flags: "staged" is
set during PreCommit_Notify, while "current" is copied from staged
during AtCommit_Notify and is what other backends read.
For LISTEN, PreCommit_Notify allocates memory and adds an entry with
staged=true and current=false, then AtCommit_Notify copies staged to
current. For UNLISTEN, PreCommit_Notify sets staged=false on the
existing entry, then AtCommit_Notify copies staged to current and
removes the entry if false.
On abort, staged changes are reverted to match current, and entries
where current=false (never committed) are removed.
Signal arrays for sending notifications are also preallocated in
PreCommit_Notify to avoid allocation failures after committing to clog.
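
The staged/current transitions described above can be sketched as a small
state machine. This is a simplified standalone model for a single channel,
assuming a fixed-size array and no locking; the "Sketch"/"sketch_" names are
hypothetical and do not appear in the patch. The only step that can fail
(running out of room) happens in the staging function, which models the
pre-commit phase.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX 8            /* toy capacity (assumption) */

typedef struct SketchListener
{
    int  procno;
    bool staged;                /* set before commit */
    bool current;               /* what senders look at */
} SketchListener;

typedef struct SketchChannel
{
    SketchListener l[SKETCH_MAX];
    int            n;
} SketchChannel;

static int
sketch_index(const SketchChannel *ch, int procno)
{
    for (int i = 0; i < ch->n; i++)
        if (ch->l[i].procno == procno)
            return i;
    return -1;
}

static void
sketch_remove(SketchChannel *ch, int i)
{
    ch->n--;
    if (i < ch->n)
        memmove(&ch->l[i], &ch->l[i + 1],
                sizeof(SketchListener) * (ch->n - i));
}

/* Pre-commit LISTEN: may fail, which is safe because nothing is committed. */
static bool
sketch_stage_listen(SketchChannel *ch, int procno)
{
    int i = sketch_index(ch, procno);

    if (i >= 0)
    {
        ch->l[i].staged = true;
        return true;
    }
    if (ch->n >= SKETCH_MAX)
        return false;
    ch->l[ch->n].procno = procno;
    ch->l[ch->n].staged = true;
    ch->l[ch->n].current = false;
    ch->n++;
    return true;
}

/* Pre-commit UNLISTEN: only flips a flag on an existing entry. */
static void
sketch_stage_unlisten(SketchChannel *ch, int procno)
{
    int i = sketch_index(ch, procno);

    if (i >= 0)
        ch->l[i].staged = false;
}

/* Post-commit or abort: no allocation, only flag copies and removal. */
static void
sketch_finish(SketchChannel *ch, int procno, bool committed)
{
    int i = sketch_index(ch, procno);

    if (i < 0)
        return;
    if (committed)
        ch->l[i].current = ch->l[i].staged;
    else
        ch->l[i].staged = ch->l[i].current;
    if (!ch->l[i].current)
        sketch_remove(ch, i);
}

/* What another backend would observe. */
static bool
sketch_is_listening(const SketchChannel *ch, int procno)
{
    int i = sketch_index(ch, procno);

    return i >= 0 && ch->l[i].current;
}
```

In this model a staged LISTEN is invisible to senders until
sketch_finish(..., true) runs, and an aborted UNLISTEN leaves the listener
registered.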
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
Idle backends that are stationary at a position before the old queue
head are signaled, since they might be interested in the notifications
in between their current position and the old queue head.
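
One way to read the rules above is as a per-backend decision function. The
following is a simplified standalone sketch of that reading, not the
patch's SignalBackends: it ignores locking, wakeup-pending bookkeeping, and
cross-database handling, and the "Sketch"/"sketch_" names are hypothetical.
Positions are compared by (page, offset), in the spirit of the
QUEUE_POS_PRECEDES macro.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct SketchPos
{
    long long page;
    int       offset;
} SketchPos;

typedef struct SketchBackend
{
    SketchPos pos;              /* queue position read so far */
    bool      isAdvancing;      /* currently reading the queue */
    SketchPos advancingPos;     /* where that read will end up */
} SketchBackend;

typedef enum SketchAction
{
    SKETCH_NOTHING,
    SKETCH_SIGNAL,
    SKETCH_DIRECT_ADVANCE
} SketchAction;

static bool
sketch_pos_precedes(SketchPos x, SketchPos y)
{
    return x.page < y.page || (x.page == y.page && x.offset < y.offset);
}

static bool
sketch_pos_equal(SketchPos x, SketchPos y)
{
    return x.page == y.page && x.offset == y.offset;
}

/*
 * Decide what the committing backend does for one listener, given the
 * queue head before (oldHead) and after (newHead) its own writes.
 */
static SketchAction
sketch_decide(const SketchBackend *b, SketchPos oldHead, SketchPos newHead,
              bool listensOnNotifiedChannel)
{
    /* Interested listeners must wake up and read. */
    if (listensOnNotifiedChannel)
        return SKETCH_SIGNAL;

    /* A backend already reading: signal only if it will stop short. */
    if (b->isAdvancing)
        return sketch_pos_precedes(b->advancingPos, newHead)
            ? SKETCH_SIGNAL : SKETCH_NOTHING;

    /* Idle at the old head: [oldHead, newHead) holds only our entries. */
    if (sketch_pos_equal(b->pos, oldHead))
        return SKETCH_DIRECT_ADVANCE;

    /* Idle but behind the old head: it may have unread entries of interest. */
    if (sketch_pos_precedes(b->pos, oldHead))
        return SKETCH_SIGNAL;

    return SKETCH_NOTHING;
}

/* Apply a direct advance; other actions leave the position untouched. */
static void
sketch_apply(SketchBackend *b, SketchAction action, SketchPos newHead)
{
    if (action == SKETCH_DIRECT_ADVANCE)
        b->pos = newHead;
}
```

The direct-advance branch is only sound because writers are serialized, so
an uninterested backend sitting exactly at oldHead cannot miss anything by
skipping to newHead.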
Other notes
-----------
The patch maintains dual data structures: a shared channelHash for
determining which backends to signal, and a local per-backend
listenChannelsHash for fast lock-free lookups during notification
processing. This avoids contention on the shared hash during the
high-frequency IsListeningOn checks that occur for every notification
read from the queue. Backends remain registered in the global listener
list as long as listenChannelsHash is non-empty.
This patch adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
There are no user-visible behavioral changes; this is an internal
optimization only.
---
src/backend/commands/async.c | 960 ++++++++++++++----
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
4 files changed, 787 insertions(+), 177 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index eb86402cae4..a9fbadc95b9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -64,20 +66,33 @@
* notifications, we can still call elog(ERROR, ...) and the transaction
* will roll back.
*
+ * PreCommit_Notify() also stages any pending LISTEN/UNLISTEN actions by
+ * adding entries to listenChannelsHash and the shared channelHash with
+ * staged=true (for LISTEN) or staged=false (for UNLISTEN). This is done
+ * before committing to clog so that failures can safely abort.
+ *
* Once we have put all of the notifications into the queue, we return to
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * commit the staged listen/unlisten changes by copying staged to current,
+ * removing entries where current becomes false. Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, or within the range
+ * written, avoiding unnecessary wakeups for idle listeners that have
+ * nothing to read. Backends that cannot be directly advanced are signaled
+ * if they are stuck behind the old queue head, or advancing to a position
+ * before the new queue head, since otherwise notifications could be delayed.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +152,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +180,37 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+
+typedef struct ListenerEntry
+{
+ ProcNumber procNo;
+ bool staged;
+ bool current;
+} ListenerEntry;
+
+typedef struct ChannelEntry
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ListenerEntry array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelEntry;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +273,14 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
- * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +298,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,9 +315,10 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * advancing their own positions. When holding both NotifyQueueLock and
+ * NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
@@ -288,11 +344,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +362,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +377,18 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
+ * listenChannelsHash identifies the channels we are listening to.
+ * Entries are added during PreCommit_Notify (before committing to clog) and
+ * removed on abort if the LISTEN was never committed. It is a hash table
+ * of channel names, allocated in TopMemoryContext.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
- * all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * all actions requested in the current transaction. During PreCommit_Notify,
+ * we stage these changes in listenChannelsHash and the shared channelHash.
+ * On abort, AtAbort_Notify cleans up any staged-but-uncommitted entries.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +457,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelHashtab; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +468,11 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelHash
+{
+ char channel[NAMEDATALEN];
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +490,36 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
+/*
+ * List of channels with staged listen/unlisten changes in the current
+ * transaction. Populated during PreCommit_Notify and used by AtCommit_Notify
+ * to copy staged values to current.
+ */
+static List *pendingListenChannels = NIL;
+
+/*
+ * Preallocated arrays for SignalBackends to avoid memory allocation after
+ * committing to clog. Allocated in PreCommit_Notify when there are pending
+ * notifications.
+ */
+static int32 *signalPids = NULL;
+static ProcNumber *signalProcnos = NULL;
+
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,14 +530,14 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
static void Exec_ListenPreCommit(void);
-static void Exec_ListenCommit(const char *channel);
-static void Exec_UnlistenCommit(const char *channel);
-static void Exec_UnlistenAllCommit(void);
+static void Exec_ListenPreCommitStage(const char *channel);
+static void Exec_UnlistenPreCommitStage(const char *channel);
+static void Exec_UnlistenAllPreCommitStage(void);
+static void CleanupListenersOnExit(void);
static bool IsListeningOn(const char *channel);
static void asyncQueueUnregister(void);
static bool asyncQueueIsFull(void);
@@ -456,16 +558,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +572,105 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelEntry),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +714,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +855,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelHashtab = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,8 +882,8 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
- * commit.
+ * Actual update of listenChannelsHash and channelHash happens during
+ * PreCommit_Notify, with staged changes committed in AtCommit_Notify.
*/
static void
queue_listen(ListenActionKind action, const char *channel)
@@ -782,30 +982,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelHash *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelHash *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -821,7 +1040,7 @@ pg_listening_channels(PG_FUNCTION_ARGS)
static void
Async_UnlistenOnExit(int code, Datum arg)
{
- Exec_UnlistenAllCommit();
+ CleanupListenersOnExit();
asyncQueueUnregister();
}
@@ -868,8 +1087,24 @@ PreCommit_Notify(void)
elog(DEBUG1, "PreCommit_Notify");
/* Preflight for any pending listen/unlisten actions */
+ if (pendingNotifies != NULL || pendingActions != NULL)
+ initChannelHash();
+
+ if (pendingNotifies != NULL)
+ {
+ if (signalPids == NULL)
+ signalPids = MemoryContextAlloc(TopMemoryContext,
+ MaxBackends * sizeof(int32));
+
+ if (signalProcnos == NULL)
+ signalProcnos = MemoryContextAlloc(TopMemoryContext,
+ MaxBackends * sizeof(ProcNumber));
+ }
+
if (pendingActions != NULL)
{
+ initListenChannelsHash();
+
foreach(p, pendingActions->actions)
{
ListenAction *actrec = (ListenAction *) lfirst(p);
@@ -878,12 +1113,13 @@ PreCommit_Notify(void)
{
case LISTEN_LISTEN:
Exec_ListenPreCommit();
+ Exec_ListenPreCommitStage(actrec->channel);
break;
case LISTEN_UNLISTEN:
- /* there is no Exec_UnlistenPreCommit() */
+ Exec_UnlistenPreCommitStage(actrec->channel);
break;
case LISTEN_UNLISTEN_ALL:
- /* there is no Exec_UnlistenAllPreCommit() */
+ Exec_UnlistenAllPreCommitStage();
break;
}
}
@@ -893,6 +1129,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelHashtab, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelHashtab != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelHash *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelHashtab);
+ while ((channelEntry = (struct ChannelHash *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1187,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1220,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1246,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -966,7 +1256,6 @@ PreCommit_Notify(void)
void
AtCommit_Notify(void)
{
- ListCell *p;
/*
* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
@@ -978,30 +1267,62 @@ AtCommit_Notify(void)
if (Trace_notify)
elog(DEBUG1, "AtCommit_Notify");
- /* Perform any pending listen/unlisten actions */
- if (pendingActions != NULL)
+ /* Commit staged listen/unlisten changes by copying staged to current */
+ if (pendingListenChannels != NIL)
{
- foreach(p, pendingActions->actions)
+ ListCell *lc;
+
+ foreach(lc, pendingListenChannels)
{
- ListenAction *actrec = (ListenAction *) lfirst(p);
+ char *channel = (char *) lfirst(lc);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
- switch (actrec->action)
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- case LISTEN_LISTEN:
- Exec_ListenCommit(actrec->channel);
- break;
- case LISTEN_UNLISTEN:
- Exec_UnlistenCommit(actrec->channel);
- break;
- case LISTEN_UNLISTEN_ALL:
- Exec_UnlistenAllCommit();
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ /* Commit staged value to current */
+ listeners[i].current = listeners[i].staged;
+
+ if (!listeners[i].current)
+ {
+ /* UNLISTEN committed: remove from local and shared */
+ (void) hash_search(listenChannelsHash, channel,
+ HASH_REMOVE, NULL);
+
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ entry = NULL;
+ }
+ }
break;
+ }
}
+
+ if (entry != NULL)
+ dshash_release_lock(channelHash, entry);
}
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1098,6 +1419,9 @@ Exec_ListenPreCommit(void)
QUEUE_BACKEND_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = max;
/* Insert backend into list of listeners at correct position */
if (prevListener != INVALID_PROC_NUMBER)
{
@@ -1127,99 +1451,213 @@ Exec_ListenPreCommit(void)
}
/*
- * Exec_ListenCommit --- subroutine for AtCommit_Notify
+ * Exec_ListenPreCommitStage --- subroutine for PreCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Stage a LISTEN by adding entries to listenChannelsHash and the shared
+ * channelHash with staged=true, current=false. The staged value is copied
+ * to current in AtCommit_Notify.
*/
static void
-Exec_ListenCommit(const char *channel)
+Exec_ListenPreCommitStage(const char *channel)
{
- MemoryContext oldcontext;
-
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
- return;
-
- /*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
- */
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ bool found;
+ ListenerEntry *listeners;
+
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ pendingListenChannels = lappend(pendingListenChannels, pstrdup(channel));
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ {
+ entry->listenersArray = InvalidDsaPointer;
+ entry->numListeners = 0;
+ entry->allocatedListeners = 0;
+ }
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ listeners[i].staged = true;
+ dshash_release_lock(channelHash, entry);
+ return;
+ }
+ }
+
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * new_size);
+ ListenerEntry *new_listeners = (ListenerEntry *) dsa_get_address(channelDSA, new_array);
+
+ memcpy(new_listeners, listeners, sizeof(ListenerEntry) * entry->numListeners);
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners].procNo = MyProcNumber;
+ listeners[entry->numListeners].staged = true;
+ listeners[entry->numListeners].current = false;
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
}
/*
- * Exec_UnlistenCommit --- subroutine for AtCommit_Notify
+ * Exec_UnlistenPreCommitStage --- subroutine for PreCommit_Notify
*
- * Remove the specified channel name from listenChannels.
+ * Stage an UNLISTEN by setting staged=false on our entry in channelHash.
+ * The staged value is copied to current in AtCommit_Notify, and the entry
+ * is removed if current becomes false.
*/
static void
-Exec_UnlistenCommit(const char *channel)
+Exec_UnlistenPreCommitStage(const char *channel)
{
- ListCell *q;
+ ChannelHashKey key;
+ ChannelEntry *entry;
+ ListenerEntry *listeners;
- if (Trace_notify)
- elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ return;
- foreach(q, listenChannels)
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (listeners[i].procNo == MyProcNumber && listeners[i].staged)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
+ listeners[i].staged = false;
+
+ pendingListenChannels = lappend(pendingListenChannels, pstrdup(channel));
break;
}
}
- /*
- * We do not complain about unlistening something not being listened;
- * should we?
- */
+ dshash_release_lock(channelHash, entry);
}
/*
- * Exec_UnlistenAllCommit --- subroutine for AtCommit_Notify
+ * Exec_UnlistenAllPreCommitStage --- subroutine for PreCommit_Notify
*
- * Unlisten on all channels for this backend.
+ * Stage UNLISTEN * by setting staged=false on all our entries in channelHash.
*/
static void
-Exec_UnlistenAllCommit(void)
+Exec_UnlistenAllPreCommitStage(void)
{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ListenerEntry *listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber && listeners[i].staged)
+ {
+ listeners[i].staged = false;
+ pendingListenChannels = lappend(pendingListenChannels,
+ pstrdup(entry->key.channel));
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
+}
+
+/*
+ * CleanupListenersOnExit --- called from Async_UnlistenOnExit
+ *
+ * Remove this backend from all channels in the shared hash.
+ */
+static void
+CleanupListenersOnExit(void)
+{
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
if (Trace_notify)
- elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
+ elog(DEBUG1, "CleanupListenersOnExit(%d)", MyProcPid);
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
+ return;
+
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
+ {
+ if (entry->key.dboid == MyDatabaseId)
+ {
+ ListenerEntry *listeners;
+ int i;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
+ }
+ }
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1667,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1679,9 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(MyProcNumber), 0, 0);
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +2006,21 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are not interested in our notifies, that are known
+ * to still be positioned at the old queue head, or anywhere in the
+ * queue region we just wrote, can be safely advanced directly to the
+ * new head, since that region is known to contain only our own
+ * notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
+ *
+ * Backends that are not interested in our notifies, that are advancing
+ * to a target position before the new queue head, or that are not
+ * advancing and are stationary at a position before the old queue head
+ * need to be signaled since notifications could otherwise be delayed.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1580,60 +2030,106 @@ asyncQueueFillWarning(void)
static void
SignalBackends(void)
{
- int32 *pids;
- ProcNumber *procnos;
int count;
+ ListCell *lc;
- /*
- * Identify backends that we need to signal. We don't want to send
- * signals while holding the NotifyQueueLock, so this loop just builds a
- * list of target PIDs.
- *
- * XXX in principle these pallocs could fail, which would be bad. Maybe
- * preallocate the arrays? They're not that large, though.
- */
- pids = (int32 *) palloc(MaxBackends * sizeof(int32));
- procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ Assert(signalPids != NULL && signalProcnos != NULL);
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelEntry *entry = NULL;
+ ListenerEntry *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i;
+ int32 pid;
+ QueuePosition pos;
+
+ if (!listeners[j].current)
+ continue;
+
+ i = listeners[j].procNo;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ signalPids[count] = pid;
+ signalProcnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ signalPids[count] = pid;
+ signalProcnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
/* Now send signals */
for (int i = 0; i < count; i++)
{
- int32 pid = pids[i];
+ int32 pid = signalPids[i];
/*
* If we are signaling our own process, no need to involve the kernel;
@@ -1651,12 +2147,9 @@ SignalBackends(void)
* NotifyQueueLock; which is unlikely but certainly possible. So we
* just log a low-level debug message if it happens.
*/
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
+ if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, signalProcnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
}
-
- pfree(pids);
- pfree(procnos);
}
/*
@@ -1664,18 +2157,71 @@ SignalBackends(void)
*
* This is called at transaction abort.
*
- * Gets rid of pending actions and outbound notifies that we would have
- * executed if the transaction got committed.
+ * Revert any staged listen/unlisten changes and clean up transaction state.
*/
void
AtAbort_Notify(void)
{
/*
- * If we LISTEN but then roll back the transaction after PreCommit_Notify,
- * we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * Revert staged listen/unlisten changes. For new LISTENs (current=false),
+ * remove from both local and shared hash. For UNLISTENs (current=true),
+ * just revert staged back to current.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (pendingListenChannels != NIL && channelHash != NULL)
+ {
+ ListCell *lc;
+
+ foreach(lc, pendingListenChannels)
+ {
+ char *channel = (char *) lfirst(lc);
+ ChannelHashKey key;
+ ChannelEntry *entry;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry != NULL)
+ {
+ ListenerEntry *listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ /* Revert staged value to current */
+ listeners[i].staged = listeners[i].current;
+
+ if (!listeners[i].current)
+ {
+ /* New LISTEN being aborted: remove from local and shared */
+ if (listenChannelsHash != NULL)
+ (void) hash_search(listenChannelsHash, channel,
+ HASH_REMOVE, NULL);
+
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+ }
+ break;
+ }
+ }
+
+ if (entry->numListeners == 0)
+ {
+ if (DsaPointerIsValid(entry->listenersArray))
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ }
+ else
+ dshash_release_lock(channelHash, entry);
+ }
+ }
+ }
+
+
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1854,20 +2400,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2509,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2051,7 +2608,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2306,7 +2863,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2410,13 +2967,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelHashtab == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2429,10 +2988,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelHash);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelHashtab =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2440,22 +3011,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelHashtab */
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelHashtab != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelHashtab */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelHashtab,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2493,7 +3084,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2505,6 +3096,8 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
+ pendingListenChannels = NIL;
}
/*
@@ -2515,3 +3108,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..32b0b21f184 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -371,6 +371,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 533344509e9..277a78e7954 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -102,6 +102,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..973d4a449fd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -421,6 +421,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelEntry
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2025-12-28 16:10 Joel Jacobson <[email protected]>
parent: Joel Jacobson <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2025-12-28 16:10 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Chao Li <[email protected]>; pgsql-hackers
On Sat, Dec 27, 2025, at 13:40, Joel Jacobson wrote:
> On Fri, Dec 26, 2025, at 21:12, Joel Jacobson wrote:
>> On Tue, Nov 25, 2025, at 21:17, Tom Lane wrote:
>>> "Joel Jacobson" <[email protected]> writes:
>>>> It looks to me like it would be best with two boolean fields; one
>>>> boolean to stage the updates during PreCommit_Notify, that each
>>>> pendingActions could flip back and forth, and another boolean that
>>>> represents the current value, which we would overwrite with the staged
>>>> value during AtCommit_Notify.
>>>
>>> +1, I had a feeling that a single boolean wouldn't quite do it.
>>> (There are various ways we could define the states, but what
>>> you say above seems pretty reasonable.)
>>
>> I've implemented the two boolean approach and think it's good.
I've reworked the staging mechanism for LISTEN/UNLISTEN. The new design
tracks LISTEN state at three levels:
* pendingListenChannels: per-transaction pending changes
* listenChannelsHash: per-backend committed state cache
* channelHash: cluster-wide shared state
The first two are local hash tables; the third is a dshash in shared
memory. PreCommit_Notify updates all three (doing any allocations
before clog commit for OOM safety), and AtCommit_Notify finalizes the
changes.
The previous version tried to track pending state in the shared
ListenerEntry itself using two booleans (staged/current). This worked,
but I think the three-layer approach is cleaner.
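For readers who did not follow the earlier versions, the staged/current scheme mentioned above can be sketched roughly as follows. This is a simplified model, not code from the patch: the struct and function names are made up for illustration, and the real ListenerEntry also carries a procNo and lives in a DSA-allocated array.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two flags kept per listener. */
typedef struct ListenerFlags
{
    bool staged;    /* value the open transaction wants after commit */
    bool current;   /* committed value that NOTIFY senders look at */
} ListenerFlags;

/* LISTEN and UNLISTEN only touch the staged flag before commit. */
void stage_listen(ListenerFlags *f)   { f->staged = true; }
void stage_unlisten(ListenerFlags *f) { f->staged = false; }

/* At commit the staged value becomes the current value. */
void apply_commit(ListenerFlags *f)   { f->current = f->staged; }

/* At abort the staged value is reset to the committed value. */
void apply_abort(ListenerFlags *f)    { f->staged = f->current; }

/* After commit or abort, an entry with current == false can be removed. */
bool removable(const ListenerFlags *f) { return !f->current; }
```

In this model a sender would only ever consult `current`, so a LISTEN that is later rolled back never becomes visible to it.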
The main benefit is that pendingListenChannels is now a hash table
instead of a simple List. In the old design, LISTEN foo; UNLISTEN foo;
LISTEN foo would create three list entries that all had to be processed
at commit. The new design collapses this to one hash entry storing the
final state, which we just apply at commit.
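The collapsing behaviour described above can be illustrated with a small sketch. It is only a model under stated assumptions: a fixed-size array with a linear scan stands in for the local hash table, and PendingSet, pending_record, and pending_lookup are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16   /* illustrative cap, not from the patch */
#define CHANNEL_LEN 64   /* stands in for NAMEDATALEN */

/* One entry per distinct channel touched in the transaction. */
typedef struct PendingChange
{
    char channel[CHANNEL_LEN];
    bool listening;      /* desired state to apply at commit */
} PendingChange;

typedef struct PendingSet
{
    PendingChange items[MAX_PENDING];
    size_t        count;
} PendingSet;

/*
 * Record a LISTEN (listening = true) or UNLISTEN (listening = false).
 * A later action on the same channel overwrites the earlier one, so
 * only the final state survives.  Returns false if the name is too
 * long or the table is full.
 */
bool
pending_record(PendingSet *set, const char *channel, bool listening)
{
    if (strlen(channel) >= CHANNEL_LEN)
        return false;

    for (size_t i = 0; i < set->count; i++)
    {
        if (strcmp(set->items[i].channel, channel) == 0)
        {
            set->items[i].listening = listening;
            return true;
        }
    }

    if (set->count == MAX_PENDING)
        return false;

    strcpy(set->items[set->count].channel, channel);  /* length checked above */
    set->items[set->count].listening = listening;
    set->count++;
    return true;
}

/* Return the pending entry for a channel, or NULL if none exists. */
const PendingChange *
pending_lookup(const PendingSet *set, const char *channel)
{
    for (size_t i = 0; i < set->count; i++)
    {
        if (strcmp(set->items[i].channel, channel) == 0)
            return &set->items[i];
    }
    return NULL;
}
```

With this shape, LISTEN foo; UNLISTEN foo; LISTEN foo leaves a single entry for foo with listening set to true, which is all that commit processing would need to look at.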
A nice bonus is that UNLISTEN became simpler. In PreCommit_Notify it
just records the intent in the local pending hash. The old design had
to acquire an exclusive lock on the shared dshash entry to flip the
staged boolean. UNLISTEN ALL is similar -- it now just scans the
backend's own local hashes instead of the cluster-wide shared hash.
The tradeoff is one additional local hash table per transaction that
executes LISTEN/UNLISTEN. This seems like a reasonable price for the
simpler logic.
I also renamed a few things for clarity: ChannelEntry is now
ChannelListeners (since it holds the array of listeners for a channel),
and channelHashtab is now channelSet (since it's just a set of channel
names, not a hash of channel-related data).
/Joel
Attachments:
[application/octet-stream] 0001-optimize_listen_notify-v33.patch (9.9K, 2-0001-optimize_listen_notify-v33.patch)
download | inline diff:
From d31caa5da633c0a705045ba17e36363f32bdc5c5 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 27 Dec 2025 08:06:21 +0100
Subject: [PATCH 1/2] Improve LISTEN/NOTIFY test coverage
This adds isolation tests to cover previously untested code paths:
* Check simple NOTIFY reparenting when parent has no action
* Check LISTEN reparenting in subtransaction
* Check LISTEN merge path when both outer and inner transactions have actions
* Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions)
* Check notification_match function (triggered by hash table duplicate detection)
* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend
* Check UNLISTEN * cancels a LISTEN in the same transaction
This also adds a test to prepare for the next patch:
* Check ChannelHashAddListener array growth
---
src/test/isolation/expected/async-notify.out | 124 ++++++++++++++++++-
src/test/isolation/specs/async-notify.spec | 72 +++++++++++
2 files changed, 195 insertions(+), 1 deletion(-)
diff --git a/src/test/isolation/expected/async-notify.out b/src/test/isolation/expected/async-notify.out
index 556e1805893..5d6bcce2b02 100644
--- a/src/test/isolation/expected/async-notify.out
+++ b/src/test/isolation/expected/async-notify.out
@@ -1,4 +1,4 @@
-Parsed test spec with 3 sessions
+Parsed test spec with 7 sessions
starting permutation: listenc notify1 notify2 notify3 notifyf
step listenc: LISTEN c1; LISTEN c2;
@@ -47,6 +47,115 @@ notifier: NOTIFY "c2" with payload "payload" from notifier
notifier: NOTIFY "c1" with payload "payloads" from notifier
notifier: NOTIFY "c2" with payload "payloads" from notifier
+starting permutation: listenc notifys_simple
+step listenc: LISTEN c1; LISTEN c2;
+step notifys_simple:
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+
+notifier: NOTIFY "c1" with payload "simple1" from notifier
+notifier: NOTIFY "c2" with payload "simple2" from notifier
+
+starting permutation: lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+step lsbegin: BEGIN;
+step lslisten_outer: LISTEN c3;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrelease: RELEASE SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify: NOTIFY c1, 'subxact_test';
+listen_subxact: NOTIFY "c1" with payload "subxact_test" from listen_subxact
+
+starting permutation: lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+step lsbegin: BEGIN;
+step lssavepoint: SAVEPOINT s1;
+step lslisten: LISTEN c1; LISTEN c2;
+step lsrollback: ROLLBACK TO SAVEPOINT s1;
+step lscommit: COMMIT;
+step lsnotify_check: NOTIFY c1, 'should_not_receive';
+
+starting permutation: lunlisten_all notify1 lcheck
+step lunlisten_all: BEGIN; LISTEN c1; UNLISTEN *; COMMIT;
+step notify1: NOTIFY c1;
+step lcheck: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+
+starting permutation: listenc notify_many_with_dup
+step listenc: LISTEN c1; LISTEN c2;
+step notify_many_with_dup:
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+
+pg_notify
+---------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+(17 rows)
+
+pg_notify
+---------
+
+(1 row)
+
+notifier: NOTIFY "c1" with payload "msg1" from notifier
+notifier: NOTIFY "c1" with payload "msg2" from notifier
+notifier: NOTIFY "c1" with payload "msg3" from notifier
+notifier: NOTIFY "c1" with payload "msg4" from notifier
+notifier: NOTIFY "c1" with payload "msg5" from notifier
+notifier: NOTIFY "c1" with payload "msg6" from notifier
+notifier: NOTIFY "c1" with payload "msg7" from notifier
+notifier: NOTIFY "c1" with payload "msg8" from notifier
+notifier: NOTIFY "c1" with payload "msg9" from notifier
+notifier: NOTIFY "c1" with payload "msg10" from notifier
+notifier: NOTIFY "c1" with payload "msg11" from notifier
+notifier: NOTIFY "c1" with payload "msg12" from notifier
+notifier: NOTIFY "c1" with payload "msg13" from notifier
+notifier: NOTIFY "c1" with payload "msg14" from notifier
+notifier: NOTIFY "c1" with payload "msg15" from notifier
+notifier: NOTIFY "c1" with payload "msg16" from notifier
+notifier: NOTIFY "c1" with payload "msg17" from notifier
+
+starting permutation: listenc llisten l2listen l3listen lslisten
+step listenc: LISTEN c1; LISTEN c2;
+step llisten: LISTEN c1; LISTEN c2;
+step l2listen: LISTEN c1;
+step l3listen: LISTEN c1;
+step lslisten: LISTEN c1; LISTEN c2;
+
starting permutation: llisten notify1 notify2 notify3 notifyf lcheck
step llisten: LISTEN c1; LISTEN c2;
step notify1: NOTIFY c1;
@@ -95,6 +204,8 @@ listener: NOTIFY "c2" with payload "" from notifier
starting permutation: l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
step l2listen: LISTEN c1;
+listener2: NOTIFY "c1" with payload "" from notifier
+listener2: NOTIFY "c1" with payload "" from notifier
step l2begin: BEGIN;
step notify1: NOTIFY c1;
step lbegins: BEGIN ISOLATION LEVEL SERIALIZABLE;
@@ -104,6 +215,17 @@ step l2commit: COMMIT;
listener2: NOTIFY "c1" with payload "" from notifier
step l2stop: UNLISTEN *;
+starting permutation: lch_listen nch_notify lch_check
+step lch_listen: LISTEN ch;
+step nch_notify: NOTIFY ch, 'aa';
+step lch_check: SELECT 1 AS x;
+x
+-
+1
+(1 row)
+
+listener_ch: NOTIFY "ch" with payload "aa" from notifier_ch
+
starting permutation: llisten lbegin usage bignotify usage
step llisten: LISTEN c1; LISTEN c2;
step lbegin: BEGIN;
diff --git a/src/test/isolation/specs/async-notify.spec b/src/test/isolation/specs/async-notify.spec
index 0b8cfd91083..d09c2297f09 100644
--- a/src/test/isolation/specs/async-notify.spec
+++ b/src/test/isolation/specs/async-notify.spec
@@ -31,6 +31,20 @@ step notifys1 {
ROLLBACK TO SAVEPOINT s2;
COMMIT;
}
+step notifys_simple {
+ BEGIN;
+ SAVEPOINT s1;
+ NOTIFY c1, 'simple1';
+ NOTIFY c2, 'simple2';
+ RELEASE SAVEPOINT s1;
+ COMMIT;
+}
+step notify_many_with_dup {
+ BEGIN;
+ SELECT pg_notify('c1', 'msg' || s::text) FROM generate_series(1, 17) s;
+ SELECT pg_notify('c1', 'msg1');
+ COMMIT;
+}
step usage { SELECT pg_notification_queue_usage() > 0 AS nonzero; }
step bignotify { SELECT count(pg_notify('c1', s::text)) FROM generate_series(1, 1000) s; }
teardown { UNLISTEN *; }
@@ -43,6 +57,7 @@ step lcheck { SELECT 1 AS x; }
step lbegin { BEGIN; }
step lbegins { BEGIN ISOLATION LEVEL SERIALIZABLE; }
step lcommit { COMMIT; }
+step lunlisten_all { BEGIN; LISTEN c1; UNLISTEN *; COMMIT; }
teardown { UNLISTEN *; }
# In some tests we need a second listener, just to block the queue.
@@ -53,6 +68,38 @@ step l2begin { BEGIN; }
step l2commit { COMMIT; }
step l2stop { UNLISTEN *; }
+# Third listener session for testing array growth.
+
+session listener3
+step l3listen { LISTEN c1; }
+teardown { UNLISTEN *; }
+
+# Listener session for cross-session notification test with channel 'ch'.
+
+session listener_ch
+step lch_listen { LISTEN ch; }
+step lch_check { SELECT 1 AS x; }
+teardown { UNLISTEN *; }
+
+# Notifier session for cross-session notification test with channel 'ch'.
+
+session notifier_ch
+step nch_notify { NOTIFY ch, 'aa'; }
+
+# Session for testing LISTEN in subtransaction with separate steps.
+
+session listen_subxact
+step lsbegin { BEGIN; }
+step lslisten_outer { LISTEN c3; }
+step lssavepoint { SAVEPOINT s1; }
+step lslisten { LISTEN c1; LISTEN c2; }
+step lsrelease { RELEASE SAVEPOINT s1; }
+step lsrollback { ROLLBACK TO SAVEPOINT s1; }
+step lscommit { COMMIT; }
+step lsnotify { NOTIFY c1, 'subxact_test'; }
+step lsnotify_check { NOTIFY c1, 'should_not_receive'; }
+teardown { UNLISTEN *; }
+
# Trivial cases.
permutation listenc notify1 notify2 notify3 notifyf
@@ -60,6 +107,27 @@ permutation listenc notify1 notify2 notify3 notifyf
# Check simple and less-simple deduplication.
permutation listenc notifyd1 notifyd2 notifys1
+# Check simple NOTIFY reparenting when parent has no action.
+permutation listenc notifys_simple
+
+# Check LISTEN reparenting in subtransaction.
+permutation lsbegin lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN merge path when both outer and inner transactions have actions.
+permutation lsbegin lslisten_outer lssavepoint lslisten lsrelease lscommit lsnotify
+
+# Check LISTEN abort path (ROLLBACK TO SAVEPOINT discards pending actions).
+permutation lsbegin lssavepoint lslisten lsrollback lscommit lsnotify_check
+
+# Check UNLISTEN * cancels a LISTEN in the same transaction.
+permutation lunlisten_all notify1 lcheck
+
+# Check notification_match function (triggered by hash table duplicate detection).
+permutation listenc notify_many_with_dup
+
+# Check ChannelHashAddListener array growth.
+permutation listenc llisten l2listen l3listen lslisten
+
# Cross-backend notification delivery. We use a "select 1" to force the
# listener session to check for notifies. In principle we could just wait
# for delivery, but that would require extra support in isolationtester
@@ -73,6 +141,10 @@ permutation listenc llisten notify1 notify2 notify3 notifyf lcheck
# and notify queue is not empty
permutation l2listen l2begin notify1 lbegins llisten lcommit l2commit l2stop
+# Check that notifications sent from a backend that has not done LISTEN
+# are properly delivered to a listener in another backend.
+permutation lch_listen nch_notify lch_check
+
# Verify that pg_notification_queue_usage correctly reports a non-zero result,
# after submitting notifications while another connection is listening for
# those notifications and waiting inside an active transaction. We have to
--
2.50.1
[application/octet-stream] 0002-optimize_listen_notify-v33.patch (56.8K, 3-0002-optimize_listen_notify-v33.patch)
From feb641cc6e69ae21e3c804979b3335f1b4c6d6cc Mon Sep 17 00:00:00 2001
From: Joel Jacobson <[email protected]>
Date: Sat, 27 Dec 2025 08:07:03 +0100
Subject: [PATCH 2/2] Optimize LISTEN/NOTIFY with shared channel map and direct
advancement
This patch reworks the LISTEN/NOTIFY signaling path to avoid the
long-standing inefficiency where every commit wakes all listening
backends in the same database, even those that are listening on
completely different channels.
Problem
-------
At present, SignalBackends has no central knowledge of which backend
listens on which channel. When a backend commits a transaction that
issued NOTIFY, it simply iterates over all registered listeners in the
same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal.
That behavior is fine when all listeners are on the same channel, but
when many backends are listening on different channels, each NOTIFY
triggers a storm of unnecessary wakeups and context switches. As the
number of idle listeners grows, this often becomes the bottleneck and
throughput drops sharply.
Overview of the solution
------------------------
This patch introduces a lazily-created dynamic shared hash (dshash)
backed by dynamic shared memory (DSA) that maps (dboid, channel) to
arrays of listening backends (ProcNumbers). This allows the sender to
target only those backends actually listening on the channels for which
it has queued notifications.
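As a rough illustration of the mapping described above, here is a minimal
single-process sketch. It is not the patch's code: it uses plain malloc/realloc
where the patch uses dshash and DSA, and the names ChannelSketch, channel_init
and channel_add_listener are hypothetical. The doubling growth policy is also
an assumption; only the initial array size of 4 comes from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHANNEL_NAME_MAX 64		/* stands in for NAMEDATALEN */
#define INITIAL_LISTENERS 4		/* mirrors INITIAL_LISTENERS_ARRAY_SIZE */

/* One map entry: (dboid, channel) -> growable array of proc numbers. */
typedef struct ChannelSketch
{
	unsigned int dboid;
	char		channel[CHANNEL_NAME_MAX];
	int		   *procs;			/* listener proc numbers */
	int			nprocs;			/* slots in use */
	int			allocated;		/* slots allocated */
} ChannelSketch;

/* Initialize an entry with an empty listener array. */
static void
channel_init(ChannelSketch *ch, unsigned int dboid, const char *channel)
{
	memset(ch, 0, sizeof(*ch));
	ch->dboid = dboid;
	/* the memset above guarantees NUL termination */
	strncpy(ch->channel, channel, CHANNEL_NAME_MAX - 1);
}

/* Add a listener once, growing the array when full.  Returns 0 on success. */
static int
channel_add_listener(ChannelSketch *ch, int procno)
{
	assert(ch != NULL);

	for (int i = 0; i < ch->nprocs; i++)
		if (ch->procs[i] == procno)
			return 0;			/* already registered */

	if (ch->nprocs == ch->allocated)
	{
		/* assumed policy: start at INITIAL_LISTENERS, then double */
		int			newsize = ch->allocated ? ch->allocated * 2 : INITIAL_LISTENERS;
		int		   *p = realloc(ch->procs, newsize * sizeof(int));

		if (p == NULL)
			return -1;
		ch->procs = p;
		ch->allocated = newsize;
	}
	ch->procs[ch->nprocs++] = procno;
	return 0;
}
```

With an entry like this, a sender only needs to look up the channels it
notified and walk the matching arrays, rather than visiting every listener.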
LISTEN state tracking
---------------------
LISTEN state is tracked at three levels:
- pendingListenChannels: per-transaction pending changes
- listenChannelsHash: per-backend committed state cache
- channelHash: cluster-wide shared state
The first two are local hash tables, the third is a dshash in shared
memory. PreCommit_Notify updates all three (doing any allocations before
clog commit for OOM safety), and AtCommit_Notify finalizes the changes.
Using a hash table for pendingListenChannels provides automatic
deduplication: LISTEN foo; UNLISTEN foo; LISTEN foo collapses to one
entry storing the final state, which we just apply at commit.
For LISTEN, PreCommit_Notify pre-allocates entries in both the local
listenChannelsHash and the shared channelHash (with listening=false).
AtCommit_Notify then sets listening=true.
For UNLISTEN, PreCommit_Notify only records the intent in
pendingListenChannels. AtCommit_Notify removes the entry from
channelHash and listenChannelsHash.
On abort, entries with listening=false (staged but never committed) are
removed from channelHash and listenChannelsHash.
Signal arrays for sending notifications are also preallocated in
PreCommit_Notify to avoid allocation failures after committing to clog.
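The stage/commit/abort protocol above can be sketched in isolation as
follows. This is a simplified model under stated assumptions, not the patch's
implementation: a fixed-size array replaces the DSA-backed listener array, and
ListenerSlot, ChannelSlots, stage_listen, commit_listen and abort_listen are
hypothetical names. The point it shows is that the only step that can fail
runs before commit, while the post-commit step merely flips a flag.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LISTENERS 8			/* fixed capacity, for the sketch only */

typedef struct ListenerSlot
{
	int			procno;
	bool		listening;		/* true once the LISTEN has committed */
} ListenerSlot;

typedef struct ChannelSlots
{
	ListenerSlot slots[MAX_LISTENERS];
	int			n;
} ChannelSlots;

/* Return the slot index for procno, or -1 if absent. */
static int
find_slot(const ChannelSlots *ch, int procno)
{
	for (int i = 0; i < ch->n; i++)
		if (ch->slots[i].procno == procno)
			return i;
	return -1;
}

/* Pre-commit: reserve a slot with listening=false.  May fail; that is safe. */
static bool
stage_listen(ChannelSlots *ch, int procno)
{
	if (find_slot(ch, procno) >= 0)
		return true;			/* already staged or committed */
	if (ch->n == MAX_LISTENERS)
		return false;			/* caller can still abort the transaction */
	ch->slots[ch->n].procno = procno;
	ch->slots[ch->n].listening = false;
	ch->n++;
	return true;
}

/* Post-commit: flip the flag.  No allocation, so nothing can fail here. */
static void
commit_listen(ChannelSlots *ch, int procno)
{
	int			i = find_slot(ch, procno);

	assert(i >= 0);				/* must have been staged */
	ch->slots[i].listening = true;
}

/* Abort: drop the slot only if it was staged but never committed. */
static void
abort_listen(ChannelSlots *ch, int procno)
{
	int			i = find_slot(ch, procno);

	if (i >= 0 && !ch->slots[i].listening)
	{
		ch->n--;
		for (; i < ch->n; i++)
			ch->slots[i] = ch->slots[i + 1];
	}
}

/* Count committed listeners, i.e. those a sender would signal. */
static int
count_active(const ChannelSlots *ch)
{
	int			count = 0;

	for (int i = 0; i < ch->n; i++)
		if (ch->slots[i].listening)
			count++;
	return count;
}
```

A staged slot is invisible to senders until commit_listen runs, and an abort
leaves previously committed listeners untouched.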
Direct advancement
------------------
A further optimization avoids signaling idle backends that are not
listening on any of the channels notified within the transaction.
While queuing notifications, PreCommit_Notify records the queue head
position both before and after writing its notifications. Because all
writers are serialized by the existing cluster-wide heavyweight lock on
"database 0", no backend (from any database) can insert entries between
those two points. This guarantees that the region [oldHead, newHead)
contains only the entries written by our commit.
SignalBackends uses this fact to directly advance any backend still
positioned at oldHead up to newHead, avoiding a needless wakeup for
listeners that would otherwise not find any notifies of interest.
To handle advancing backends correctly, each backend's entry tracks both
whether it is currently advancing (isAdvancing) and the target position
it is advancing to (advancingPos). This allows SignalBackends to signal
advancing backends only when their target position would leave them
behind the new queue head, while safely direct-advancing idle backends
that would not be interested in the newly written notifications.
Idle backends that are stationary at a position before the old queue
head are signaled, since they might be interested in the notifications
in between their current position and the old queue head.
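The per-backend decision described in this section can be summarized by the
sketch below. It is an illustrative model only: Pos, BackendSketch, Action and
decide are hypothetical names, the "interested" flag stands in for the channel
hash lookup, and wakeupPending handling and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Pos
{
	int64_t		page;
	int			offset;
} Pos;

/* True if x comes before y in queue order (no wraparound in this sketch). */
static bool
pos_precedes(Pos x, Pos y)
{
	return x.page < y.page || (x.page == y.page && x.offset < y.offset);
}

typedef struct BackendSketch
{
	Pos			pos;			/* queue position read so far */
	bool		isAdvancing;	/* currently reading the queue? */
	Pos			advancingPos;	/* target position while advancing */
} BackendSketch;

typedef enum
{
	ACT_NONE,
	ACT_SIGNAL,
	ACT_DIRECT_ADVANCE
} Action;

/*
 * Decide what the committing backend does with one listener, given the
 * queue head before (oldHead) and after (newHead) its own writes.
 */
static Action
decide(BackendSketch *b, bool interested, Pos oldHead, Pos newHead)
{
	assert(!pos_precedes(newHead, oldHead));

	if (interested)
		return ACT_SIGNAL;		/* listens on a notified channel */

	if (b->isAdvancing)
	{
		/* signal only if its target would leave it behind the new head */
		return pos_precedes(b->advancingPos, newHead) ? ACT_SIGNAL : ACT_NONE;
	}

	if (pos_precedes(b->pos, oldHead))
		return ACT_SIGNAL;		/* may have unread entries of interest */

	if (pos_precedes(b->pos, newHead))
	{
		/* everything up to newHead is ours and irrelevant to it: skip ahead */
		b->pos = newHead;
		return ACT_DIRECT_ADVANCE;
	}

	return ACT_NONE;			/* already up to date */
}
```

Only the ACT_SIGNAL case costs a wakeup; an idle, uninterested backend sitting
at the old head is moved forward without being signaled.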
Other notes
-----------
The listenChannelsHash provides fast lock-free lookups during
notification processing, avoiding contention on the shared hash during
the high-frequency IsListeningOn checks that occur for every
notification read from the queue.
This patch adds LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash for visibility.
There are no user-visible behavioral changes; this is an internal
optimization only.
---
src/backend/commands/async.c | 1055 ++++++++++++++---
.../utils/activity/wait_event_names.txt | 1 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 3 +
4 files changed, 864 insertions(+), 196 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index eb86402cae4..50fb17ad887 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -24,8 +24,10 @@
* All notification messages are placed in the queue and later read out
* by listening backends.
*
- * There is no central knowledge of which backend listens on which channel;
- * every backend has its own list of interesting channels.
+ * We also maintain a dynamic shared hash table (dshash) that maps channel
+ * names to the set of backends listening on each channel. This table is
+ * created lazily on the first LISTEN command and grows dynamically as
+ * needed.
*
* Although there is only one queue, notifications are treated as being
* database-local; this is done by including the sender's database OID
@@ -64,20 +66,33 @@
* notifications, we can still call elog(ERROR, ...) and the transaction
* will roll back.
*
+ * PreCommit_Notify() also stages any pending LISTEN/UNLISTEN actions.
+ * LISTEN operations pre-allocate entries in both the per-backend
+ * listenChannelsHash and the shared channelHash (with listening=false).
+ * All allocations happen before committing to clog so failures safely abort.
+ *
* Once we have put all of the notifications into the queue, we return to
* CommitTransaction() which will then do the actual transaction commit.
*
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
- * Then we signal any backends that may be interested in our messages
- * (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * commit the staged listen/unlisten changes by setting listening=true for
+ * staged LISTENs, or removing entries for UNLISTENs. Then we signal any backends
+ * that may be interested in our messages (including our own backend,
+ * if listening). This is done by SignalBackends(), which consults the
+ * shared channel hash table to identify listeners for the channels that
+ * have pending notifications in the current database. Each selected
+ * backend is marked as having a wakeup pending to avoid duplicate signals,
+ * and a PROCSIG_NOTIFY_INTERRUPT signal is sent to it.
+ *
+ * When writing notifications, PreCommit_Notify() records the queue head
+ * position both before and after the write. Because all writers serialize
+ * on a cluster-wide heavyweight lock, no backend can insert entries between
+ * these two points. SignalBackends() uses this fact to directly advance any
+ * backend that is still positioned at the old head, or within the range
+ * written, avoiding unnecessary wakeups for idle listeners that have
+ * nothing to read. Backends that cannot be direct advanced are signaled
+ * if they are stuck behind the old queue head, or advancing to a position
+ * before the new queue head, since otherwise notifications could be delayed.
*
* Finally, after we are out of the transaction altogether and about to go
* idle, we scan the queue for messages that need to be sent to our
@@ -137,14 +152,17 @@
#include "commands/async.h"
#include "common/hashfn.h"
#include "funcapi.h"
+#include "lib/dshash.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
+#include "storage/dsm_registry.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procsignal.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
+#include "utils/dsa.h"
#include "utils/guc_hooks.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
@@ -162,6 +180,36 @@
*/
#define NOTIFY_PAYLOAD_MAX_LENGTH (BLCKSZ - NAMEDATALEN - 128)
+/*
+ * Channel hash table definitions
+ *
+ * This hash table maps (database OID, channel name) keys to arrays of
+ * ProcNumbers representing the backends listening on each channel.
+ */
+
+#define INITIAL_LISTENERS_ARRAY_SIZE 4
+
+typedef struct ChannelHashKey
+{
+ Oid dboid;
+ char channel[NAMEDATALEN];
+} ChannelHashKey;
+
+
+typedef struct ListenerEntry
+{
+ ProcNumber procNo;
+ bool listening; /* true if committed listener */
+} ListenerEntry;
+
+typedef struct ChannelListeners
+{
+ ChannelHashKey key;
+ dsa_pointer listenersArray; /* DSA pointer to ListenerEntry array */
+ int numListeners; /* Number of listeners currently stored */
+ int allocatedListeners; /* Allocated size of array */
+} ChannelListeners;
+
/*
* Struct representing an entry in the global notify queue
*
@@ -224,11 +272,14 @@ typedef struct QueuePosition
(x).page != (y).page ? (x) : \
(x).offset > (y).offset ? (x) : (y))
+/* returns true if x comes before y in queue order */
+#define QUEUE_POS_PRECEDES(x,y) \
+ (asyncQueuePagePrecedes((x).page, (y).page) || \
+ ((x).page == (y).page && (x).offset < (y).offset))
+
/*
* Parameter determining how often we try to advance the tail pointer:
- * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data. This is
- * also the distance by which a backend in another database needs to be
- * behind before we'll decide we need to wake it up to advance its pointer.
+ * we do that after every QUEUE_CLEANUP_DELAY pages of NOTIFY data.
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +297,9 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeupPending; /* signal sent but not yet processed */
+ bool isAdvancing; /* backend is advancing its position */
+ QueuePosition advancingPos; /* target position backend is advancing to */
} QueueBackendStatus;
/*
@@ -260,14 +314,16 @@ typedef struct QueueBackendStatus
* (since no other backend will inspect it).
*
* When holding NotifyQueueLock in EXCLUSIVE mode, backends can inspect the
- * entries of other backends and also change the head pointer. When holding
- * both NotifyQueueLock and NotifyQueueTailLock in EXCLUSIVE mode, backends
- * can change the tail pointers.
+ * entries of other backends and also change the head pointer. They can
+ * also advance other backends' queue positions, unless those backends are
+ * in the process of advancing themselves. When holding both NotifyQueueLock
+ * and NotifyQueueTailLock in EXCLUSIVE mode, backends can change the tail pointers.
*
* SLRU buffer pool is divided in banks and bank wise SLRU lock is used as
* the control lock for the pg_notify SLRU buffers.
* In order to avoid deadlocks, whenever we need multiple locks, we first get
- * NotifyQueueTailLock, then NotifyQueueLock, and lastly SLRU bank lock.
+ * NotifyQueueTailLock, then NotifyQueueLock, then SLRU bank lock, and lastly
+ * channelHash partition locks.
*
* Each backend uses the backend[] array entry with index equal to its
* ProcNumber. We rely on this to make SendProcSignal fast.
@@ -288,11 +344,16 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channelHashDSA;
+ dshash_table_handle channelHashDSH;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
} AsyncQueueControl;
static AsyncQueueControl *asyncQueueControl;
+static dsa_area *channelDSA = NULL;
+static dshash_table *channelHash = NULL;
+
#define QUEUE_HEAD (asyncQueueControl->head)
#define QUEUE_TAIL (asyncQueueControl->tail)
#define QUEUE_STOP_PAGE (asyncQueueControl->stopPage)
@@ -301,6 +362,9 @@ static AsyncQueueControl *asyncQueueControl;
#define QUEUE_BACKEND_DBOID(i) (asyncQueueControl->backend[i].dboid)
#define QUEUE_NEXT_LISTENER(i) (asyncQueueControl->backend[i].nextListener)
#define QUEUE_BACKEND_POS(i) (asyncQueueControl->backend[i].pos)
+#define QUEUE_BACKEND_WAKEUP_PENDING(i) (asyncQueueControl->backend[i].wakeupPending)
+#define QUEUE_BACKEND_IS_ADVANCING(i) (asyncQueueControl->backend[i].isAdvancing)
+#define QUEUE_BACKEND_ADVANCING_POS(i) (asyncQueueControl->backend[i].advancingPos)
/*
* The SLRU buffer area through which we access the notification queue
@@ -313,16 +377,19 @@ static SlruCtlData NotifyCtlData;
#define QUEUE_FULL_WARN_INTERVAL 5000 /* warn at most once every 5s */
/*
- * listenChannels identifies the channels we are actually listening to
- * (ie, have committed a LISTEN on). It is a simple list of channel names,
- * allocated in TopMemoryContext.
+ * listenChannelsHash caches the channels this backend is listening on.
+ * Used by IsListeningOn() for fast lookups when reading notifications.
+ * Entries are pre-allocated during PreCommit_Notify (before clog commit)
+ * so allocation failures safely abort. On abort, staged entries are removed.
+ * Allocated in TopMemoryContext so it persists across transactions.
*/
-static List *listenChannels = NIL; /* list of C strings */
+static HTAB *listenChannelsHash = NULL;
/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
- * all actions requested in the current transaction. As explained above,
- * we don't actually change listenChannels until we reach transaction commit.
+ * all actions requested in the current transaction. During PreCommit_Notify,
+ * we stage these changes in listenChannelsHash and the shared channelHash.
+ * On abort, AtAbort_Notify cleans up any staged-but-uncommitted entries.
*
* The list is kept in CurTransactionContext. In subtransactions, each
* subtransaction has its own list in its own CurTransactionContext, but
@@ -391,6 +458,7 @@ typedef struct NotificationList
int nestingLevel; /* current transaction nesting depth */
List *events; /* list of Notification structs */
HTAB *hashtab; /* hash of NotificationHash structs, or NULL */
+ HTAB *channelSet; /* hash of unique channel names, or NULL */
struct NotificationList *upper; /* details for upper transaction levels */
} NotificationList;
@@ -401,6 +469,18 @@ struct NotificationHash
Notification *event; /* => the actual Notification struct */
};
+struct ChannelName
+{
+ char channel[NAMEDATALEN];
+};
+
+/* Entry for pendingListenChannels hash table */
+struct PendingListenEntry
+{
+ char channel[NAMEDATALEN]; /* hash key */
+ bool listening; /* true = LISTEN, false = UNLISTEN */
+};
+
static NotificationList *pendingNotifies = NULL;
/*
@@ -418,6 +498,37 @@ static bool unlistenExitRegistered = false;
/* True if we're currently registered as a listener in asyncQueueControl */
static bool amRegisteredListener = false;
+/*
+ * Queue head positions for direct advancement.
+ * These are captured during PreCommit_Notify while holding the heavyweight
+ * lock on database 0, ensuring no other backend can insert notifications
+ * between them. SignalBackends uses these to advance idle backends.
+ */
+static QueuePosition queueHeadBeforeWrite;
+static QueuePosition queueHeadAfterWrite;
+
+/*
+ * List of channels with pending notifications in the current transaction.
+ */
+static List *pendingNotifyChannels = NIL;
+
+/*
+ * Hash table of pending listen/unlisten changes in the current transaction.
+ * Key is channel name, value is boolean (true = LISTEN, false = UNLISTEN).
+ * Provides automatic deduplication of repeated LISTEN/UNLISTEN on same channel.
+ * Populated during PreCommit_Notify and used by AtCommit_Notify/AtAbort_Notify.
+ */
+static HTAB *pendingListenChannels = NULL;
+
+/*
+ * Preallocated arrays for SignalBackends to avoid memory allocation after
+ * committing to clog. Allocated in PreCommit_Notify when there are pending
+ * notifications.
+ */
+static int32 *signalPids = NULL;
+static ProcNumber *signalProcnos = NULL;
+
+
/* have we advanced to a page that's a multiple of QUEUE_CLEANUP_DELAY? */
static bool tryAdvanceTail = false;
@@ -428,14 +539,14 @@ bool Trace_notify = false;
int max_notify_queue_pages = 1048576;
/* local function prototypes */
-static inline int64 asyncQueuePageDiff(int64 p, int64 q);
static inline bool asyncQueuePagePrecedes(int64 p, int64 q);
static void queue_listen(ListenActionKind action, const char *channel);
static void Async_UnlistenOnExit(int code, Datum arg);
static void Exec_ListenPreCommit(void);
-static void Exec_ListenCommit(const char *channel);
-static void Exec_UnlistenCommit(const char *channel);
-static void Exec_UnlistenAllCommit(void);
+static void Exec_ListenPreCommitStage(const char *channel);
+static void Exec_UnlistenPreCommitStage(const char *channel);
+static void Exec_UnlistenAllPreCommitStage(void);
+static void CleanupListenersOnExit(void);
static bool IsListeningOn(const char *channel);
static void asyncQueueUnregister(void);
static bool asyncQueueIsFull(void);
@@ -456,16 +567,9 @@ static void AddEventToPendingNotifies(Notification *n);
static uint32 notification_hash(const void *key, Size keysize);
static int notification_match(const void *key1, const void *key2, Size keysize);
static void ClearPendingActionsAndNotifies(void);
-
-/*
- * Compute the difference between two queue page numbers.
- * Previously this function accounted for a wraparound.
- */
-static inline int64
-asyncQueuePageDiff(int64 p, int64 q)
-{
- return p - q;
-}
+static inline void ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel);
+static dshash_hash channelHashFunc(const void *key, size_t size, void *arg);
+static void initChannelHash(void);
/*
* Determines whether p precedes q.
@@ -477,6 +581,131 @@ asyncQueuePagePrecedes(int64 p, int64 q)
return p < q;
}
+/*
+ * channelHashFunc
+ * Hash function for channel keys.
+ */
+static dshash_hash
+channelHashFunc(const void *key, size_t size, void *arg)
+{
+ const ChannelHashKey *k = (const ChannelHashKey *) key;
+ dshash_hash h;
+
+ h = DatumGetUInt32(hash_uint32(k->dboid));
+ h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
+ strnlen(k->channel, NAMEDATALEN)));
+
+ return h;
+}
+
+/* parameters for the channel hash table */
+static const dshash_parameters channelDSHParams = {
+ sizeof(ChannelHashKey),
+ sizeof(ChannelListeners),
+ dshash_memcmp,
+ channelHashFunc,
+ dshash_memcpy,
+ LWTRANCHE_NOTIFY_CHANNEL_HASH
+};
+
+/*
+ * initChannelHash
+ * Lazy initialization of the channel hash table.
+ */
+static void
+initChannelHash(void)
+{
+ MemoryContext oldcontext;
+
+ /* Quick exit if we already did this */
+ if (asyncQueueControl->channelHashDSH != DSHASH_HANDLE_INVALID &&
+ channelHash != NULL)
+ return;
+
+ /* Otherwise, use a lock to ensure only one process creates the table */
+ LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+
+ /* Be sure any local memory allocated by DSA routines is persistent */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ if (asyncQueueControl->channelHashDSH == DSHASH_HANDLE_INVALID)
+ {
+ /* Initialize dynamic shared hash table for channel hash */
+ channelDSA = dsa_create(LWTRANCHE_NOTIFY_CHANNEL_HASH);
+ dsa_pin(channelDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_create(channelDSA, &channelDSHParams, NULL);
+
+ /* Store handles in shared memory for other backends to use */
+ asyncQueueControl->channelHashDSA = dsa_get_handle(channelDSA);
+ asyncQueueControl->channelHashDSH =
+ dshash_get_hash_table_handle(channelHash);
+ }
+ else if (!channelHash)
+ {
+ /* Attach to existing dynamic shared hash table */
+ channelDSA = dsa_attach(asyncQueueControl->channelHashDSA);
+ dsa_pin_mapping(channelDSA);
+ channelHash = dshash_attach(channelDSA, &channelDSHParams,
+ asyncQueueControl->channelHashDSH,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ LWLockRelease(NotifyQueueLock);
+}
+
+/*
+ * initListenChannelsHash
+ * Lazy initialization of the local listen channels hash table.
+ */
+static void
+initListenChannelsHash(void)
+{
+ HASHCTL hash_ctl;
+
+ /* Quick exit if we already did this */
+ if (listenChannelsHash != NULL)
+ return;
+
+ /* Initialize local hash table for this backend's listened channels */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelName);
+
+ listenChannelsHash =
+ hash_create("Listen Channels",
+ 64,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS);
+}
+
+/*
+ * initPendingListenChannels
+ * Lazy initialization of the pending listen channels hash table.
+ * This is allocated in CurTransactionContext and destroyed at
+ * transaction end.
+ */
+static void
+initPendingListenChannels(void)
+{
+ HASHCTL hash_ctl;
+
+ if (pendingListenChannels != NULL)
+ return;
+
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct PendingListenEntry);
+ hash_ctl.hcxt = CurTransactionContext;
+
+ pendingListenChannels =
+ hash_create("Pending Listen Channels",
+ 16,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+}
+
/*
* Report space needed for our shared memory area
*/
@@ -520,12 +749,17 @@ AsyncShmemInit(void)
QUEUE_STOP_PAGE = 0;
QUEUE_FIRST_LISTENER = INVALID_PROC_NUMBER;
asyncQueueControl->lastQueueFillWarn = 0;
+ asyncQueueControl->channelHashDSA = DSA_HANDLE_INVALID;
+ asyncQueueControl->channelHashDSH = DSHASH_HANDLE_INVALID;
for (int i = 0; i < MaxBackends; i++)
{
QUEUE_BACKEND_PID(i) = InvalidPid;
QUEUE_BACKEND_DBOID(i) = InvalidOid;
QUEUE_NEXT_LISTENER(i) = INVALID_PROC_NUMBER;
SET_QUEUE_POS(QUEUE_BACKEND_POS(i), 0, 0);
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(i), 0, 0);
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = false;
+ QUEUE_BACKEND_IS_ADVANCING(i) = false;
}
}
@@ -656,6 +890,7 @@ Async_Notify(const char *channel, const char *payload)
notifies->events = list_make1(n);
/* We certainly don't need a hashtable yet */
notifies->hashtab = NULL;
+ notifies->channelSet = NULL;
notifies->upper = pendingNotifies;
pendingNotifies = notifies;
}
@@ -682,8 +917,8 @@ Async_Notify(const char *channel, const char *payload)
* Common code for listen, unlisten, unlisten all commands.
*
* Adds the request to the list of pending actions.
- * Actual update of the listenChannels list happens during transaction
- * commit.
+ * Actual update of listenChannelsHash and channelHash happens during
+ * PreCommit_Notify, with staged changes committed in AtCommit_Notify.
*/
static void
queue_listen(ListenActionKind action, const char *channel)
@@ -782,30 +1017,49 @@ Async_UnlistenAll(void)
* SQL function: return a set of the channel names this backend is actively
* listening to.
*
- * Note: this coding relies on the fact that the listenChannels list cannot
+ * Note: this coding relies on the fact that the listenChannelsHash cannot
* change within a transaction.
*/
Datum
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ HASH_SEQ_STATUS *status;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
+
+ /* Initialize hash table iteration if we have any channels */
+ if (listenChannelsHash != NULL)
+ {
+ oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ status = (HASH_SEQ_STATUS *) palloc(sizeof(HASH_SEQ_STATUS));
+ hash_seq_init(status, listenChannelsHash);
+ funcctx->user_fctx = status;
+ MemoryContextSwitchTo(oldcontext);
+ }
+ else
+ {
+ funcctx->user_fctx = NULL;
+ }
}
/* stuff done on every call of the function */
funcctx = SRF_PERCALL_SETUP();
+ status = (HASH_SEQ_STATUS *) funcctx->user_fctx;
- if (funcctx->call_cntr < list_length(listenChannels))
+ if (status != NULL)
{
- char *channel = (char *) list_nth(listenChannels,
- funcctx->call_cntr);
+ struct ChannelName *entry;
- SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(channel));
+ entry = (struct ChannelName *) hash_seq_search(status);
+ if (entry != NULL)
+ SRF_RETURN_NEXT(funcctx, CStringGetTextDatum(entry->channel));
}
SRF_RETURN_DONE(funcctx);
@@ -821,7 +1075,7 @@ pg_listening_channels(PG_FUNCTION_ARGS)
static void
Async_UnlistenOnExit(int code, Datum arg)
{
- Exec_UnlistenAllCommit();
+ CleanupListenersOnExit();
asyncQueueUnregister();
}
@@ -868,8 +1122,25 @@ PreCommit_Notify(void)
elog(DEBUG1, "PreCommit_Notify");
/* Preflight for any pending listen/unlisten actions */
+ if (pendingNotifies != NULL || pendingActions != NULL)
+ initChannelHash();
+
+ if (pendingNotifies != NULL)
+ {
+ if (signalPids == NULL)
+ signalPids = MemoryContextAlloc(TopMemoryContext,
+ MaxBackends * sizeof(int32));
+
+ if (signalProcnos == NULL)
+ signalProcnos = MemoryContextAlloc(TopMemoryContext,
+ MaxBackends * sizeof(ProcNumber));
+ }
+
if (pendingActions != NULL)
{
+ initListenChannelsHash();
+ initPendingListenChannels();
+
foreach(p, pendingActions->actions)
{
ListenAction *actrec = (ListenAction *) lfirst(p);
@@ -878,12 +1149,13 @@ PreCommit_Notify(void)
{
case LISTEN_LISTEN:
Exec_ListenPreCommit();
+ Exec_ListenPreCommitStage(actrec->channel);
break;
case LISTEN_UNLISTEN:
- /* there is no Exec_UnlistenPreCommit() */
+ Exec_UnlistenPreCommitStage(actrec->channel);
break;
case LISTEN_UNLISTEN_ALL:
- /* there is no Exec_UnlistenAllPreCommit() */
+ Exec_UnlistenAllPreCommitStage();
break;
}
}
@@ -893,6 +1165,36 @@ PreCommit_Notify(void)
if (pendingNotifies)
{
ListCell *nextNotify;
+ bool firstIteration = true;
+
+ /*
+ * Build list of unique channels for SignalBackends().
+ *
+ * If we have a channelSet, use it to efficiently get the unique
+ * channels. Otherwise, fall back to the linear approach.
+ */
+ pendingNotifyChannels = NIL;
+ if (pendingNotifies->channelSet != NULL)
+ {
+ HASH_SEQ_STATUS status;
+ struct ChannelName *channelEntry;
+
+ hash_seq_init(&status, pendingNotifies->channelSet);
+ while ((channelEntry = (struct ChannelName *) hash_seq_search(&status)) != NULL)
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channelEntry->channel);
+ }
+ else
+ {
+ /* Linear approach for small number of notifications */
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
+ }
/*
* Make sure that we have an XID assigned to the current transaction.
@@ -921,6 +1223,22 @@ PreCommit_Notify(void)
LockSharedObject(DatabaseRelationId, InvalidOid, 0,
AccessExclusiveLock);
+ /*
+ * For the direct advancement optimization in SignalBackends(), we
+ * need to ensure that no other backend can insert queue entries
+ * between queueHeadBeforeWrite and queueHeadAfterWrite. The
+ * heavyweight lock above provides this guarantee, since it serializes
+ * all writers.
+ *
+ * Note: if the heavyweight lock were ever removed for scalability
+ * reasons, we could achieve the same guarantee by holding
+ * NotifyQueueLock in EXCLUSIVE mode across all our insertions, rather
+ * than releasing and reacquiring it for each page as we do below.
+ */
+
+ /* Initialize queueHeadBeforeWrite to a safe default */
+ SET_QUEUE_POS(queueHeadBeforeWrite, 0, 0);
+
/* Now push the notifications into the queue */
nextNotify = list_head(pendingNotifies->events);
while (nextNotify != NULL)
@@ -938,12 +1256,20 @@ PreCommit_Notify(void)
* point in time we can still roll the transaction back.
*/
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
+ if (firstIteration)
+ {
+ queueHeadBeforeWrite = QUEUE_HEAD;
+ firstIteration = false;
+ }
+
asyncQueueFillWarning();
if (asyncQueueIsFull())
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("too many notifications in the NOTIFY queue")));
nextNotify = asyncQueueAddEntries(nextNotify);
+ queueHeadAfterWrite = QUEUE_HEAD;
+
LWLockRelease(NotifyQueueLock);
}
@@ -956,7 +1282,7 @@ PreCommit_Notify(void)
*
* This is called at transaction commit, after committing to clog.
*
- * Update listenChannels and clear transaction-local state.
+ * Update listenChannelsHash and clear transaction-local state.
*
* If we issued any notifications in the transaction, send signals to
* listening backends (possibly including ourselves) to process them.
@@ -966,8 +1292,6 @@ PreCommit_Notify(void)
void
AtCommit_Notify(void)
{
- ListCell *p;
-
/*
* Allow transactions that have not executed LISTEN/UNLISTEN/NOTIFY to
* return as soon as possible
@@ -978,30 +1302,69 @@ AtCommit_Notify(void)
if (Trace_notify)
elog(DEBUG1, "AtCommit_Notify");
- /* Perform any pending listen/unlisten actions */
- if (pendingActions != NULL)
+ /* Commit staged listen/unlisten changes */
+ if (pendingListenChannels != NULL)
{
- foreach(p, pendingActions->actions)
+ HASH_SEQ_STATUS seq;
+ struct PendingListenEntry *pending;
+
+ hash_seq_init(&seq, pendingListenChannels);
+ while ((pending = (struct PendingListenEntry *) hash_seq_search(&seq)) != NULL)
{
- ListenAction *actrec = (ListenAction *) lfirst(p);
+ ChannelHashKey key;
+ ChannelListeners *entry;
+ ListenerEntry *listeners;
- switch (actrec->action)
+ ChannelHashPrepareKey(&key, MyDatabaseId, pending->channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
{
- case LISTEN_LISTEN:
- Exec_ListenCommit(actrec->channel);
- break;
- case LISTEN_UNLISTEN:
- Exec_UnlistenCommit(actrec->channel);
- break;
- case LISTEN_UNLISTEN_ALL:
- Exec_UnlistenAllCommit();
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ if (pending->listening)
+ {
+ /*
+ * LISTEN being committed: set listening=true.
+ * listenChannelsHash was pre-allocated in PreCommit.
+ */
+ listeners[i].listening = true;
+ }
+ else
+ {
+ /* UNLISTEN being committed: remove from channelHash */
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ /* Remove from local cache */
+ (void) hash_search(listenChannelsHash, pending->channel,
+ HASH_REMOVE, NULL);
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ entry = NULL;
+ }
+ }
break;
+ }
}
+
+ if (entry != NULL)
+ dshash_release_lock(channelHash, entry);
}
}
/* If no longer listening to anything, get out of listener array */
- if (amRegisteredListener && listenChannels == NIL)
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/*
@@ -1098,6 +1461,9 @@ Exec_ListenPreCommit(void)
QUEUE_BACKEND_POS(MyProcNumber) = max;
QUEUE_BACKEND_PID(MyProcNumber) = MyProcPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = MyDatabaseId;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = max;
/* Insert backend into list of listeners at correct position */
if (prevListener != INVALID_PROC_NUMBER)
{
@@ -1127,99 +1493,220 @@ Exec_ListenPreCommit(void)
}
/*
- * Exec_ListenCommit --- subroutine for AtCommit_Notify
+ * Exec_ListenPreCommitStage --- subroutine for PreCommit_Notify
*
- * Add the channel to the list of channels we are listening on.
+ * Stage a LISTEN by recording it in pendingListenChannels, pre-allocating
+ * an entry in listenChannelsHash, and pre-allocating an entry in the shared
+ * channelHash with listening=false. The listening flag is set to true in
+ * AtCommit_Notify. On abort, the pre-allocated entries are removed.
*/
static void
-Exec_ListenCommit(const char *channel)
+Exec_ListenPreCommitStage(const char *channel)
{
- MemoryContext oldcontext;
+ ChannelHashKey key;
+ ChannelListeners *entry;
+ bool found;
+ ListenerEntry *listeners;
+ struct PendingListenEntry *pending;
- /* Do nothing if we are already listening on this channel */
- if (IsListeningOn(channel))
+ /* Record in local pending hash that we want to LISTEN */
+ pending = (struct PendingListenEntry *)
+ hash_search(pendingListenChannels, channel, HASH_ENTER, &found);
+ pending->listening = true;
+
+ /* Pre-allocate in local cache (OOM-safe: before clog commit) */
+ (void) hash_search(listenChannelsHash, channel, HASH_ENTER, NULL);
+
+ /* Pre-allocate entry in shared channelHash with listening=false */
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find_or_insert(channelHash, &key, &found);
+
+ if (!found)
+ {
+ entry->listenersArray = InvalidDsaPointer;
+ entry->numListeners = 0;
+ entry->allocatedListeners = 0;
+ }
+
+ if (!DsaPointerIsValid(entry->listenersArray))
+ {
+ entry->listenersArray = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * INITIAL_LISTENERS_ARRAY_SIZE);
+ entry->allocatedListeners = INITIAL_LISTENERS_ARRAY_SIZE;
+ }
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ /*
+ * Check if we already have an entry (possibly from earlier in this
+ * transaction)
+ */
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ /* Already have an entry; listening flag stays as-is until commit */
+ dshash_release_lock(channelHash, entry);
+ return;
+ }
+ }
+
+ /* Need to add a new entry; grow array if necessary */
+ if (entry->numListeners >= entry->allocatedListeners)
+ {
+ int new_size = entry->allocatedListeners * 2;
+ dsa_pointer new_array = dsa_allocate(channelDSA,
+ sizeof(ListenerEntry) * new_size);
+ ListenerEntry *new_listeners = (ListenerEntry *) dsa_get_address(channelDSA, new_array);
+
+ memcpy(new_listeners, listeners, sizeof(ListenerEntry) * entry->numListeners);
+ dsa_free(channelDSA, entry->listenersArray);
+ entry->listenersArray = new_array;
+ entry->allocatedListeners = new_size;
+ listeners = new_listeners;
+ }
+
+ listeners[entry->numListeners].procNo = MyProcNumber;
+ listeners[entry->numListeners].listening = false; /* staged, not yet
+ * committed */
+ entry->numListeners++;
+
+ dshash_release_lock(channelHash, entry);
+}
+
+/*
+ * Exec_UnlistenPreCommitStage --- subroutine for PreCommit_Notify
+ *
+ * Stage an UNLISTEN by recording it in pendingListenChannels. We don't
+ * touch channelHash yet - the listener keeps receiving signals until
+ * commit, when the entry is removed.
+ */
+static void
+Exec_UnlistenPreCommitStage(const char *channel)
+{
+ struct PendingListenEntry *pending;
+ bool found;
+
+ /*
+ * Record in local pending hash that we want to UNLISTEN. Don't touch
+ * listenChannelsHash or channelHash yet - we keep receiving signals until
+ * commit.
+ */
+ pending = (struct PendingListenEntry *)
+ hash_search(pendingListenChannels, channel, HASH_ENTER, &found);
+ pending->listening = false;
+}
+
+/*
+ * Exec_UnlistenAllPreCommitStage --- subroutine for PreCommit_Notify
+ *
+ * Stage UNLISTEN * by recording all listened channels in pendingListenChannels
+ * with listening=false.
+ */
+static void
+Exec_UnlistenAllPreCommitStage(void)
+{
+ HASH_SEQ_STATUS seq;
+ struct ChannelName *channelEntry;
+ struct PendingListenEntry *pending;
+
+ /*
+ * First, set all existing entries in pendingListenChannels to false. This
+ * handles the case of LISTEN foo; UNLISTEN ALL - foo needs to be marked
+ * as unlisten even though it's not in listenChannelsHash yet.
+ */
+ hash_seq_init(&seq, pendingListenChannels);
+ while ((pending = (struct PendingListenEntry *) hash_seq_search(&seq)) != NULL)
+ pending->listening = false;
+
+ /*
+ * Then scan listenChannelsHash (committed channels) and add any that
+ * aren't already in pendingListenChannels.
+ */
+ if (listenChannelsHash != NULL)
+ {
+ hash_seq_init(&seq, listenChannelsHash);
+ while ((channelEntry = (struct ChannelName *) hash_seq_search(&seq)) != NULL)
+ {
+ bool found;
+
+ pending = (struct PendingListenEntry *)
+ hash_search(pendingListenChannels, channelEntry->channel, HASH_ENTER, &found);
+ pending->listening = false;
+ }
+ }
+}
+
+/*
+ * CleanupListenersOnExit --- called from Async_UnlistenOnExit
+ *
+ * Remove this backend from all channels in the shared hash.
+ */
+static void
+CleanupListenersOnExit(void)
+{
+ dshash_seq_status status;
+ ChannelListeners *entry;
+
+ if (Trace_notify)
+ elog(DEBUG1, "CleanupListenersOnExit(%d)", MyProcPid);
+
+ /* Clear our local cache */
+ if (listenChannelsHash != NULL)
+ {
+ hash_destroy(listenChannelsHash);
+ listenChannelsHash = NULL;
+ }
+
+ /* Now remove from the shared channelHash */
+ if (channelHash == NULL)
return;
- /*
- * Add the new channel name to listenChannels.
- *
- * XXX It is theoretically possible to get an out-of-memory failure here,
- * which would be bad because we already committed. For the moment it
- * doesn't seem worth trying to guard against that, but maybe improve this
- * later.
- */
- oldcontext = MemoryContextSwitchTo(TopMemoryContext);
- listenChannels = lappend(listenChannels, pstrdup(channel));
- MemoryContextSwitchTo(oldcontext);
-}
-
-/*
- * Exec_UnlistenCommit --- subroutine for AtCommit_Notify
- *
- * Remove the specified channel name from listenChannels.
- */
-static void
-Exec_UnlistenCommit(const char *channel)
-{
- ListCell *q;
-
- if (Trace_notify)
- elog(DEBUG1, "Exec_UnlistenCommit(%s,%d)", channel, MyProcPid);
-
- foreach(q, listenChannels)
+ dshash_seq_init(&status, channelHash, true);
+ while ((entry = dshash_seq_next(&status)) != NULL)
{
- char *lchan = (char *) lfirst(q);
-
- if (strcmp(lchan, channel) == 0)
+ if (entry->key.dboid == MyDatabaseId)
{
- listenChannels = foreach_delete_current(listenChannels, q);
- pfree(lchan);
- break;
+ ListenerEntry *listeners;
+ int i;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_current(&status);
+ }
+ break;
+ }
+ }
}
}
-
- /*
- * We do not complain about unlistening something not being listened;
- * should we?
- */
-}
-
-/*
- * Exec_UnlistenAllCommit --- subroutine for AtCommit_Notify
- *
- * Unlisten on all channels for this backend.
- */
-static void
-Exec_UnlistenAllCommit(void)
-{
- if (Trace_notify)
- elog(DEBUG1, "Exec_UnlistenAllCommit(%d)", MyProcPid);
-
- list_free_deep(listenChannels);
- listenChannels = NIL;
+ dshash_seq_term(&status);
}
/*
* Test whether we are actively listening on the given channel name.
*
* Note: this function is executed for every notification found in the queue.
- * Perhaps it is worth further optimization, eg convert the list to a sorted
- * array so we can binary-search it. In practice the list is likely to be
- * fairly short, though.
*/
static bool
IsListeningOn(const char *channel)
{
- ListCell *p;
+ if (listenChannelsHash == NULL)
+ return false;
- foreach(p, listenChannels)
- {
- char *lchan = (char *) lfirst(p);
-
- if (strcmp(lchan, channel) == 0)
- return true;
- }
- return false;
+ return (hash_search(listenChannelsHash, channel, HASH_FIND, NULL) != NULL);
}
/*
@@ -1229,7 +1716,7 @@ IsListeningOn(const char *channel)
static void
asyncQueueUnregister(void)
{
- Assert(listenChannels == NIL); /* else caller error */
+ Assert(listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0); /* else caller error */
if (!amRegisteredListener) /* nothing to do */
return;
@@ -1241,6 +1728,9 @@ asyncQueueUnregister(void)
/* Mark our entry as invalid */
QUEUE_BACKEND_PID(MyProcNumber) = InvalidPid;
QUEUE_BACKEND_DBOID(MyProcNumber) = InvalidOid;
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
+ SET_QUEUE_POS(QUEUE_BACKEND_ADVANCING_POS(MyProcNumber), 0, 0);
/* and remove it from the list */
if (QUEUE_FIRST_LISTENER == MyProcNumber)
QUEUE_FIRST_LISTENER = QUEUE_NEXT_LISTENER(MyProcNumber);
@@ -1565,12 +2055,21 @@ asyncQueueFillWarning(void)
/*
* Send signals to listening backends.
*
- * Normally we signal only backends in our own database, since only those
- * backends could be interested in notifies we send. However, if there's
- * notify traffic in our database but no traffic in another database that
- * does have listener(s), those listeners will fall further and further
- * behind. Waken them anyway if they're far enough behind, so that they'll
- * advance their queue position pointers, allowing the global tail to advance.
+ * Normally we signal only backends in our own database, that are
+ * listening on the channels with pending notifies, since only those
+ * backends are interested in notifies we send.
+ *
+ * Backends that are not interested in our notifies, that are known
+ * to still be positioned at the old queue head, or anywhere in the
+ * queue region we just wrote, can be safely advanced directly to the
+ * new head, since that region is known to contain only our own
+ * notifications. This avoids unnecessary wakeups when there is
+ * nothing of interest to them.
+ *
+ * Backends that are not interested in our notifies, that are advancing
+ * to a target position before the new queue head, or that are not
+ * advancing and are stationary at a position before the old queue head
+ * need to be signaled, since notifications could otherwise be delayed.
*
* Since we know the ProcNumber and the Pid the signaling is quite cheap.
*
@@ -1580,60 +2079,106 @@ asyncQueueFillWarning(void)
static void
SignalBackends(void)
{
- int32 *pids;
- ProcNumber *procnos;
int count;
+ ListCell *lc;
- /*
- * Identify backends that we need to signal. We don't want to send
- * signals while holding the NotifyQueueLock, so this loop just builds a
- * list of target PIDs.
- *
- * XXX in principle these pallocs could fail, which would be bad. Maybe
- * preallocate the arrays? They're not that large, though.
- */
- pids = (int32 *) palloc(MaxBackends * sizeof(int32));
- procnos = (ProcNumber *) palloc(MaxBackends * sizeof(ProcNumber));
+ Assert(signalPids != NULL && signalProcnos != NULL);
count = 0;
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, pendingNotifyChannels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = (char *) lfirst(lc);
+ ChannelListeners *entry = NULL;
+ ListenerEntry *listeners;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channelHash != NULL)
{
- /*
- * Always signal listeners in our own database, unless they're
- * already caught up (unlikely, but possible).
- */
+ ChannelHashKey key;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, channel);
+ entry = dshash_find(channelHash, &key, false);
+ }
+
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA,
+ entry->listenersArray);
+
+ for (int j = 0; j < entry->numListeners; j++)
+ {
+ ProcNumber i;
+ int32 pid;
+ QueuePosition pos;
+
+ if (!listeners[j].listening)
+ continue;
+
+ i = listeners[j].procNo;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ /* Skip if caught up */
if (QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
continue;
+
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ signalPids[count] = pid;
+ signalProcnos[count] = i;
+ count++;
}
- else
+
+ dshash_release_lock(channelHash, entry);
+ }
+
+ if (pendingNotifies != NULL)
+ {
+ for (ProcNumber i = QUEUE_FIRST_LISTENER;
+ i != INVALID_PROC_NUMBER;
+ i = QUEUE_NEXT_LISTENER(i))
{
- /*
- * Listeners in other databases should be signaled only if they
- * are far behind.
- */
- if (asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
- QUEUE_POS_PAGE(pos)) < QUEUE_CLEANUP_DELAY)
+ QueuePosition pos;
+ int32 pid;
+
+ if (QUEUE_BACKEND_WAKEUP_PENDING(i))
continue;
+
+ pos = QUEUE_BACKEND_POS(i);
+ pid = QUEUE_BACKEND_PID(i);
+
+ if (QUEUE_BACKEND_IS_ADVANCING(i) ?
+ QUEUE_POS_PRECEDES(QUEUE_BACKEND_ADVANCING_POS(i), queueHeadAfterWrite) :
+ QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite))
+ {
+ Assert(pid != InvalidPid);
+
+ QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
+ signalPids[count] = pid;
+ signalProcnos[count] = i;
+ count++;
+ }
+ else if (!QUEUE_BACKEND_IS_ADVANCING(i) &&
+ QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))
+ {
+ Assert(!QUEUE_POS_PRECEDES(pos, queueHeadBeforeWrite));
+
+ QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
+ }
}
- /* OK, need to signal this one */
- pids[count] = pid;
- procnos[count] = i;
- count++;
}
LWLockRelease(NotifyQueueLock);
/* Now send signals */
for (int i = 0; i < count; i++)
{
- int32 pid = pids[i];
+ int32 pid = signalPids[i];
/*
* If we are signaling our own process, no need to involve the kernel;
@@ -1651,12 +2196,9 @@ SignalBackends(void)
* NotifyQueueLock; which is unlikely but certainly possible. So we
* just log a low-level debug message if it happens.
*/
- if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]) < 0)
+ if (SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, signalProcnos[i]) < 0)
elog(DEBUG3, "could not signal backend with PID %d: %m", pid);
}
-
- pfree(pids);
- pfree(procnos);
}
/*
@@ -1664,18 +2206,75 @@ SignalBackends(void)
*
* This is called at transaction abort.
*
- * Gets rid of pending actions and outbound notifies that we would have
- * executed if the transaction got committed.
+ * Revert any staged listen/unlisten changes and clean up transaction state.
*/
void
AtAbort_Notify(void)
{
/*
- * If we LISTEN but then roll back the transaction after PreCommit_Notify,
- * we have registered as a listener but have not made any entry in
- * listenChannels. In that case, deregister again.
+ * Revert staged listen/unlisten changes. For staged LISTENs (entries
+ * with listening=false), remove from channelHash. For staged UNLISTENs
+ * on committed channels (entries with listening=true), nothing to undo
+ * since we didn't modify channelHash during staging.
*/
- if (amRegisteredListener && listenChannels == NIL)
+ if (pendingListenChannels != NULL && channelHash != NULL)
+ {
+ HASH_SEQ_STATUS seq;
+ struct PendingListenEntry *pending;
+
+ hash_seq_init(&seq, pendingListenChannels);
+ while ((pending = (struct PendingListenEntry *) hash_seq_search(&seq)) != NULL)
+ {
+ ChannelHashKey key;
+ ChannelListeners *entry;
+ ListenerEntry *listeners;
+
+ ChannelHashPrepareKey(&key, MyDatabaseId, pending->channel);
+ entry = dshash_find(channelHash, &key, true);
+ if (entry == NULL)
+ continue;
+
+ listeners = (ListenerEntry *) dsa_get_address(channelDSA, entry->listenersArray);
+
+ for (int i = 0; i < entry->numListeners; i++)
+ {
+ if (listeners[i].procNo == MyProcNumber)
+ {
+ if (!listeners[i].listening)
+ {
+ /* Staged LISTEN (or LISTEN+UNLISTEN) being aborted */
+ /* Remove pre-allocated entries from both hashes */
+ (void) hash_search(listenChannelsHash, pending->channel,
+ HASH_REMOVE, NULL);
+ entry->numListeners--;
+ if (i < entry->numListeners)
+ memmove(&listeners[i], &listeners[i + 1],
+ sizeof(ListenerEntry) * (entry->numListeners - i));
+
+ if (entry->numListeners == 0)
+ {
+ dsa_free(channelDSA, entry->listenersArray);
+ dshash_delete_entry(channelHash, entry);
+ entry = NULL;
+ }
+ }
+
+ /*
+ * else: UNLISTEN on committed channel being aborted -
+ * nothing to undo
+ */
+ break;
+ }
+ }
+
+ if (entry != NULL)
+ dshash_release_lock(channelHash, entry);
+ }
+ }
+
+ /* If we're no longer listening on anything, unregister */
+ if (amRegisteredListener &&
+ (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0))
asyncQueueUnregister();
/* And clean up */
@@ -1854,20 +2453,29 @@ asyncQueueReadAllNotifications(void)
QueuePosition head;
Snapshot snapshot;
- /* Fetch current state */
+ /*
+ * Fetch current state, indicate to others that we have woken up, and that
+ * we now will be advancing our position.
+ */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
+ head = QUEUE_HEAD;
pos = QUEUE_BACKEND_POS(MyProcNumber);
- head = QUEUE_HEAD;
- LWLockRelease(NotifyQueueLock);
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
+ LWLockRelease(NotifyQueueLock);
return;
}
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = true;
+ QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = head;
+
+ LWLockRelease(NotifyQueueLock);
+
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
@@ -1954,6 +2562,8 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
+
+ QUEUE_BACKEND_IS_ADVANCING(MyProcNumber) = false;
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
@@ -2051,7 +2661,7 @@ asyncQueueProcessPageEntries(QueuePosition *current,
* over it on the first LISTEN in a session, and not get stuck on
* it indefinitely.
*/
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
continue;
if (TransactionIdDidCommit(qe->xid))
@@ -2306,7 +2916,7 @@ ProcessIncomingNotify(bool flush)
notifyInterruptPending = false;
/* Do nothing else if we aren't actively listening */
- if (listenChannels == NIL)
+ if (listenChannelsHash == NULL || hash_get_num_entries(listenChannelsHash) == 0)
return;
if (Trace_notify)
@@ -2410,13 +3020,15 @@ AddEventToPendingNotifies(Notification *n)
{
Assert(pendingNotifies->events != NIL);
- /* Create the hash table if it's time to */
+ /* Create the hash tables if it's time to */
if (list_length(pendingNotifies->events) >= MIN_HASHABLE_NOTIFIES &&
pendingNotifies->hashtab == NULL)
{
HASHCTL hash_ctl;
ListCell *l;
+ Assert(pendingNotifies->channelSet == NULL);
+
/* Create the hash table */
hash_ctl.keysize = sizeof(Notification *);
hash_ctl.entrysize = sizeof(struct NotificationHash);
@@ -2429,10 +3041,22 @@ AddEventToPendingNotifies(Notification *n)
&hash_ctl,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+ /* Create the channel hash table */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = NAMEDATALEN;
+ hash_ctl.entrysize = sizeof(struct ChannelName);
+ hash_ctl.hcxt = CurTransactionContext;
+ pendingNotifies->channelSet =
+ hash_create("Pending Notify Channels",
+ 64L,
+ &hash_ctl,
+ HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
/* Insert all the already-existing events */
foreach(l, pendingNotifies->events)
{
Notification *oldn = (Notification *) lfirst(l);
+ char *channel = oldn->data;
bool found;
(void) hash_search(pendingNotifies->hashtab,
@@ -2440,22 +3064,42 @@ AddEventToPendingNotifies(Notification *n)
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Insert channel into channelSet */
+ (void) hash_search(pendingNotifies->channelSet,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if multiple events on same channel */
}
}
/* Add new event to the list, in order */
pendingNotifies->events = lappend(pendingNotifies->events, n);
- /* Add event to the hash table if needed */
+ /* Add event to the hash tables if needed */
if (pendingNotifies->hashtab != NULL)
{
bool found;
+ Assert(pendingNotifies->channelSet != NULL);
+
(void) hash_search(pendingNotifies->hashtab,
&n,
HASH_ENTER,
&found);
Assert(!found);
+
+ /* Add channel to channelSet */
+ {
+ char *channel = n->data;
+
+ (void) hash_search(pendingNotifies->channelSet,
+ channel,
+ HASH_ENTER,
+ &found);
+ /* found may be true if we already have an event on this channel */
+ }
}
}
@@ -2493,7 +3137,7 @@ notification_match(const void *key1, const void *key2, Size keysize)
return 1; /* not equal */
}
-/* Clear the pendingActions and pendingNotifies lists. */
+/* Clear the pendingActions, pendingNotifies, and pendingNotifyChannels lists. */
static void
ClearPendingActionsAndNotifies(void)
{
@@ -2505,6 +3149,12 @@ ClearPendingActionsAndNotifies(void)
*/
pendingActions = NULL;
pendingNotifies = NULL;
+ pendingNotifyChannels = NIL;
+ if (pendingListenChannels != NULL)
+ {
+ hash_destroy(pendingListenChannels);
+ pendingListenChannels = NULL;
+ }
}
/*
@@ -2515,3 +3165,16 @@ check_notify_buffers(int *newval, void **extra, GucSource source)
{
return check_slru_buffers("notify_buffers", newval);
}
+
+
+/*
+ * ChannelHashPrepareKey
+ * Prepare a channel key for use as a hash key.
+ */
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..32b0b21f184 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -371,6 +371,7 @@ SubtransBuffer "Waiting for I/O on a sub-transaction SLRU buffer."
MultiXactOffsetBuffer "Waiting for I/O on a multixact offset SLRU buffer."
MultiXactMemberBuffer "Waiting for I/O on a multixact member SLRU buffer."
NotifyBuffer "Waiting for I/O on a <command>NOTIFY</command> message SLRU buffer."
+NotifyChannelHash "Waiting to access the <command>NOTIFY</command> channel hash table."
SerialBuffer "Waiting for I/O on a serializable transaction conflict SLRU buffer."
WALInsert "Waiting to insert WAL data into a memory buffer."
BufferContent "Waiting to access a data page in memory."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 533344509e9..277a78e7954 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -102,6 +102,7 @@ PG_LWLOCKTRANCHE(SUBTRANS_BUFFER, SubtransBuffer)
PG_LWLOCKTRANCHE(MULTIXACTOFFSET_BUFFER, MultiXactOffsetBuffer)
PG_LWLOCKTRANCHE(MULTIXACTMEMBER_BUFFER, MultiXactMemberBuffer)
PG_LWLOCKTRANCHE(NOTIFY_BUFFER, NotifyBuffer)
+PG_LWLOCKTRANCHE(NOTIFY_CHANNEL_HASH, NotifyChannelHash)
PG_LWLOCKTRANCHE(SERIAL_BUFFER, SerialBuffer)
PG_LWLOCKTRANCHE(WAL_INSERT, WALInsert)
PG_LWLOCKTRANCHE(BUFFER_CONTENT, BufferContent)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ceb3fc5d980..b3b3312329e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -421,6 +421,8 @@ CatalogIdMapEntry
CatalogIndexState
ChangeVarNodes_callback
ChangeVarNodes_context
+ChannelListeners
+ChannelHashKey
CheckPoint
CheckPointStmt
CheckpointStatsData
@@ -1578,6 +1580,7 @@ ListParsedLex
ListenAction
ListenActionKind
ListenStmt
+ListenerEntry
LoInfo
LoadStmt
LocalBufferLookupEnt
--
2.50.1
* Re: Optimize LISTEN/NOTIFY
@ 2026-01-15 04:46 Tom Lane <[email protected]>
parent: Arseniy Mukhin <[email protected]>
5 siblings, 0 replies; 120+ messages in thread
From: Tom Lane @ 2026-01-15 04:46 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
"Joel Jacobson" <[email protected]> writes:
> On Thu, Jan 15, 2026, at 00:09, Tom Lane wrote:
>> I spent some time trying to measure the impact of that point,
>> by modifying the test program you posted upthread so that
>> some notifiers go at full speed while others respond to the
>> rate-limit switch so that they can be made to go slowly.
>> I couldn't really see any difference between what you have in v34
>> and doing this the old way.
> I reran the old benchmark [1] and got almost identical results as before
> on my MacBook Pro M3 Max, when I tested v34 against patching v34 with
> adding back the QUEUE_CLEANUP_DELAY logic:
> ...
> However, I completely failed to reproduce this difference on my Intel
> and AMD machines!
Fascinating. I was doing my testing on Intel (RHEL8). I'd bet a good
deal that this is more about the OS than the hardware. I wonder if
newer Linux versions behave differently.
I can try to reproduce your results tomorrow on macOS (M4 Pro chip).
> I have no idea what could explain the difference on my M3 Max. Not sure
> if it's due to macOS or due to the aarch64 CPU. It's still much faster
> than master, so I think this is fine, we can always come back to this in
> the future, if there is evidence this is not just an edge-case.
There's no question IMO that this patch is fundamentally a win.
Maybe we can tweak it some more for edge cases, but I think in the
main we should avoid changing edge-case behaviors that we don't have
solid evidence about.
> I therefore agree with your change of bringing back the "wake laggers"
> logic, even though it could possibly cause a few listening backends to
> receive their notifications a bit later than they otherwise would.
Hm, I don't see how this would delay any notifications? Any sender
that sent anything the laggard would be interested in should have
woken it up.
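For reference, a minimal sketch of the "wake laggers" test being discussed,
modeled on the check that master's SignalBackends applies to listeners in
other databases. The type and function names are hypothetical, the delay of
4 pages is an assumption made for illustration, and page-number wraparound
is ignored:

```c
#include <stdbool.h>

/* Assumed value for illustration; stands in for QUEUE_CLEANUP_DELAY. */
#define SKETCH_CLEANUP_DELAY 4

/* Hypothetical, simplified queue position: a page number plus an offset. */
typedef struct SketchQueuePos
{
	long		page;
	int			offset;
} SketchQueuePos;

/*
 * Decide whether an uninterested listener should be woken anyway.
 * It is woken only once it trails the queue head by at least
 * SKETCH_CLEANUP_DELAY pages, so that it advances its position and
 * lets the global tail move forward.  Wraparound is not handled here.
 */
static bool
sketch_should_wake_laggard(SketchQueuePos head, SketchQueuePos pos)
{
	return (head.page - pos.page) >= SKETCH_CLEANUP_DELAY;
}
```

A listener two pages behind the head is left alone under this rule, while
one four or more pages behind is signaled.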
There might be a reason to worry about missed signals, though.
With the addition of the QUEUE_BACKEND_WAKEUP_PENDING flag,
nobody will ever re-signal a laggard backend, and maybe that
would be a problem sometimes. I think the existing code is a
bit more robust against that possibility, though it does rely
on a continuing stream of notifiers.
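To illustrate the concern, here is a toy model of a wakeup-pending flag. It
is not the patch's code: the slot struct and all toy_* names are hypothetical,
and "signal lost" is simulated simply by never running the listener's wakeup
step. It shows that once one signal is missed, later notifiers see the flag
still set and skip the listener:

```c
#include <stdbool.h>

/* Hypothetical stand-in for one listener's shared-memory slot. */
typedef struct ToyListenerSlot
{
	bool		wakeup_pending; /* set by a notifier, cleared by the listener */
} ToyListenerSlot;

/* Notifier side: returns true if a signal was sent. */
static bool
toy_notifier_signal(ToyListenerSlot *slot)
{
	if (slot->wakeup_pending)
		return false;			/* someone already signaled; skip */
	slot->wakeup_pending = true;
	return true;
}

/* Listener side: runs only if the signal actually arrived. */
static void
toy_listener_wakeup(ToyListenerSlot *slot)
{
	slot->wakeup_pending = false;
}

/*
 * Send n notifications in sequence and count how many signals go out.
 * If first_signal_lost is true, the listener never sees the first signal,
 * so it never clears the flag.
 */
static int
toy_count_signals(int n, bool first_signal_lost)
{
	ToyListenerSlot slot = {false};
	int			sent = 0;

	for (int i = 0; i < n; i++)
	{
		if (!toy_notifier_signal(&slot))
			continue;
		sent++;
		if (!(first_signal_lost && sent == 1))
			toy_listener_wakeup(&slot);
	}
	return sent;
}
```

In this model, five notifications produce five signals when every signal is
delivered, but only one signal in total when the first is lost.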
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2026-01-15 18:55 Tom Lane <[email protected]>
parent: Arseniy Mukhin <[email protected]>
5 siblings, 0 replies; 120+ messages in thread
From: Tom Lane @ 2026-01-15 18:55 UTC (permalink / raw)
To: Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
I wrote:
> Fascinating. I was doing my testing on Intel (RHEL8). I'd bet a good
> deal that this is more about the OS than the hardware. I wonder if
> newer Linux versions behave differently.
> I can try to reproduce your results tomorrow on macOS (M4 Pro chip).
It does seem like macOS behaves noticeably differently than Linux.
On a 2024 Mac Mini M4 Pro, running current macOS (Tahoe 26.2),
I get results like these from your preferred test case:
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --sleep 0.01 --sleep-exp 1.01 --duration 10
today's HEAD:
10 s: 6211 sent (657/s), 6217 received (566/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms 0 (0.0%) avg: 0.000ms
1.00-10.00ms 0 (0.0%) avg: 0.000ms
10.00-100.00ms # 77 (1.2%) avg: 58.837ms
>100.00ms ######### 6140 (98.8%) avg: 725.985ms
v34 patch:
10 s: 143733 sent (14618/s), 143733 received (14630/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 11384 (7.9%) avg: 0.084ms
0.10-1.00ms ###### 93357 (65.0%) avg: 0.266ms
1.00-10.00ms ## 36566 (25.4%) avg: 3.525ms
10.00-100.00ms # 2426 (1.7%) avg: 20.367ms
>100.00ms 0 (0.0%) avg: 0.000ms
v35 patch:
10 s: 43872 sent (4652/s), 43870 received (4651/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 267 (0.6%) avg: 0.084ms
0.10-1.00ms # 2105 (4.8%) avg: 0.337ms
1.00-10.00ms # 3861 (8.8%) avg: 5.372ms
10.00-100.00ms ####### 31756 (72.4%) avg: 51.454ms
>100.00ms # 5881 (13.4%) avg: 117.087ms
But on my Intel workstation (Xeon W-2245, up-to-date RHEL 8)
it looks like this:
HEAD:
10 s: 15324 sent (1565/s), 15326 received (1567/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms 0 (0.0%) avg: 0.000ms
1.00-10.00ms # 15 (0.1%) avg: 6.743ms
10.00-100.00ms # 92 (0.6%) avg: 56.650ms
>100.00ms ######### 15219 (99.3%) avg: 253.127ms
v34:
10 s: 198891 sent (20011/s), 198890 received (20010/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms ####### 158004 (79.4%) avg: 0.067ms
0.10-1.00ms # 39661 (19.9%) avg: 0.183ms
1.00-10.00ms # 1051 (0.5%) avg: 2.832ms
10.00-100.00ms # 175 (0.1%) avg: 14.321ms
>100.00ms 0 (0.0%) avg: 0.000ms
v35:
10 s: 190192 sent (19966/s), 190192 received (19966/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms ####### 151957 (79.9%) avg: 0.063ms
0.10-1.00ms # 31932 (16.8%) avg: 0.207ms
1.00-10.00ms # 2693 (1.4%) avg: 3.768ms
10.00-100.00ms # 3610 (1.9%) avg: 25.191ms
>100.00ms 0 (0.0%) avg: 0.000ms
This doesn't make a lot of sense if you compare the hardware specs:
the M4 Pro has more than double the geekbench ratings of the W-2245,
yet it runs these tests much more slowly. I think perhaps there is
something about the way we do sleep/wakeup on macOS that is not as
well optimized as Linux.
Also, I was testing various cases that just go as fast as possible,
no --sleep. On the Mac (ignoring plain HEAD, it's not in the
same league):
v34:
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 100 --duration 10
10 s: 674530 sent (67781/s), 674529 received (67781/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 1 (0.0%) avg: 0.572ms
1.00-10.00ms ######### 674528 (100.0%) avg: 1.513ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 10 --notifiers 1 --channels 100 --duration 10
10 s: 81097 sent (8015/s), 810962 received (80163/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 64 (0.0%) avg: 0.497ms
1.00-10.00ms ## 224131 (27.6%) avg: 9.105ms
10.00-100.00ms ####### 586772 (72.4%) avg: 14.290ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --duration 10
10 s: 363167 sent (36772/s), 363167 received (36772/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms 0 (0.0%) avg: 0.000ms
1.00-10.00ms # 2 (0.0%) avg: 5.912ms
10.00-100.00ms ######### 362165 (99.7%) avg: 27.450ms
>100.00ms # 1000 (0.3%) avg: 129.015ms
v35:
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 100 --duration 10
10 s: 707180 sent (70699/s), 707179 received (70698/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 1 (0.0%) avg: 0.084ms
0.10-1.00ms # 43 (0.0%) avg: 0.862ms
1.00-10.00ms ######### 707135 (100.0%) avg: 1.441ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 10 --notifiers 1 --channels 100 --duration 10
10 s: 76301 sent (7632/s), 763000 received (76325/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 28 (0.0%) avg: 0.886ms
1.00-10.00ms # 91294 (12.0%) avg: 9.351ms
10.00-100.00ms ######## 671680 (88.0%) avg: 13.953ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --duration 10
10 s: 313326 sent (36772/s), 313325 received (36771/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms 0 (0.0%) avg: 0.000ms
1.00-10.00ms 0 (0.0%) avg: 0.000ms
10.00-100.00ms ######### 302951 (96.7%) avg: 27.686ms
>100.00ms # 10375 (3.3%) avg: 164.892ms
But on the Linux box:
v34:
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 100 --duration 10
10 s: 914323 sent (92641/s), 914322 received (92641/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 13 (0.0%) avg: 0.045ms
0.10-1.00ms # 17 (0.0%) avg: 0.598ms
1.00-10.00ms ######### 914292 (100.0%) avg: 1.105ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 10 --notifiers 1 --channels 100 --duration 10
10 s: 205412 sent (20655/s), 2054112 received (206550/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 149 (0.0%) avg: 0.455ms
1.00-10.00ms ######### 2053945 (100.0%) avg: 4.888ms
10.00-100.00ms # 18 (0.0%) avg: 10.430ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --duration 10
10 s: 331651 sent (33458/s), 331649 received (33458/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms # 1 (0.0%) avg: 0.999ms
1.00-10.00ms # 1 (0.0%) avg: 1.035ms
10.00-100.00ms ######### 331648 (100.0%) avg: 30.134ms
>100.00ms 0 (0.0%) avg: 0.000ms
v35:
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 100 --duration 10
10 s: 940448 sent (95016/s), 940447 received (95016/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 11 (0.0%) avg: 0.075ms
0.10-1.00ms # 557 (0.1%) avg: 0.660ms
1.00-10.00ms ######### 939880 (99.9%) avg: 1.074ms
10.00-100.00ms 0 (0.0%) avg: 0.000ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 10 --notifiers 1 --channels 100 --duration 10
10 s: 208430 sent (21033/s), 2084298 received (210329/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms # 121 (0.0%) avg: 0.071ms
0.10-1.00ms # 455 (0.0%) avg: 0.493ms
1.00-10.00ms ######### 2082795 (99.9%) avg: 4.813ms
10.00-100.00ms # 929 (0.0%) avg: 11.882ms
>100.00ms 0 (0.0%) avg: 0.000ms
$ ./pg_async_notify_test --listeners 1 --notifiers 1 --channels 1000 --duration 10
10 s: 351154 sent (35975/s), 351153 received (35976/s)
Notification Latency Distribution:
0.00-0.01ms 0 (0.0%) avg: 0.000ms
0.01-0.10ms 0 (0.0%) avg: 0.000ms
0.10-1.00ms 0 (0.0%) avg: 0.000ms
1.00-10.00ms 0 (0.0%) avg: 0.000ms
10.00-100.00ms ########## 351154 (100.0%) avg: 28.460ms
>100.00ms 0 (0.0%) avg: 0.000ms
Also, I looked at "perf" results for the as-fast-as-possible
runs, and was interested to see that the directly notify-related
logic accounts for only 10%-15% of the runtime. The rest is going
into generic transaction housekeeping, client I/O, kernel overhead,
and so on. So that bolsters my feeling that we should be minimizing
process wakeups rather than trying to optimize anything more in
the notify processing itself.
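As a rough illustration of why a 10%-15% share limits the payoff, Amdahl's law gives an upper bound on the overall gain from speeding up just that share. The sketch below assumes the perf percentages translate directly into a fraction of runtime, which is a simplification; the function name max_speedup is hypothetical.

```python
def max_speedup(fraction, factor=float("inf")):
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is made `factor` times faster (factor=inf means it becomes free)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = (1.0 - fraction) + fraction / factor
    return float("inf") if remaining == 0.0 else 1.0 / remaining


if __name__ == "__main__":
    for share in (0.10, 0.15):
        print(f"notify share {share:.0%}: at most {max_speedup(share):.3f}x overall")
```

Under that assumption, even making the notify-specific code free would buy at most 1/0.85, or about 1.18x, which is small next to the differences between runs shown above.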
Anyway, at this point I'm content to go ahead with v35, and
I'll push that in a little bit. Perhaps we should take a TODO
to figure out why this test scenario runs so poorly on macOS;
but I'll bet that the answer is not anywhere near async.c itself.
regards, tom lane
* Re: Optimize LISTEN/NOTIFY
@ 2026-04-19 04:00 Alexander Lakhin <[email protected]>
parent: Arseniy Mukhin <[email protected]>
5 siblings, 1 reply; 120+ messages in thread
From: Alexander Lakhin @ 2026-04-19 04:00 UTC (permalink / raw)
To: Tom Lane <[email protected]>; Joel Jacobson <[email protected]>; +Cc: pgsql-hackers
Hello Tom and Joel,
15.01.2026 20:55, Tom Lane wrote:
> Anyway, at this point I'm content to go ahead with v35, and
> I'll push that in a little bit. Perhaps we should take a TODO
> to figure out why this test scenario runs so poorly on macOS;
> but I'll bet that the answer is not anywhere near async.c itself.
While browsing through new inconsistencies and typos, I came across one
which I'm not sure what to do with. Could you help, please?
async-notify.spec contains:
# Check ChannelHashAddListener array growth.
permutation listenc llisten l2listen l3listen lslisten
But as far as I can see, ChannelHashAddListener() was eliminated in
0002-optimize_listen_notify-v13.patch upthread [1]:
> > Or thinking a little bigger: why are we maintaining the set of
> > channels-listened-to both as a list and a hash? Could we remove
> > the list form?
>
> Yes, it was indeed possible to remove the list form.
>
So, maybe the comment or perhaps even the test case should be changed/
removed?
[1] https://www.postgresql.org/message-id/8bfca2be-1ec0-4e15-aafb-0b7b661fe936%40app.fastmail.com
Best regards,
Alexander
* Re: Optimize LISTEN/NOTIFY
@ 2026-05-20 11:47 Joel Jacobson <[email protected]>
parent: Alexander Lakhin <[email protected]>
0 siblings, 0 replies; 120+ messages in thread
From: Joel Jacobson @ 2026-05-20 11:47 UTC (permalink / raw)
To: Alexander Lakhin <[email protected]>; +Cc: pgsql-hackers
On Sat, Apr 18, 2026, at 21:00, Alexander Lakhin wrote:
> While browsing through new inconsistencies and typos, I came across one
> which I'm not sure what to do with. Could you help, please?
>
> async-notify.spec contains:
> # Check ChannelHashAddListener array growth.
> permutation listenc llisten l2listen l3listen lslisten
>
> But as far as I can see, ChannelHashAddListener() was eliminated in
> 0002-optimize_listen_notify-v13.patch upthread [1]:
>
>> > Or thinking a little bigger: why are we maintaining the set of
>> > channels-listened-to both as a list and a hash? Could we remove
>> > the list form?
>> Yes, it was indeed possible to remove the list form.
>>
>
> So, maybe the comment or perhaps even the test case should be changed/
> removed?
Yes, that test is a leftover from a previous patch version.
I'll post a patch to remove it in a separate thread.
Thanks for spotting.
/Joel