Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v8s3f-006Aec-PE for pgsql-hackers@arkaria.postgresql.org; Wed, 15 Oct 2025 03:20:11 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1v8s3d-004I8z-JI for pgsql-hackers@arkaria.postgresql.org; Wed, 15 Oct 2025 03:20:08 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1v8s3d-004I8r-7C for pgsql-hackers@lists.postgresql.org; Wed, 15 Oct 2025 03:20:08 +0000 Received: from mail-pl1-x62a.google.com ([2607:f8b0:4864:20::62a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1v8s3Z-002Ewa-2M for pgsql-hackers@postgresql.org; Wed, 15 Oct 2025 03:20:07 +0000 Received: by mail-pl1-x62a.google.com with SMTP id d9443c01a7336-27eed7bdfeeso6454295ad.0 for ; Tue, 14 Oct 2025 20:20:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1760498403; x=1761103203; darn=postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=SinHqFa27vUUITMm5NDQCNexIrOU+zE0cZVtqt3cQTo=; b=hVg1Q3ylNIkmGYZvg07wlh4Lx1sTw4M82G+4EODXoAlY3N/BWDBHXze57tAUAkuXiB Ge363RFuT1hJfRZ2Oprt/3NI+V0uTbEcD5NVh0J5cWmbLfU/3OIO0RCEtJEYYVWlycEq ObJjTwDxdKDCHRLtSGAmQbAvC12bOaH8AaJK1OXFU0brqHXQuiLfDwX3vakiD32sjXjE dPxThID/OVUyd4SVbL+MLgatdjeGNlnLSncOpR9ImVjdiA8tVavn6h0IRKqUwCkNERnx eubr4AF2Ttlkru/4W9Em0NdlGU2D22/BMBQWmM8i0DPVYvqx46U58LgV3f0PTy+1kQtC fXfQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760498403; x=1761103203; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=SinHqFa27vUUITMm5NDQCNexIrOU+zE0cZVtqt3cQTo=; b=xVTPf+7t98lyO3C7HhZX2DqIWuqNIJFraimLAUXgzmwFRxtEa6vnVSwwnd/kOstzNf M1albhW1wb6gZHt60I9wR/UXSWX/XsNljGx84pjhaOWFVoP0Hg2qxsxOjidN1/vAh3dH jm8N6Cj4BVJ2SDKOQ4sxWkiFZfApw5vNI+bHKfzLzZ0d3muf3edESrcbssy84RHBKXiy GTWGuSFItqez00ZUbLGTInSsPt4Z/h0OedbOKUkX75I7uGlVabR4aQa8U9JdbAfMb88e H75YPjhmwRef9sfmhyGcvR5cYFGvvyj83J/K2G3lcu62xBLIBTbvkhtslUQjnpQj+Cob dfAQ== X-Gm-Message-State: AOJu0Yz12VxvI3iPaDFLs1es/DTwfgy5iaKctKTyMcBzdlV5TDRgUXLW Y5jju2se8q/n77qKLyRX861SzijJAt+haNaM7qdKHsA5xqK3NMyI/xMM X-Gm-Gg: ASbGncug8lJuNtSfsZK9PJ50ZKiAKpKPHEtq54Kj4vfwdgquW++EriqTbbp4/4UNzjR HqucP8ZGzbQHtTv3kSJJF+q7FXRowgiuG1NyzLGCu4xrgF78OieFf6OESeAFbvh/4w9gB6nNWFs ptUOWWy4EQHoCbgB6ZuQtpNU1+vME9zYm6mXtBglIv0A3DxkX+Kt4lOq+g5UBGQ5TItlVFfF42g ZIOW18Prl9K9eKxcz5mCKSQtuvzCDYj8wT1FE5v22aCBMsj3y3QnP86QURhdoEJfM4cHvuOgR8d UqsGojDConZVeWBKGHku+jWuFEa4/Sqx5cWmvgMKLzixQ54i+VHwgPUNQv5v7RdwqBrUP2fikAt mTxVp3ZHcorv8IQ3ykWZ1StvewtuO8MWot7Lt45U6/fhxwOPRUhaA X-Google-Smtp-Source: AGHT+IFizAhovzL6QHXs9JcQdXgPrwHOykeyUUP4QNOaKVoQh/Z4/mUukkTffJ0hAkdnKzWQOPmX0A== X-Received: by 2002:a17:902:e94e:b0:271:9b0e:54ca with SMTP id d9443c01a7336-28ec9c372a0mr402076455ad.13.1760498402908; Tue, 14 Oct 2025 20:20:02 -0700 (PDT) Received: from smtpclient.apple ([170.178.170.211]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-29034de6fd3sm180714085ad.25.2025.10.14.20.20.00 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 14 Oct 2025 20:20:02 -0700 (PDT) From: Chao Li Message-Id: <1F7227F5-C33D-4E2C-8511-33F1468590D0@gmail.com> Content-Type: multipart/alternative; boundary="Apple-Mail=_D819A24E-9724-454A-9FE3-505877FDB912" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.700.81\)) Subject: Re: Optimize LISTEN/NOTIFY Date: Wed, 15 Oct 2025 11:19:47 +0800 In-Reply-To: <1009807.1760476747@sss.pgh.pa.us> Cc: pgsql-hackers To: Tom Lane , Joel Jacobson References: <6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com> <165530.1752362320@sss.pgh.pa.us> <02a7cd37-e2fc-4212-8b19-f8c239c95fb8@app.fastmail.com> <96f00bf1-cc9d-4520-9d02-9e14e7767c88@app.fastmail.com> <30c2aa7d-dd6c-4b68-a2e4-f217a1a34acf@app.fastmail.com> <0b4d402a-9ac2-4aa8-acf8-8231dbe579ea@app.fastmail.com> <3095599.1758644879@sss.pgh.pa.us> <0dc6a2cc-5216-4dc1-9dd2-430cafc6095b@app.fastmail.com> <52CC167F-763B-4ECA-B0B4-DAB381816828@gmail.com> <9186C6D0-F7A9-482A-9183-89E530B57E36@gmail.com> <1073593.1759423179@sss.pgh.pa.us> <4bd5e6c4-6fa7-44bb-869d-59a32a331fa8@app.fastmail.com> <85828f29-e72e-4400-94f3-9a69bc8dc239@app.fastmail.com> <2495353.1759860890@sss.pgh.pa.us> <8aeae418-92a6-4bbd-9c06-9574c79e59f7@app.fastmail.com> <2531672.1759868124@sss.pgh.pa.us> <474efa78-337c-41cd-a73a-f845a0115109@app.fastmail.com> <2749343.1759949176@sss.pgh.pa.us> <8bfca2be-1ec0-4e15-aafb-0b7b661fe936@app.fastmail.com> <9eba307f-f2fb-48f0-9507-2e197f39ef9e@app.fastmail.com> <8c71183a-0d28-4bcf-a806-78446ff95404@app.fastmail.com> <1009807.1760476747@sss.pgh.pa.us> X-Mailer: Apple Mail (2.3826.700.81) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_D819A24E-9724-454A-9FE3-505877FDB912 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On Oct 15, 2025, at 05:19, Tom Lane wrote: >=20 > "Joel Jacobson" writes: >> Having investigated this, the "direct advancement" approach seems >> correct to me. >=20 >> (I understand the exclusive lock in PreCommit_Notify on = NotifyQueueLock >> is of course needed because there are other operations that don't >> acquire the heavyweight-lock, that take shared/exclusive lock on >> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on >> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the >> QUEUE_HEAD.) >=20 > Right. What the heavyweight lock buys for us in this context is that > we can be sure no other would-be notifier can insert any messages > in between ours, even though we may take and release NotifyQueueLock > several times to allow readers to sneak in. That in turn means that > it's safe to advance readers over that whole set of messages if we > know we didn't wake them up for any of those messages. >=20 > There is a false-positive possibility if a reader was previously > signaled but hasn't yet awoken: we will think that maybe we signaled > it and hence not advance its pointer. This is an error in the safe > direction however, and it will advance its pointer when it does > wake up. >=20 > A potential complaint is that we are doubling down on the need for > that heavyweight lock, despite the upthread discussion about maybe > getting rid of it for better scalability. However, this patch > only requires holding a lock across all the insertions, not holding > it through commit which I think is the true scalability blockage. > If we did want to get rid of that lock, we'd only need to stop > releasing NotifyQueueLock at insertion page boundary crossings, > which I suspect isn't really that useful anyway. (In connection > with that though, I think you ought to capture both the "before" and > "after" pointers within that lock interval, not expend another lock > acquisition later.) >=20 > It would be good if the patch's comments made these points ... > also, the comments above struct AsyncQueueControl need to be > updated, because changing some other backend's queue pos is > not legal under any of the stated rules. >=20 I used to think =E2=80=9Cdirect advancement=E2=80=9D was a good idea. = After reading Tom=E2=80=99s explanation, and reading v16 again = carefully, now I also consider it=E2=80=99s adding complexity and could = be fragile. I just composed an example of race condition, please see if it is valid. Because recoding queueHeadBeforeWrite and queueHeadAfterWrite happen in = PreCommit_Notify() and checking them happens in AtCommit_Notify(), there = is an interval in between, something may happen. Say a listener A, it=E2=80=99s head pointing to 1. And current QueueHead is 1. Now two notifiers B and C are committing: * B enters PreCommit_Notify(), it gets the NotifyQueueLock first, it = records headBeforeWrite =3D 1 and writes to 3, and records = headAfterWrite =3D 3. * Now QueueHead is 3. * C enters PreCommit_Notify(), it records headBeforeWrite =3D 3 and = writes to 5, and records headAfterWrite =3D 5. * Now QueueHead is 5 * C starts to run AtCommit_Notify(), as A=E2=80=99s head is 1, = doesn=E2=80=99t equal to C=E2=80=99s headBeforeWrite, C won=E2=80=99t = advance A=E2=80=99s head. * A starts to run AtCommit_Notify(), A=E2=80=99s head equals to B=E2=80=99= s beforeHeadWrite, B will advance A=E2=80=99s head to 3. * At this time, QueueHead is 5, and A=E2=80=99s head is 3, so =E2=80=9Cdi= rect advancement=E2=80=9D will never work for A until A wakes up next = time. I am brainstorming. Maybe we can use a simpler strategy. If a = backend=E2=80=99s queue lag exceeds a threshold, then wake it up. This = solution is simpler and reliable, also reducing the total wake-up count. >=20 >> Given all the experiments since my earlier message, here is a fresh, >> self-contained write-up: >=20 > I'm getting itchy about removing the local listenChannels list, > because what you've done is to replace it with a shared data > structure that can't be accessed without a good deal of locking > overhead. That seems like it could easily be a net loss. >=20 > Also, I really do not like this implementation of > GetPendingNotifyChannels, as it looks like O(N^2) effort. > The potentially large length of the list it builds is scary too, > considering the comments that SignalBackends had better not fail. > If we have to do it that way it'd be better to collect the list > during PreCommit_Notify. >=20 I agree with Tom that GetPendingNotifyChannels() is too heavy and = unnecessary. In PreCommit_Notify(), we can maintain a local hash table to record = pending nofications=E2=80=99 channel names. dahash also supports hash = table in local memory. Then in SignalBackends(), we no longer need GetPendingNotifyChannels(), = we can just iterate all keys of the local channel name hash. And the local static numChannelsListeningOn is also not needed. We can = get the count from the local hash. WRT to v6, I got a few new comments: 1 - 0002 ``` * After commit we are called another time (AtCommit_Notify()). = Here we - * make any actual updates to the effective listen state = (listenChannels). + * make any actual updates to the effective listen state = (channelHash). * Then we signal any backends that may be interested in our = messages * (including our own backend, if listening). This is done by - * SignalBackends(), which scans the list of listening backends = and sends a - * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we = don't - * know which backend is listening on which channel so we must = signal them - * all). We can exclude backends that are already up to date, = though, and - * we can also exclude backends that are in other databases = (unless they - * are way behind and should be kicked to make them advance their - * pointers). + * SignalBackends(), which consults the shared channel hash table = to + * identify listeners for the channels that have pending = notifications + * in the current database. Each selected backend is marked as = having a + * wakeup pending to avoid duplicate signals, and a = PROCSIG_NOTIFY_INTERRUPT + * signal is sent to it. ``` In this comment, you refer to =E2=80=9CchannelHash=E2=80=9D and =E2=80=9Ct= he shared channel hash table=E2=80=9D, they are the same thing, but easy = to make readers to misunderstand. 2 - 0002 ``` pg_listening_channels(PG_FUNCTION_ARGS) { FuncCallContext *funcctx; + List *listenChannels; =20 /* stuff done only on the first call of the function */ if (SRF_IS_FIRSTCALL()) { + MemoryContext oldcontext; + dshash_seq_status status; + ChannelEntry *entry; + /* create a function context for cross-call persistence = */ funcctx =3D SRF_FIRSTCALL_INIT(); ``` listenChannels is only used within the =E2=80=9Cif=E2=80=9D, so it=E2=80=99= s definition can be moved into the =E2=80=9Cif=E2=80=9D. 3 - 0002 ``` + queue_length =3D asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD), + = QUEUE_POS_PAGE(QUEUE_TAIL)); + + /* Check for lagging backends when the queue spans multiple = pages */ + if (queue_length > 0) + { ``` I wonder why this check is needed. If queue_length is 0, can we return = immediately from SignalBackends()? Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ --Apple-Mail=_D819A24E-9724-454A-9FE3-505877FDB912 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8

On Oct 15, 2025, at 05:19, Tom Lane = <tgl@sss.pgh.pa.us> wrote:

"Joel Jacobson" = <joel@compiler.org> writes:
Having = investigated this, the "direct advancement" approach seems
correct to = me.

(I understand the = exclusive lock in PreCommit_Notify on NotifyQueueLock
is of course = needed because there are other operations that don't
acquire the = heavyweight-lock, that take shared/exclusive lock on
NotifyQueueLock = to read/modify QUEUE_HEAD, so the exclusive lock on
NotifyQueueLock = in PreCommit_Notify is needed, since it modifies = the
QUEUE_HEAD.)

Right.  What the = heavyweight lock buys for us in this context is that
we can be sure = no other would-be notifier can insert any messages
in between ours, = even though we may take and release NotifyQueueLock
several times to = allow readers to sneak in.  That in turn means that
it's safe to = advance readers over that whole set of messages if we
know we didn't = wake them up for any of those messages.

There is a false-positive = possibility if a reader was previously
signaled but hasn't yet = awoken: we will think that maybe we signaled
it and hence not advance = its pointer.  This is an error in the safe
direction however, = and it will advance its pointer when it does
wake up.

A = potential complaint is that we are doubling down on the need for
that = heavyweight lock, despite the upthread discussion about maybe
getting = rid of it for better scalability.  However, this patch
only = requires holding a lock across all the insertions, not holding
it = through commit which I think is the true scalability blockage.
If we = did want to get rid of that lock, we'd only need to stop
releasing = NotifyQueueLock at insertion page boundary crossings,
which I suspect = isn't really that useful anyway.  (In connection
with that = though, I think you ought to capture both the "before" and
"after" = pointers within that lock interval, not expend another = lock
acquisition later.)

It would be good if the patch's = comments made these points ...
also, the comments above struct = AsyncQueueControl need to be
updated, because changing some other = backend's queue pos is
not legal under any of the stated = rules.


I used to = think =E2=80=9Cdirect advancement=E2=80=9D was a good idea. After = reading Tom=E2=80=99s explanation, and reading v16 again carefully, now = I also consider it=E2=80=99s adding complexity and could be = fragile.

I just composed an example of race = condition, please see if it is valid.

Because = recoding queueHeadBeforeWrite and queueHeadAfterWrite happen in = PreCommit_Notify() and checking them happens in AtCommit_Notify(), there = is an interval in between, something may = happen.

Say a listener A, it=E2=80=99s head = pointing to 1.

And current QueueHead is = 1.

Now two notifiers B and C are = committing:
 * B enters PreCommit_Notify(), it gets the = NotifyQueueLock first, it records headBeforeWrite =3D 1 and writes to 3, = and records headAfterWrite =3D 3.
 * Now QueueHead is = 3.
 * C enters PreCommit_Notify(),  it records = headBeforeWrite =3D 3 and writes to 5, and records headAfterWrite =3D = 5.
 * Now QueueHead is 5
 * C starts to = run AtCommit_Notify(), as A=E2=80=99s head is 1, doesn=E2=80=99t equal = to C=E2=80=99s headBeforeWrite, C won=E2=80=99t advance A=E2=80=99s = head.
 * A starts to run AtCommit_Notify(), A=E2=80=99s = head equals to B=E2=80=99s beforeHeadWrite, B will advance A=E2=80=99s = head to 3.
 * At this time, QueueHead is 5, and A=E2=80=99s= head is 3, so =E2=80=9Cdirect advancement=E2=80=9D will never work for = A until A wakes up next time.

I am = brainstorming. Maybe we can use a simpler strategy. If a backend=E2=80=99s= queue lag exceeds a threshold, then wake it up. This solution is = simpler and reliable, also reducing the total wake-up = count.


Given all the = experiments since my earlier message, here is a fresh,
self-contained = write-up:

I'm getting itchy about removing the local = listenChannels list,
because what you've done is to replace it with a = shared data
structure that can't be accessed without a good deal of = locking
overhead.  That seems like it could easily be a net = loss.

Also, I really do not like this implementation = of
GetPendingNotifyChannels, as it looks like O(N^2) effort.
The = potentially large length of the list it builds is scary = too,
considering the comments that SignalBackends had better not = fail.
If we have to do it that way it'd be better to collect the = list
during = PreCommit_Notify.


I = agree with Tom that GetPendingNotifyChannels() is too heavy and = unnecessary.

In PreCommit_Notify(), we can = maintain a local hash table to record pending nofications=E2=80=99 = channel names. dahash also supports hash table in local = memory.

Then in SignalBackends(), we no longer = need GetPendingNotifyChannels(), we can just iterate all keys of the = local channel name hash.

And the local = static numChannelsListeningOn is also not needed. We can get the = count from the local hash.

WRT to v6, I got a = few new comments:

1 - = 0002
```
  *  After commit we are called = another time (AtCommit_Notify()). Here we
- * =  make any actual updates to the effective listen state = (listenChannels).
+ *  make any actual updates to = the effective listen state (channelHash).
  * =  Then we signal any backends that may be interested in our = messages
  *  (including our own = backend, if listening).  This is done by
- * =  SignalBackends(), which scans the list of listening backends and = sends a
- *  PROCSIG_NOTIFY_INTERRUPT = signal to every listening backend (we don't
- * =  know which backend is listening on which channel so we must signal = them
- *  all).  We can exclude = backends that are already up to date, though, and
- *  we = can also exclude backends that are in other databases (unless = they
- *  are way behind and should = be kicked to make them advance their
- * =  pointers).
+ *  SignalBackends(), which = consults the shared channel hash table to
+ * =  identify listeners for the channels that have pending = notifications
+ *  in the current database. =  Each selected backend is marked as having a
+ * =  wakeup pending to avoid duplicate signals, and a = PROCSIG_NOTIFY_INTERRUPT
+ *  signal is sent to = it.
```

In this comment, = you refer to =E2=80=9CchannelHash=E2=80=9D and =E2=80=9Cthe shared = channel hash table=E2=80=9D, they are the same thing, but easy to make = readers to misunderstand.

2 - = 0002
```
 pg_listening_channels(PG_FUNCTION_= ARGS)
 {
  FuncCallContext = *funcctx;
+ List   = *listenChannels;
 
  /* stuff = done only on the first call of the function */
  if = (SRF_IS_FIRSTCALL())
  {
+ = MemoryContext oldcontext;
+ dshash_seq_status = status;
+ ChannelEntry = *entry;
+
  /* create a function = context for cross-call persistence */
  = funcctx =3D = SRF_FIRSTCALL_INIT();
```

listenC= hannels is only used within the =E2=80=9Cif=E2=80=9D, so it=E2=80=99s = definition can be moved into the =E2=80=9Cif=E2=80=9D.

3 - 0002
```
+ = queue_length =3D = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ = =  QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check = for lagging backends when the queue spans multiple pages = */
+ = if (queue_length > 0)
+ = {
```

I wonder why this = check is needed. If queue_length is 0, can we return immediately from = SignalBackends()?

Best regards,
--
Chao Li (Evan)
HighGo Software = Co., Ltd.
https://www.highgo.com/




= --Apple-Mail=_D819A24E-9724-454A-9FE3-505877FDB912--