Feedback-ID: ic6394509:Fastmail
MIME-Version: 1.0
Date: Wed, 16 Jul 2025 09:00:05 +0200
From: "Joel Jacobson" <joel@compiler.org>
To: "Rishu Bagga" <rishu.postgres@gmail.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>
Message-Id: <af75d742-1b74-43aa-8777-e1de7a36fdba@app.fastmail.com>
In-Reply-To: 
 <CAK80=jhmE40KVqQ3ho37MArS7cAED1p9m7uikDxcnDmqdW7t8A@mail.gmail.com>
References: <6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com>
 <165530.1752362320@sss.pgh.pa.us>
 <02a7cd37-e2fc-4212-8b19-f8c239c95fb8@app.fastmail.com>
 <e396ecf3-4227-4918-b9ff-e9568dcebcf0@app.fastmail.com>
 <b6a427fe-2ed6-4568-85d9-207a68172617@app.fastmail.com>
 <CAK80=jhmE40KVqQ3ho37MArS7cAED1p9m7uikDxcnDmqdW7t8A@mail.gmail.com>
Subject: Re: Optimize LISTEN/NOTIFY
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/af75d742-1b74-43aa-8777-e1de7a36fdba%40app.fastmail.com>
Precedence: bulk

On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> Hi Joel,
>
> Thanks for sharing the patch.
> I have a few questions based on a cursory first look.
>
>> If a single listener is found, we signal only that backend.
>> Otherwise, we fall back to the existing broadcast behavior.
>
> The idea of not wanting to wake up all backends makes sense to me,
> but I don’t understand why we want this optimization only for the case
> where there is a single backend listening on a channel.
>
> Is there a pattern of usage in LISTEN/NOTIFY where users typically
> have either just one or several backends listening on a channel?
>
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those channels?

Thanks for the thoughtful question. You've hit on the central design trade-off
in this optimization: how to provide targeted signaling for some workloads
without degrading performance for others.

While we don't have telemetry on real-world usage patterns of LISTEN/NOTIFY,
it seems likely that most applications fall into one of three categories,
which I've been thinking of in networking terms:

1. Broadcast-style ("hub mode")

Many backends listening on the *same* channel (e.g., for cache invalidation).
The current implementation is already well-optimized for this, behaving like
an Ethernet hub that broadcasts to all ports. Waking all listeners is efficient
because they all need the message.

2. Targeted notifications ("switch mode")

Each backend listens on its own private channel (e.g., for session events or
worker queues). This is where the current implementation scales poorly, as every
NOTIFY wakes up all listeners regardless of relevance. My patch is designed
to make this behave like an efficient Ethernet switch.

3. Selective multicast-style ("group mode")

A subset of backends shares a channel, but not all. This is the tricky middle
ground. Your question, "why not maintain a list of backends for each channel,
and only wake up those channels?" is exactly the right one to ask.
A full listener list seems like the obvious path to optimizing for *all* cases.
However, the devil is in the details of concurrency and performance. Managing
such a list would require heavier locking, which would create a new bottleneck
and degrade the scalability of LISTEN/UNLISTEN operations, especially for
the "hub mode" case where many backends rapidly subscribe to the same popular
channel.

This patch makes a deliberate architectural choice:
Prioritize a massive, low-risk win for "switch mode" while rigorously protecting
the performance of "hub mode".

It introduces a targeted fast path for single-listener channels and cleanly
falls back to the existing, well-performing broadcast model for everything else.
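
To make the fast path and its fallback concrete, here is a minimal sketch of
the notify-side decision. It is not code from the patch: the struct layout,
the enum, and the function name choose_signal_action are hypothetical, and
ProcNumber is replaced by a plain int stand-in so the snippet compiles on its
own. Only the has_multiple_listeners flag name is taken from the description
in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's ProcNumber (assumption: a plain int here). */
typedef int ProcNumber;
#define INVALID_PROC_NUMBER (-1)

/* Hypothetical per-channel hash entry; the layout is illustrative only. */
typedef struct ChannelEntry
{
    ProcNumber  listener;               /* the one known listener */
    bool        has_multiple_listeners; /* set once a second one joins */
} ChannelEntry;

typedef enum SignalAction
{
    SIGNAL_ONE,         /* wake exactly one backend ("switch mode") */
    SIGNAL_BROADCAST    /* existing behavior: wake all listening backends */
} SignalAction;

/*
 * Decide whom to wake for a NOTIFY on one channel.  entry may be NULL when
 * nothing is known about the channel; in that case, and whenever the flag
 * is set, we fall back to broadcasting.
 */
static SignalAction
choose_signal_action(const ChannelEntry *entry, ProcNumber *target)
{
    assert(target != NULL);
    *target = INVALID_PROC_NUMBER;

    if (entry != NULL &&
        !entry->has_multiple_listeners &&
        entry->listener != INVALID_PROC_NUMBER)
    {
        *target = entry->listener;
        return SIGNAL_ONE;
    }
    return SIGNAL_BROADCAST;
}
```

The point of the sketch is that the sender needs only one flag and one
ProcNumber per channel to choose between the two paths.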

This brings us back to "group mode", which remains an open optimization problem.
A possible approach could be to track listeners up to a small threshold *K*
(e.g., store up to 4 ProcNumbers in the hash entry). If the count exceeds *K*,
we would flip a "broadcast" flag and revert to hub-mode behavior.
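
As a rough illustration of that idea (not part of the patch), the hash entry
could look something like the sketch below. GroupEntry, K_LISTENERS, and
group_add_listener are hypothetical names, ProcNumber is again a plain int
stand-in, and UNLISTEN is not modeled. The return value records whether the
call had to modify shared state, which matters for the locking discussion.

```c
#include <assert.h>
#include <stdbool.h>

#define K_LISTENERS 4           /* the threshold K; 4 as in the example */

/* Stand-in for PostgreSQL's ProcNumber (assumption: a plain int here). */
typedef int ProcNumber;

/* Hypothetical hash entry that tracks up to K listeners per channel. */
typedef struct GroupEntry
{
    int         nlisteners;             /* used slots in listeners[] */
    ProcNumber  listeners[K_LISTENERS];
    bool        broadcast;              /* set once more than K listen */
} GroupEntry;

/*
 * Register proc as a listener.  Returns true if the entry was modified,
 * i.e. if the caller would have needed exclusive access to it.
 */
static bool
group_add_listener(GroupEntry *e, ProcNumber proc)
{
    assert(e->nlisteners >= 0 && e->nlisteners <= K_LISTENERS);

    if (e->broadcast)
        return false;           /* already in hub mode, nothing to record */

    for (int i = 0; i < e->nlisteners; i++)
    {
        if (e->listeners[i] == proc)
            return false;       /* already tracked */
    }

    if (e->nlisteners < K_LISTENERS)
    {
        e->listeners[e->nlisteners++] = proc;   /* mutate the array */
        return true;
    }

    e->broadcast = true;        /* count exceeds K: revert to hub mode */
    return true;
}
```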

However, this path has two drawbacks:

1. Performance Penalty for Hub Mode

With the current patch, after the second listener joins a channel,
the has_multiple_listeners flag is set. Every subsequent listener can acquire
a shared lock, see the flag is true, and immediately continue. This is
a highly concurrent, read-only operation that does not require mutating shared
state.

In contrast, the K-listener approach would force every new listener (from the
third up to the K-th) to acquire an exclusive lock to mutate the shared
listener array. This would serialize LISTEN operations on popular channels,
creating the very contention point this patch successfully avoids and directly
harming the hub-mode use case that currently works well.
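
A back-of-the-envelope model of the difference, counting how many LISTEN
calls on one channel would need an exclusive lock. This is a simplification
under stated assumptions, not a measurement: it assumes the first listener
takes an exclusive lock to create the entry, that the (K+1)-th listener also
takes one to flip the broadcast flag, and it ignores UNLISTEN. Both function
names are hypothetical.

```c
#include <assert.h>

/*
 * Flag scheme (current patch, as described above): the first listener
 * creates the entry, the second sets has_multiple_listeners, and every
 * later listener only reads the flag under a shared lock.
 */
static int
exclusive_locks_flag_scheme(int n)
{
    assert(n >= 0);
    return n < 2 ? n : 2;
}

/*
 * K-listener scheme: each of the first k listeners appends to the shared
 * array, and listener k + 1 flips the broadcast flag (assumption).
 */
static int
exclusive_locks_k_scheme(int n, int k)
{
    assert(n >= 0 && k >= 1);
    return n <= k ? n : k + 1;
}
```

Under this model, 100 backends joining one channel need 2 exclusive
acquisitions with the flag scheme, versus 5 with K = 4 or 65 with K = 64,
so the serialized portion grows with K.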

2. Uncertainty

Compounding this, without clear data on typical "group" sizes, choosing a value
for *K* is a shot in the dark. A small *K* might not help much, while
a large *K* would increase the shared memory footprint and worsen the
serialization penalty.

For these reasons, attempting to build a switch that also optimizes for
multicast risks undermining the architectural clarity and performance of
both the switch and hub models.

This patch, therefore, draws a clean line. It provides a precise,
low-cost path for switch-mode workloads and preserves the existing,
well-performing path for hub-mode workloads. While this leaves "group mode"
unoptimized for now, it ensures the two common use cases are well served
without making any use case worse. The new infrastructure is flexible, leaving
the door open should a better approach for "group mode" emerge in
the future, one that doesn't compromise the other two.

Updated benchmarks comparing master vs 0001-optimize_listen_notify-v3.patch:
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections.png

I've not included the benchmark CSV data in this mail, since it's quite heavy,
160 kB, and I couldn't see any significant performance changes since v2.

/Joel