Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vJwFm-008yHf-2E for pgsql-hackers@arkaria.postgresql.org; Fri, 14 Nov 2025 16:02:26 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vJwFk-007AcM-10 for pgsql-hackers@arkaria.postgresql.org; Fri, 14 Nov 2025 16:02:24 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vJwFj-007AcE-1u for pgsql-hackers@lists.postgresql.org; Fri, 14 Nov 2025 16:02:24 +0000 Received: from fhigh-a2-smtp.messagingengine.com ([103.168.172.153]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1vJwFh-0077pm-1c for pgsql-hackers@postgresql.org; Fri, 14 Nov 2025 16:02:22 +0000 Received: from phl-compute-05.internal (phl-compute-05.internal [10.202.2.45]) by mailfhigh.phl.internal (Postfix) with ESMTP id DF4FB14000FA for ; Fri, 14 Nov 2025 11:02:20 -0500 (EST) Received: from phl-imap-03 ([10.202.2.93]) by phl-compute-05.internal (MEProxy); Fri, 14 Nov 2025 11:02:20 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=compiler.org; h= cc:content-transfer-encoding:content-type:content-type:date:date :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to; s=fm3; t=1763136140; x=1763222540; bh=PYUOH9TbahoVdp7VJ3GXz07KDlCkkPoCRNVbfCQRv2E=; b= W98/r7J57poks+BLg5wofCZaGqiip8GDLCOe0LsR55HS//rxCuRa9krIRQRfm6NR 8sdyA4yxvT0CL1bq0GScARgtDR/7+ZBwQewMa4p03xS+H1CCUbkhD5n77K6YauMW F8PrbtwI8TZL4TB/nr6rr9VPzBAyrTaTnYm7pxGbj9SwsZ3ACQNpZ6CMbZQ+UgRb e60wH7lHPEhQAxmMdE1D7MBxHcgvC20y3C5NAVf4rjy5KpQUTUWS9kPxXqebT33G IARZz3grm1gqlkPCWxazr7Gt7G/ilyM6gvKBpfnYQhdhJ13xsZEf1rOlrm7pBppl IAO4L6SfJuwRvA1qQ/FaGQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :content-type:date:date:feedback-id:feedback-id:from:from :in-reply-to:in-reply-to:message-id:mime-version:references :reply-to:subject:subject:to:to:x-me-proxy:x-me-sender :x-me-sender:x-sasl-enc; s=fm3; t=1763136140; x=1763222540; bh=P YUOH9TbahoVdp7VJ3GXz07KDlCkkPoCRNVbfCQRv2E=; b=ZO430B04rRi45N2UA eSEbnxjhPV9rUjiltqKiy9OYGlv7BQaWBBqFUcuHPYCMnqOOb9vqnIxyUHGw1j5r /vXj2h7bBbcokbP11osWGqlc9w2XokddA8CVFspVfBq+mLvx5J0+SkwTOh2XLzBt qtsZhL+coEt53JNfXLFOoGP1sJFulpoWHG0CYWBTo3vqrc19Gy/bXQY1fAIkhNuO lSzs20CEF6479Mm5IcF3Lb1WA6yxRJjmpilFcTNHLDMtUdeNd8PMxx66EUOkT1/L upluSrQjGJlUCUuyqL2vS10Io3VrXgfQYBH27Nqx6aq90PxDY31jc93N5ZGEgoos 59I6g== X-ME-Sender: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdeggddvuddtvdehucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucenucfjughrpefoggffhffvkfgjfhfutgfgsehtjeertd ertddtnecuhfhrohhmpedflfhovghlucflrggtohgsshhonhdfuceojhhovghlsegtohhm phhilhgvrhdrohhrgheqnecuggftrfgrthhtvghrnhepveelffdtiedugfdvgffftddvie evleetieetgfettdffjedtjeefffelhfelffeinecuvehluhhsthgvrhfuihiivgeptden ucfrrghrrghmpehmrghilhhfrhhomhepjhhovghlsegtohhmphhilhgvrhdrohhrghdpnh gspghrtghpthhtohepuddpmhhouggvpehsmhhtphhouhhtpdhrtghpthhtohepphhgshhq lhdqhhgrtghkvghrshesphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: ic6394509:Fastmail Received: by mailuser.phl.internal (Postfix, from userid 501) id 3EE2818E006C; Fri, 14 Nov 2025 11:02:20 -0500 (EST) X-Mailer: MessagingEngine.com Webmail Interface MIME-Version: 1.0 X-ThreadId: AE1r89ybsZ1g Date: Fri, 14 Nov 2025 17:01:59 +0100 From: "Joel Jacobson" To: pgsql-hackers Message-Id: <2eeea4f1-1b4f-430c-8571-544da04f08dc@app.fastmail.com> In-Reply-To: References: <6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com> <165530.1752362320@sss.pgh.pa.us> <02a7cd37-e2fc-4212-8b19-f8c239c95fb8@app.fastmail.com> <96f00bf1-cc9d-4520-9d02-9e14e7767c88@app.fastmail.com> <30c2aa7d-dd6c-4b68-a2e4-f217a1a34acf@app.fastmail.com> <0b4d402a-9ac2-4aa8-acf8-8231dbe579ea@app.fastmail.com> <3095599.1758644879@sss.pgh.pa.us> <0dc6a2cc-5216-4dc1-9dd2-430cafc6095b@app.fastmail.com> <52CC167F-763B-4ECA-B0B4-DAB381816828@gmail.com> <9186C6D0-F7A9-482A-9183-89E530B57E36@gmail.com> <1073593.1759423179@sss.pgh.pa.us> <4bd5e6c4-6fa7-44bb-869d-59a32a331fa8@app.fastmail.com> <85828f29-e72e-4400-94f3-9a69bc8dc239@app.fastmail.com> <2495353.1759860890@sss.pgh.pa.us> <8aeae418-92a6-4bbd-9c06-9574c79e59f7@app.fastmail.com> <2531672.1759868124@sss.pgh.pa.us> <474efa78-337c-41cd-a73a-f845a0115109@app.fastmail.com> <2749343.1759949176@sss.pgh.pa.us> <8bfca2be-1ec0-4e15-aafb-0b7b661fe936@app.fastmail.com> <9eba307f-f2fb-48f0-9507-2e197f39ef9e@app.fastmail.com> <8c71183a-0d28-4bcf-a806-78446ff95404@app.fastmail.com> <1009807.1760476747@sss.pgh.pa.us> <1F7227F5-C33D-4E2C-8511-33F1468590D0@gmail.com> <0a5a20d3-4621-46b3-b2ab-903f63a20dea@app.fastmail.com> <6F913129-ABEF-4004-AAF3-F22FC34!29AE8@gmail.com> <1547585.1760645808@sss.pgh.pa.us> <14865EB6-0BF4-462B-9072-10BDAC10C052@gmail.com> <0BCA1C2D-B92C-459E-B1A6-6D06BA4C62CF@gmail.com> <55d24cbb-e9ef-491f-a99b-b3dbd7cecdf9@app.fastmail.com> <38574cad-e90d-47b7-a015-753bb6bbc360@app.fastmail.com> <66631FB7-5BEA-4ED5-A694-9AD8B9CCFEE8@gmail.com> <4b7b49a5-5e1a-44a8-93e0-60457d15cb1d@app.fastmail.com> <82DEA2B6-6FC5-4A79-BDE3-1FD72F104A6E@gmail.com> <38de1036-d8cf-420c-b845-edb5a946b191@app.fastmail.com> <87E40BF8-8877-4DBD-9040-99AF8A4E6358@gmail.com> <7556f0d4-03fd-451a-bd34-5f62b424319a@app.fastmail.com> <290910DE-9A03-4AE6-B348-073D5DA96ACC@gmail.com> <4B243750-12BE-4C16-A769-A803268F40E3@gmail.com> <4605CAD6-69D5-4082-B96C-91FC0DE5399D@gmail.com> <294e1641-d658-4d43-8671-60e8ff860532@app.fastmail.com> <26d4dd6a-cfc5-4efc-9704-9cd3216ed712@app.fastmail.com> <7456ec96-7a9c-45a0-988e-ba1c7f9ec937@app.fastmail.com> <0253b822-e8fd-4067-ab24-23493c115a2a@app.fastmail.com> Subject: Re: Optimize LISTEN/NOTIFY Content-Type: text/plain Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Nov 13, 2025, at 08:13, Joel Jacobson wrote: > Attached, please find a new version rebased on top of the bug fix > patches that just got committed in 0bdc777, 797e9ea, 8eeb4a0, and > 1b46990. To help reviewers, here is a new write-up of the patch: PROBLEM ======= The current implementation has no central knowledge of which backend listens on which channel. When a backend commits a transaction that issued NOTIFY, SignalBackends() iterates over all registered listeners in the same database and sends each one a PROCSIG_NOTIFY_INTERRUPT signal, regardless of whether they are listening on the notified channel. This behavior is fine when all listeners are on the same channel, but when many backends are listening on different channels, each NOTIFY triggers unnecessary wakeups and context switches. As the number of idle listeners grows, this often becomes the bottleneck and throughput drops sharply. Performance degrades dramatically: benchmarks show throughput dropping from ~9,000 TPS with few listeners to ~200 TPS with 1,000 idle listeners on unrelated channels - a 45x slowdown purely from waking backends that have no notifications to process. SOLUTION OVERVIEW ================= This patch introduces two optimizations: 1. Targeted Signaling A lazily-created dynamic shared hash table (dshash) backed by dynamic shared memory (DSA) maps (database OID, channel name) to arrays of listening backends (ProcNumbers). This allows the notifier to signal only those backends actually listening on the channels being notified. 2. Direct Advancement Even with targeted signaling, idle backends might still need to be woken to advance their queue read positions past notifications they don't care about. This patch avoids those unnecessary wakeups by directly advancing the queue positions of idle backends that are not listening on the channels being notified. This is possible because all NOTIFY writers are serialized by a heavyweight lock, allowing the notifier to identify precisely which queue entries belong to the current transaction. The notifier can then determine which idle backends are positioned within that range and safely advance their positions without waking them, since we know from the shared channel hash that they are not listening on any of the notified channels. IMPLEMENTATION DETAILS ======================= Shared Channel Hash ------------------- The patch adds a dshash table that maps (dboid, channel) keys to ChannelEntry structures. The listenersArray starts with capacity for 4 listeners and doubles when full. Memory is allocated from a DSA area and freed when a channel has zero listeners. The table is created lazily on the first LISTEN command. The DSA handle and dshash handle are stored in AsyncQueueControl for other backends to attach. Dual Data Structures -------------------- The implementation maintains two complementary data structures: 1. Shared channelHash: Used during commit to determine which backends need to be signaled. Updated during Exec_ListenCommit/UnlistenCommit/ UnlistenAllCommit. 2. Local listenChannelsHash: Changed from a List to an HTAB for fast lookups, used by IsListeningOn(). This separation avoids contention on the shared hash during the frequent IsListeningOn() checks that occur for every notification read from the queue. Direct Advancement Algorithm ----------------------------- In PreCommit_Notify(), while holding the heavyweight lock on "database 0" that serializes all NOTIFY writers: 1. Before writing the first notification, capture queueHeadBeforeWrite 2. Write all notifications for the transaction to the queue 3. After writing the last notification, capture queueHeadAfterWrite The heavyweight lock guarantee means the range [queueHeadBeforeWrite, queueHeadAfterWrite) contains only notifications written by this commit, and no other backend could have inserted entries in this range. SignalBackends() then processes each backend: - If the backend has wakeupPending: skip (already signaled) - If the backend is advancing (reading the queue): If advancingPos < queueHeadAfterWrite: signal it (it will get stuck before our new entries without a signal) - If the backend is idle: If pos < queueHeadBeforeWrite: signal it (it might be interested in older messages) If pos >= queueHeadBeforeWrite AND pos < queueHeadAfterWrite: Direct advance pos to queueHeadAfterWrite (skip our messages entirely, no signal needed) New QueueBackendStatus Fields ----------------------------- Each backend's entry in AsyncQueueControl now includes: wakeupPending: signal sent but not yet processed isAdvancing: backend is advancing its position advancingPos: target position backend is advancing to These flags ensure correct interaction between direct advancement and backends that are concurrently processing their queue. Transaction-Local State ------------------------ PreCommit_Notify() builds a list of unique channels (pendingNotifyChannels) from the transaction's notifications. This list is used by SignalBackends() to look up listeners in the shared hash efficiently, avoiding duplicate lookups when multiple notifications are sent to the same channel. Functions Modified ------------------ AsyncShmemInit Initialize channelHashDSA/DSH handles (InvalidHandle) and new per-backend fields: wakeupPending, isAdvancing, advancingPos. Async_Notify Initialize channelHashtab. pg_listening_channels Rewritten to iterate over listenChannelsHash using HASH_SEQ_STATUS instead of traversing the old listenChannels list. PreCommit_Notify Build pendingNotifyChannels list of unique channels from transaction's notifications. Capture queueHeadBeforeWrite before writing first notification and queueHeadAfterWrite after each write to enable direct advancement optimization. AtCommit_Notify Check hash table entry count instead of list emptiness when deciding whether to unregister from listener array. Exec_ListenCommit Complete rewrite to maintain both local listenChannelsHash and shared channelHash. Insert backend's ProcNumber into DSA-allocated listeners array, growing array (doubling strategy) when full. Exec_UnlistenCommit Remove from both local and shared hashes. Compact listeners array with memmove, free DSA memory and delete hash entry when last listener removed. Exec_UnlistenAllCommit Iterate shared channelHash with dshash_seq_*, remove this backend from all channel entries in current database, clean up DSA memory and delete entries when empty. IsListeningOn Simplified to single hash_search() call on listenChannelsHash. asyncQueueUnregister Clear QUEUE_BACKEND_WAKEUP_PENDING flag and update assertion to check hash table instead of list. SignalBackends Rewrite to use targeted signaling instead of broadcast. Iterate pendingNotifyChannels, look up listeners per channel in shared channelHash. Implement direct advancement: advance idle backends positioned in [queueHeadBeforeWrite, queueHeadAfterWrite) without signaling. Use wakeupPending flag to prevent duplicate signals and respect isAdvancing flag to avoid interfering with concurrent position updates. AtAbort_Notify Use listenChannelsHash instead of listenChannels. asyncQueueReadAllNotifications Set isAdvancing flag and advancingPos before reading, clear isAdvancing after advancing position. asyncQueueProcessPageEntries Use listenChannelsHash instead of listenChannels. ProcessIncomingNotify Use listenChannelsHash instead of listenChannels. AddEventToPendingNotifies Build channelHashtab when notification count exceeds MIN_HASHABLE_NOTIFIES, enabling efficient extraction of unique channel names in PreCommit_Notify. ClearPendingActionsAndNotifies Also free pendingNotifyChannels. Functions Added --------------- asyncQueuePagePrecedes Inline function returning true if page p precedes page q (p < q). channelHashFunc Hash function for ChannelHashKey, combining hash of dboid and channel name using XOR. Required callback for dshash operations. initChannelHash Lazy initialization of shared dshash table mapping (dboid, channel) to listener arrays. First caller creates DSA area and dshash, stores handles in asyncQueueControl; subsequent callers attach using stored handles. initListenChannelsHash Lazy initialization of backend-local hash table (listenChannelsHash) for faster IsListeningOn() checks. ChannelHashPrepareKey Inline helper to construct ChannelHashKey. TESTING ======= The patch adds comprehensive isolation tests covering: 1. Subtransaction handling: - LISTEN in subtransaction with SAVEPOINT/RELEASE - LISTEN merge path (both outer and inner transactions) - ROLLBACK TO SAVEPOINT discarding pending actions 2. Notification deduplication: - Hash table duplicate detection with 17 notifications + 1 duplicate 3. Listener array growth: - Multiple listeners triggering ChannelEntry array expansion 4. Cross-session delivery: - Notifications from non-listening backend to listener in another session Total test sessions expanded from 3 to 7 to cover these scenarios. /Joel