From: Henson Choi
Reply-To: assam258@gmail.com
Date: Fri, 3 Apr 2026 15:58:39 +0900
Subject: Re: [WIP] Pipelined Recovery
To: Xuneng Zhou, Imran Zaheer
Cc: Zsolt Parragi, Jakub Wartak, "Hayato Kuroda (Fujitsu)", pgsql-hackers

Hi Xuneng, Imran, and everyone,

> I’m curious how this approach differs from those previous efforts, and
> why those attempts ultimately did not land.

There is directly relevant prior art that may be worth looking at.
Koichi Suzuki presented parallel recovery at PGCon 2023 [1] and
published a detailed design on the PostgreSQL wiki [2] with a working
prototype on GitHub.

Koichi's approach is quite different from the current patch: instead of
pipelining decode, he parallelizes redo itself by dispatching WAL
records to block workers based on page identity. The key rule is that
for a given page, WAL records are applied in written order, but
different pages can be replayed in parallel by different workers.
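
To make the per-page ordering rule concrete, here is a toy Python
sketch. It is not taken from Koichi's prototype: the record layout, the
page names, and the CRC32-mod-N routing are all assumptions chosen for
illustration.

```python
import zlib
from collections import defaultdict

# Toy WAL: (lsn, page_id) pairs in the order they were written.
# page_id stands in for a real page identity; the values are made up.
WAL = [(1, "heap:7"), (2, "idx:3"), (3, "heap:7"), (4, "heap:9"), (5, "idx:3")]

def dispatch(records, n_workers):
    """Route each record to a worker chosen by its page identity."""
    queues = defaultdict(list)
    for lsn, page in records:
        # Deterministic stand-in for a hash of the page identity.
        worker = zlib.crc32(page.encode()) % n_workers
        queues[worker].append((lsn, page))
    return queues

queues = dispatch(WAL, n_workers=2)
for worker in sorted(queues):
    print(worker, queues[worker])
```

Because a page always maps to the same worker and each queue is filled
in WAL order, records for one page stay in written order while
different pages can land on different workers.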

His design uses a dispatcher to route records to workers, with
synchronization needed for multi-block WAL records. One thing I
wondered is whether the dispatcher could be avoided entirely: if each
child simply reads the whole WAL stream on its own and skips blocks
that are not assigned to it, there would be no IPC and no need to
coordinate multi-block records across workers.
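
A rough sketch of that dispatcher-less idea, again as a toy Python
model. The static owner() assignment and the record layout are
hypothetical, and the sketch only shows the filtering step; it assumes
each block reference of a multi-block record can be applied on its own,
which is exactly the part that would need checking.

```python
import zlib

# Toy WAL: each record carries an LSN and the pages it touches.
WAL = [
    (1, ["heap:7"]),
    (2, ["idx:3", "heap:7"]),  # multi-block record
    (3, ["heap:9"]),
    (4, ["idx:3", "idx:4"]),   # multi-block record
]

def owner(page, n_workers):
    """Static page-to-worker assignment (hypothetical: CRC32 mod N)."""
    return zlib.crc32(page.encode()) % n_workers

def worker_scan(worker_id, n_workers, records):
    """Read the whole stream; keep only the block references this worker owns."""
    applied = []
    for lsn, pages in records:
        for page in pages:
            if owner(page, n_workers) == worker_id:
                applied.append((lsn, page))  # stand-in for redo on that page
    return applied

N = 2
per_worker = [worker_scan(i, N, WAL) for i in range(N)]
for i, part in enumerate(per_worker):
    print(i, part)
```

Every worker pays the cost of reading and decoding the full stream, so
this trades IPC for duplicated decode work.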

The hard problem he ran into was Hot Standby visibility: when index and
heap pages are replayed by different workers at different speeds,
concurrent queries can see inconsistent state. The wiki itself notes
the idea is to "use this when hot standby is disabled." As far as I
know, this was never submitted as a patch to hackers.
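
As a toy illustration of that visibility problem (plain Python
dictionaries standing in for heap and index pages; nothing here
reflects real PostgreSQL structures), consider the index worker
getting ahead of the heap worker:

```python
# State visible to a hot-standby query at a given instant.
heap = {}   # tid -> row, replayed by one worker
index = {}  # key -> tid, replayed by another worker

# Written order: the heap insert precedes the index insert.
WAL = [(1, "heap", ("t1", "row-1")), (2, "index", ("k1", "t1"))]

def replay(record):
    """Apply one record to the structure it targets."""
    _, kind, (key, value) = record
    (heap if kind == "heap" else index)[key] = value

def lookup(key):
    """Index lookup followed by a heap fetch, as a query would do."""
    tid = index.get(key)
    if tid is None:
        return None
    return heap.get(tid, "DANGLING")

replay(WAL[1])        # index worker runs ahead
mid = lookup("k1")    # query arrives between the two replays
replay(WAL[0])        # heap worker catches up
final = lookup("k1")
print(mid, final)
```

The first lookup finds an index entry whose heap row has not been
replayed yet; once both workers catch up, the state is consistent
again.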

> It also raises an implicit question: what makes the current approach
> more promising -- whether due to a simpler design or improved
> performance.

The two approaches target different bottlenecks. The current patch
parallelizes WAL decoding, which keeps the redo path single-threaded
and avoids the Hot Standby visibility problem entirely.
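
The shape of that pipeline, as I understand it, can be sketched with a
Python thread and a bounded queue. This is only a model: the patch
uses a separate process and shm_mq, and the record encoding below is
made up.

```python
import queue
import threading

RAW_WAL = [b"1:heap:7", b"2:idx:3", b"3:heap:7"]  # made-up record encoding

def decode(raw):
    """Turn raw bytes into an (lsn, page) tuple."""
    lsn, page = raw.decode().split(":", 1)
    return int(lsn), page

def producer(raw_records, q):
    """Decode ahead of redo and hand records over in WAL order."""
    for raw in raw_records:
        q.put(decode(raw))
    q.put(None)  # end-of-stream sentinel

def redo_loop(q):
    """Single-threaded redo: apply records strictly in arrival order."""
    applied = []
    while (rec := q.get()) is not None:
        applied.append(rec)  # stand-in for applying the record
    return applied

q = queue.Queue(maxsize=2)  # bounded, like a fixed-size message queue
t = threading.Thread(target=producer, args=(RAW_WAL, q))
t.start()
applied = redo_loop(q)
t.join()
print(applied)
```

Only decoding overlaps with redo here; pages are still modified by one
consumer in WAL order, which is why the visibility problem does not
arise.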

One thing I am curious about in the current patch: WAL records are
already in a serialized format on disk. The producer decodes them and
then re-serializes into a different custom format for shm_mq. What is
the advantage of this second serialization format over simply passing
the raw WAL bytes after CRC validation and letting the consumer decode
directly? Offloading CRC to a separate core could still improve
throughput at the cost of higher total CPU usage, without needing the
custom format.
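
What I have in mind is roughly the split below, shown as a Python
sketch. The [CRC][payload] framing is invented for the example, and
zlib.crc32 is only a stand-in for the checksum WAL actually uses.

```python
import struct
import zlib

def make_record(payload: bytes) -> bytes:
    """Frame a payload as [4-byte CRC][payload]; not the real WAL layout."""
    return struct.pack("<I", zlib.crc32(payload)) + payload

def validate(raw: bytes) -> bool:
    """Producer side: verify the checksum without decoding the payload."""
    (crc,) = struct.unpack_from("<I", raw)
    return zlib.crc32(raw[4:]) == crc

def decode(raw: bytes):
    """Consumer side: decode raw bytes the producer already validated."""
    lsn, page = raw[4:].decode().split(":", 1)
    return int(lsn), page

good = make_record(b"42:heap:7")
bad = good[:-1] + b"8"  # corrupt the last payload byte

for raw in (good, bad):
    if validate(raw):
        print("forward raw bytes ->", decode(raw))
    else:
        print("stop: checksum mismatch")
```

The producer never builds a second representation; it only decides
whether the original bytes are safe to hand over.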

Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
a larger cost -- Jakub's flamegraphs show BufferAlloc ->
GetVictimBuffer -> FlushBuffer dominating in both p0 and p1 -- but at
the expense of much harder concurrency problems.

Whether the decode pipelining ceiling is high enough, or whether the
redo parallelization complexity is tractable, seems like the central
design question for this area.

[1] https://www.pgcon.org/2023/schedule/session/392-parallel-recovery-in-postgresql/
[2] https://wiki.postgresql.org/wiki/Parallel_Recovery

Best regards,
Henson