MIME-Version: 1.0
References: 
 <CA+UBfa=vDV8wbmAV0pgrx-FuJh+x8YOW23vJ90Jzr=14rV+9jA@mail.gmail.com>
 <OS9PR01MB12149A4E7927072A215AEED69F565A@OS9PR01MB12149.jpnprd01.prod.outlook.com>
 <CA+UBfakmkdtauuRsOVXFqhFVJt0nTdEadx94tJn+qG0Pe8Wjfw@mail.gmail.com>
 <CAN4CZFM7FV0VTNkujD=Mb7tNa+jkmEfnX7carvj95fY6Tp11FQ@mail.gmail.com>
 <CA+UBfamW6NuuMMQTDRPDQ0a9fWN_u2OvjEF98u3CfYKTBcOZMw@mail.gmail.com>
 <CA+UBfa=Dv-2tLSEKHJ0YFFH8PCTHxnX9rtVZeV8gd8q1a-GmYA@mail.gmail.com>
 <CA+UBfa=PKdShpSTTTSHwXdGPZnm2rGMKPjERNOdS0SG9t9CT3Q@mail.gmail.com>
 <CABPTF7WVW2x4XitXttHwCamSZcBn=Q+wLjf+M+MuEbZSAxqdDw@mail.gmail.com>
In-Reply-To: 
 <CABPTF7WVW2x4XitXttHwCamSZcBn=Q+wLjf+M+MuEbZSAxqdDw@mail.gmail.com>
Reply-To: assam258@gmail.com
From: Henson Choi <assam258@gmail.com>
Date: Fri, 3 Apr 2026 15:58:39 +0900
Message-ID: 
 <CAAAe_zCxg2NTG_i1erLQQr8Wn+6SQ3EMOmp+N4J58Xxb21g2BQ@mail.gmail.com>
Subject: Re: [WIP] Pipelined Recovery
To: Xuneng Zhou <xunengzhou@gmail.com>, Imran Zaheer <imran.zhir@gmail.com>
Cc: Zsolt Parragi <zsolt.parragi@percona.com>,
 Jakub Wartak <jakub.wartak@enterprisedb.com>,
	"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>,
 pgsql-hackers <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="00000000000070612c064e88d910"
Archived-At: 
 <https://www.postgresql.org/message-id/CAAAe_zCxg2NTG_i1erLQQr8Wn%2B6SQ3EMOmp%2BN4J58Xxb21g2BQ%40mail.gmail.com>
Precedence: bulk

--00000000000070612c064e88d910
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Xuneng, Imran, and everyone,

> I'm curious how this approach differs from those previous efforts, and
> why those attempts ultimately did not land.


There is directly relevant prior art that may be worth looking at.
Koichi Suzuki presented parallel recovery at PGCon 2023 [1] and
published a detailed design on the PostgreSQL wiki [2] with a working
prototype on GitHub.

Koichi's approach is quite different from the current patch: instead of
pipelining decode, he parallelizes redo itself by dispatching WAL
records to block workers based on page identity. The key rule is that
for a given page, WAL records are applied in written order, but
different pages can be replayed in parallel by different workers.
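
To make the per-page ordering rule concrete, here is a minimal Python
sketch. It is not taken from Koichi's prototype: the hash-based
worker_for() mapping, NUM_WORKERS, and the (lsn, page) record shape are
all assumptions for illustration, and it only covers single-block
records.

```python
from collections import defaultdict
from zlib import crc32

NUM_WORKERS = 4  # hypothetical worker count, not from the design


def worker_for(page, num_workers=NUM_WORKERS):
    """Map a page identity (relation id, block number) to a worker index."""
    rel, blk = page
    # crc32 gives a stable value across runs; any stable hash would do.
    return crc32(f"{rel}:{blk}".encode()) % num_workers


def dispatch(records, num_workers=NUM_WORKERS):
    """Route single-block records to per-worker queues in WAL order."""
    queues = defaultdict(list)
    for lsn, page in records:
        queues[worker_for(page, num_workers)].append((lsn, page))
    return queues


# Toy WAL: (lsn, (relation id, block number)); page (16384, 0) appears twice.
records = [
    (100, (16384, 0)),
    (110, (16384, 1)),
    (120, (16384, 0)),
    (130, (16385, 7)),
]
queues = dispatch(records)
```

Since every record for a given page lands in the same queue, and each
queue is filled in WAL order, per-page ordering is preserved while
different pages can make progress independently.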

His design uses a dispatcher to route records to workers, with
synchronization needed for multi-block WAL records. One thing I
wondered is whether the dispatcher could be avoided entirely: if each
child simply reads the whole WAL stream on its own and skips blocks
that are not assigned to it, there would be no IPC and no need to
coordinate multi-block records across workers.
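
A rough sketch of that dispatcher-less idea, again in Python and purely
illustrative: owns(), replay_as_worker(), and the (lsn, [pages]) record
shape are hypothetical names of mine, not anything from the wiki design
or the prototype. Each worker scans the full stream and keeps only the
block references it owns.

```python
from zlib import crc32


def owns(worker_id, page, num_workers):
    """Return True if this worker is responsible for the given page."""
    rel, blk = page
    return crc32(f"{rel}:{blk}".encode()) % num_workers == worker_id


def replay_as_worker(worker_id, wal, num_workers):
    """Scan the whole WAL; collect the (lsn, page) pairs this worker applies."""
    applied = []
    for lsn, pages in wal:  # a record may reference several pages
        for page in pages:
            if owns(worker_id, page, num_workers):
                applied.append((lsn, page))
    return applied


wal = [
    (100, [(16384, 0)]),
    (110, [(16384, 1), (16384, 2)]),  # multi-block record
    (120, [(16384, 0)]),
]
per_worker = [replay_as_worker(w, wal, 3) for w in range(3)]
```

Every block reference is applied by exactly one worker, with no queue
between processes. The sketch assumes the blocks of a multi-block
record can be applied independently of each other and ignores records
without block references; whether that holds for real redo routines is
exactly the part I am unsure about.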

The hard problem he ran into was Hot Standby visibility: when index and
heap pages are replayed by different workers at different speeds,
concurrent queries can see inconsistent state. The wiki itself notes
the idea is to "use this when hot standby is disabled." As far as I
know, this was never submitted as a patch to hackers.

> It also raises an implicit question: what makes the current approach
> more promising, whether due to a simpler design or improved
> performance.

The two approaches target different bottlenecks. The current patch
pipelines WAL decoding, which keeps the redo path single-threaded
and avoids the Hot Standby visibility problem entirely.

One thing I am curious about in the current patch: WAL records are
already in a serialized format on disk. The producer decodes them and
then re-serializes into a different custom format for shm_mq. What is
the advantage of this second serialization format over simply passing
the raw WAL bytes after CRC validation and letting the consumer decode
directly? Offloading CRC to a separate core could still improve
throughput at the cost of higher total CPU usage, without needing the
custom format.
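
To show what I mean by passing raw bytes, here is a toy Python sketch
of the data flow. It is not the patch's code: the 8-byte header layout,
make_record(), and the queue.Queue standing in for shm_mq are
assumptions, and zlib.crc32 is plain CRC-32 whereas PostgreSQL WAL uses
CRC-32C. Python threads also do not run on separate cores, so this only
illustrates who does which step.

```python
import queue
import struct
import threading
import zlib

HEADER = struct.Struct("<II")  # toy header: payload length, payload CRC


def make_record(payload: bytes) -> bytes:
    """Build a toy record: header followed by the payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def producer(stream: bytes, out: queue.Queue) -> None:
    """Validate each record's CRC, then forward the raw bytes unchanged."""
    pos = 0
    while pos + HEADER.size <= len(stream):
        length, crc = HEADER.unpack_from(stream, pos)
        end = pos + HEADER.size + length
        raw = stream[pos:end]
        if zlib.crc32(raw[HEADER.size:]) != crc:
            break  # treat a bad CRC as the end of valid WAL
        out.put(raw)
        pos = end
    out.put(None)  # end-of-stream marker


def consumer(inp: queue.Queue) -> list:
    """Decode raw records; the CRC is not recomputed on this side."""
    decoded = []
    while (raw := inp.get()) is not None:
        length, _ = HEADER.unpack_from(raw)
        decoded.append(raw[HEADER.size:HEADER.size + length])
    return decoded


def run(stream: bytes) -> list:
    """Run producer and consumer concurrently over one stream."""
    q = queue.Queue()
    t = threading.Thread(target=producer, args=(stream, q))
    t.start()
    result = consumer(q)
    t.join()
    return result
```

The only thing crossing the queue is the original record bytes, so no
second serialization format is needed; the consumer still pays the
decode cost but skips the CRC.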

Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
a larger cost (Jakub's flamegraphs show BufferAlloc ->
GetVictimBuffer -> FlushBuffer dominating in both p0 and p1), but at
the expense of much harder concurrency problems.

Whether the decode pipelining ceiling is high enough, or whether the
redo parallelization complexity is tractable, seems like the central
design question for this area.

[1] https://www.pgcon.org/2023/schedule/session/392-parallel-recovery-in-postgresql/
[2] https://wiki.postgresql.org/wiki/Parallel_Recovery

Best regards,
Henson

--00000000000070612c064e88d910--