Re: [WIP] Pipelined Recovery

From: Xuneng Zhou <[email protected]>
To: Imran Zaheer <[email protected]>
To: [email protected]
Cc: Zsolt Parragi <[email protected]>
Cc: Jakub Wartak <[email protected]>
Cc: Hayato Kuroda (Fujitsu) <[email protected]>
Cc: pgsql-hackers <[email protected]>
Subject: Re: [WIP] Pipelined Recovery
Date: Wed, 22 Apr 2026 17:43:56 +0800
Message-ID: <CABPTF7XABSSwUPbnS+UE9OyeH-z3ihmdp9tOt3UJ4XcWZkE1DA@mail.gmail.com> (raw)
In-Reply-To: <CA+UBfakz7G5FH8PjxWyFLmF+sWdqMVcvQRRM0vURmznafqOjQQ@mail.gmail.com>
References: <CA+UBfa=vDV8wbmAV0pgrx-FuJh+x8YOW23vJ90Jzr=14rV+9jA@mail.gmail.com>
	<OS9PR01MB12149A4E7927072A215AEED69F565A@OS9PR01MB12149.jpnprd01.prod.outlook.com>
	<CA+UBfakmkdtauuRsOVXFqhFVJt0nTdEadx94tJn+qG0Pe8Wjfw@mail.gmail.com>
	<CAN4CZFM7FV0VTNkujD=Mb7tNa+jkmEfnX7carvj95fY6Tp11FQ@mail.gmail.com>
	<CA+UBfamW6NuuMMQTDRPDQ0a9fWN_u2OvjEF98u3CfYKTBcOZMw@mail.gmail.com>
	<CA+UBfa=Dv-2tLSEKHJ0YFFH8PCTHxnX9rtVZeV8gd8q1a-GmYA@mail.gmail.com>
	<CA+UBfa=PKdShpSTTTSHwXdGPZnm2rGMKPjERNOdS0SG9t9CT3Q@mail.gmail.com>
	<CABPTF7WVW2x4XitXttHwCamSZcBn=Q+wLjf+M+MuEbZSAxqdDw@mail.gmail.com>
	<CAAAe_zCxg2NTG_i1erLQQr8Wn+6SQ3EMOmp+N4J58Xxb21g2BQ@mail.gmail.com>
	<CA+UBfa=qDfWB90w5AsmX4f3PbeeM++GbaoVagd9ff-DKQDLvWA@mail.gmail.com>
	<CA+UBfakz7G5FH8PjxWyFLmF+sWdqMVcvQRRM0vURmznafqOjQQ@mail.gmail.com>

Hi Henson, Imran,

On Wed, Apr 8, 2026 at 7:14 PM Imran Zaheer <[email protected]> wrote:
>
> Hi
>
> I am uploading the new version with the following fixes
>
> * Rebased version.
> * Skip serialization of decoded records. As pointed out by Henson,
> there was no need to serialize the records again for the shm_mq.
> We can simply pass the contiguous bytes, with minor pointer fixups,
> to the shm_mq.
>
> This time I am uploading the benchmarking results to Drive and
> attaching the link here. Otherwise my mail will get held for
> moderation (my guess is that the overall attachment size is greater
> than 1MB).
>
> I am still not sure whether my testing approach is good enough,
> because sometimes I am not able to get the same performance
> improvement with the pgbench builtin scripts as I got with the
> custom SQL scripts. Maybe pgbench is not creating enough WAL to
> test on, or maybe I am missing something.
>
> Benchmarks: https://drive.google.com/file/d/1Y4SYVnrFEQRE5T2r87rrTr7SWC9m19Si/view?usp=sharing
>
> Thanks & Regards
> Imran Zaheer
>
> On Wed, Apr 8, 2026 at 1:46 PM Imran Zaheer <[email protected]> wrote:
> >
> > >
> > > Hi Xuneng, Imran, and everyone,
> > >
> >
> > Hi Henson and Xuneng.
> >
> > Thanks for explaining the approaches to Xuneng.
> >
> > >
> > > The two approaches target different bottlenecks. The current patch
> > > parallelizes WAL decoding, which keeps the redo path single-threaded
> > > and avoids the Hot Standby visibility problem entirely.
> > >
> >
> > You are right, both approaches target different bottlenecks. The
> > pipeline patch aims to improve overall CPU throughput and to save
> > CPU time by offloading the steps we can safely do in parallel
> > without causing synchronization problems.
> >
> > > One thing I am curious about in the current patch: WAL records are
> > > already in a serialized format on disk. The producer decodes them and
> > > then re-serializes into a different custom format for shm_mq. What is
> > > the advantage of this second serialization format over simply passing
> > > the raw WAL bytes after CRC validation and letting the consumer decode
> > > directly? Offloading CRC to a separate core could still improve
> > > throughput at the cost of higher total CPU usage, without needing the
> > > custom format.
> > >
> >
> > Thanks. You are right, there was no need to serialize the decoded
> > record again. I was not aware that we already have contiguous bytes
> > in memory. In my next patch I will remove this extra serialization
> > step.
> >
> > > Koichi's approach parallelizes redo (buffer I/O) itself, which attacks
> > > a larger cost — Jakub's flamegraphs show BufferAlloc ->
> > > GetVictimBuffer -> FlushBuffer dominating in both p0 and p1 — but at
> > > the expense of much harder concurrency problems.
> > >
> > > Whether the decode pipelining ceiling is high enough, or whether the
> > > redo parallelization complexity is tractable, seems like the central
> > > design question for this area.
> >
> > I still have to investigate the problem related to `GetVictimBuffer`
> > that Jakub mentioned. But I was looking at how we can safely offload
> > the work done by `XLogReadBufferForRedoExtended` to a separate
> > pipeline worker, or maybe we can try prefetching the buffer header so
> > the main redo loop doesn't have to spend time getting the buffer.

Thanks for your clarification! I'll try to review this patch later.

--
Best,
Xuneng
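
For readers following the thread, the idea Henson raised (the producer
validates the CRC of the raw record bytes and passes them on unchanged,
and the consumer decodes them) can be sketched roughly as below. This is
only an illustrative sketch, not code from the patch: `RawHeader`,
`ByteQueue`, `producer_send` and `consumer_receive` are hypothetical
names, the framing is not PostgreSQL's XLogRecord layout, and the toy
in-process queue merely stands in for shm_mq. The "decode" step is
reduced to copying the payload out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing for illustration; NOT PostgreSQL's XLogRecord layout. */
typedef struct
{
    uint32_t len;               /* payload length in bytes */
    uint32_t crc;               /* CRC-32C of the payload */
} RawHeader;

#define QUEUE_BYTES 256

/* Toy single-process stand-in for a shared-memory queue such as shm_mq. */
typedef struct
{
    uint8_t buf[QUEUE_BYTES];
    size_t  head;               /* next byte to read */
    size_t  tail;               /* next byte to write */
} ByteQueue;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Producer side: check space, validate the CRC, then enqueue the raw bytes
 * unchanged. Returns 0 on success, -1 on CRC mismatch, -2 if the queue is full.
 */
static int
producer_send(ByteQueue *q, const RawHeader *hdr, const uint8_t *payload)
{
    if (hdr->len > QUEUE_BYTES - sizeof *hdr ||
        q->tail > QUEUE_BYTES - sizeof *hdr - hdr->len)
        return -2;
    if (crc32c(payload, hdr->len) != hdr->crc)
        return -1;

    memcpy(q->buf + q->tail, hdr, sizeof *hdr);
    memcpy(q->buf + q->tail + sizeof *hdr, payload, hdr->len);
    q->tail += sizeof *hdr + hdr->len;
    return 0;
}

/*
 * Consumer side: dequeue one record without re-checking the CRC.
 * Returns the payload length, -1 if the queue is empty, or -2 if the
 * output buffer is too small.
 */
static int
consumer_receive(ByteQueue *q, uint8_t *out, size_t out_cap)
{
    RawHeader hdr;

    if (q->head == q->tail)
        return -1;
    assert(q->tail - q->head >= sizeof hdr);
    memcpy(&hdr, q->buf + q->head, sizeof hdr);
    if (hdr.len > out_cap)
        return -2;

    memcpy(out, q->buf + q->head + sizeof hdr, hdr.len);
    q->head += sizeof hdr + hdr.len;
    return (int) hdr.len;
}
```

In this sketch the CRC cost sits entirely in `producer_send`, so a
consumer running on another core would only pay for the copy and the
decode. A real implementation would need proper cross-process
synchronization, which this single-process toy deliberately leaves out.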
