Re: Streamify more code paths

From: Xuneng Zhou <[email protected]>
To: Andres Freund <[email protected]>
Cc: Michael Paquier <[email protected]>
Cc: pgsql-hackers <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Subject: Re: Streamify more code paths
Date: Wed, 11 Mar 2026 23:11:23 +0800
Message-ID: <CABPTF7X3RBkmOnQAoLbK-tr6o-+27fNpKPgzYZHhQCuYbP=rGA@mail.gmail.com>
In-Reply-To: <CABPTF7XFEOHpbju_pjCFHDffP_rWJU-405c6aoQdx4JjCOBimA@mail.gmail.com>
References: <CABPTF7VSa5L=k6ONVUZHfRrO2Y2_iYz6npWj0Na69RoCvSevpQ@mail.gmail.com>
	<CABPTF7V3+QGC+0W9ERCcAY14jq_w_XvmwrRs9vXbi_oqv4FnTQ@mail.gmail.com>
	<CABPTF7VyePb8O-WDgs2hCCXYhZzGzdjg0N3NkxojZ=ke4SB3pA@mail.gmail.com>
	<CAN55FZ39HSsXKTSi66ASq+i4Ed5FuGXD11hmJ+8c0F0O0+ozew@mail.gmail.com>
	<CABPTF7Vd4JWSHi9N7pGTzn6xmOdtAToCe1NGbZAH8U9_mXOqpw@mail.gmail.com>
	<CABPTF7W-f_zPN442FCp4Xaopi721oDmGYimq=VhAk=F7jwYZDQ@mail.gmail.com>
	<CABPTF7VUaRnvsXqa+628YkuR4oPVRr1mR2seXTkxabfiqQ3NHw@mail.gmail.com>
	<CABPTF7VtSYmC5LZSnkJWYn9PCkxgOJd9QbtAM79qftBK-fbA4w@mail.gmail.com>
	<CABPTF7UVCkub6jFXVk-qrYd4xjgiwRt1FTFL2=rBVV9SYcgfkQ@mail.gmail.com>
	<[email protected]>
	<dmf5ladi2amq656myv7zjl4pj4u3v2cp3azteliauifxizljej@bmwabkp5hdpi>
	<CABPTF7XFEOHpbju_pjCFHDffP_rWJU-405c6aoQdx4JjCOBimA@mail.gmail.com>

On Wed, Mar 11, 2026 at 9:37 AM Xuneng Zhou <[email protected]> wrote:
>
> Hi Andres,
>
> On Wed, Mar 11, 2026 at 7:04 AM Andres Freund <[email protected]> wrote:
> >
> > Hi,
> >
> > On 2026-03-10 19:28:29 +0900, Michael Paquier wrote:
> > > On Tue, Mar 10, 2026 at 02:06:12PM +0800, Xuneng Zhou wrote:
> > > > Here’s v5 of the patchset. The wal_logging_large patch has been
> > > > removed, as no performance gains were observed in the benchmark runs.
> > >
> > > Looking at the numbers you are posting, it is harder to get excited
> > > about the hash, gin, bloom_vacuum and wal_logging.
> >
> > It's perhaps worth emphasizing that, to allow real world usage of direct IO,
> > we'll need streaming implementation for most of these. Also, on windows the OS
> > provided readahead is ... not aggressive, so you'll hit IO stalls much more
> > frequently than you'd on linux (and some of the BSDs).
> >
> > It might be a good idea to run the benchmarks with debug_io_direct=data.
> > That'll make them very slow, since the write side doesn't yet use AIO and thus
> > will do a lot of synchronous writes, but it should still allow to evaluate the
> > gains from using read stream.
> >
> >
> > The other thing that's kinda important to evaluate read streams is to test on
> > higher latency storage, even without direct IO.  Many workloads are not at all
> > benefiting from AIO when run on a local NVMe SSD with < 10us latency, but are
> > severely IO bound when run on a cloud storage disk with 0.5ms - 4ms latency.
> >
> >
> > To be able to test such higher latencies locally, I've found it quite useful
> > to use dm_delay above a fast disk. See [1].
>
> Thanks for the tips! I currently don’t have access to a machine or
> cloud instance with slower SSDs or HDDs that have higher latency. I’ll
> try running the benchmark with debug_io_direct=data and dm_delay, as
> you suggested, to see if the results vary.
>
> >
> > > The worker method seems more efficient, may show that we are out of noise
> > > level.
> >
> > I think that's more likely to show that memory bandwidth, probably due to
> > checksum computations, is a factor. The memory copy (from the kernel page
> > cache, with buffered IO) and the checksum computations (when checksums are
> > enabled) are parallelized by worker, but not by io_uring.
> >
> >
> > Greetings,
> >
> > Andres Freund
> >
> >
> > [1]
> >
> >   https://docs.kernel.org/admin-guide/device-mapper/delay.html
> >
> >   Assuming /dev/md0 is mounted to /srv, and a delay of 1ms should be
> >   introduced for it:
> >
> >   umount /srv && dmsetup create delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 1" /dev/md0  && mount /dev/mapper/delayed /srv/
> >
> >   To update the amount of delay to 3ms the following can be used:
> >   dmsetup suspend delayed && dmsetup reload delayed --table "0 $(blockdev --getsz /dev/md0) delay /dev/md0 0 3" /dev/md0 && dmsetup resume delayed
> >
> >   (I will often just update the delay to 0 for comparison runs, as that
> >   doesn't require remounting)
>

With debug_io_direct=data and dm_delay, the results look quite promising!

medium size / io_uring
gin_vacuum_medium  base=1619.9ms  patch=301.8ms  5.37x (81.4%)  (reads=1571→947, io_time=1524.86→207.48ms)
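
As a sanity check on the figures above, here is a minimal sketch of how the two derived numbers can be recomputed from the base and patched runtimes. It assumes that "5.37x" is base divided by patch and that "81.4%" is the relative reduction in runtime; that is a reading of the output format, not something taken from the benchmark script. The function names are hypothetical.

```python
# Recompute the derived figures from the reported runtimes.
# Assumption: speedup = base / patch, reduction = (base - patch) / base.

def speedup(base_ms: float, patch_ms: float) -> float:
    """How many times faster the patched run is than the base run."""
    return base_ms / patch_ms


def reduction_pct(base_ms: float, patch_ms: float) -> float:
    """Relative runtime saved by the patch, as a percentage of base."""
    return (base_ms - patch_ms) / base_ms * 100.0


# Runtimes reported for gin_vacuum_medium with io_uring.
base_ms = 1619.9
patch_ms = 301.8

print(f"speedup:   {speedup(base_ms, patch_ms):.2f}x")
print(f"reduction: {reduction_pct(base_ms, patch_ms):.1f}%")
```

The same two helpers can be applied to the io_time pair (1524.86ms and 207.48ms) to see how much of the gain comes from time spent waiting on reads.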

The average runtime increases significantly once the artificial
device delay is added, so it will take some time to complete all the
test runs. I was also busy with something else today... Once the runs
are finished, I’ll share the results and the script to reproduce them.

-- 
Best,
Xuneng
