MIME-Version: 1.0
References: 
 <CA+hUKGJULUTkT2LpeHTSt3KHbJrYNBT-kj1-OhMRV_PnUQ_57A@mail.gmail.com>
 <CA+q6zcW0qh2aPKg0z58mp-Ba8avp7MWkS00ADrOOv=CBzJJMLA@mail.gmail.com>
 <CA+hUKGJLe-0E-wZ-is78CEHhjbC=ihMVCQLoN1dmD-j05s9qRg@mail.gmail.com>
 <pdrbpbvhscwckb6nqkfgoo7o63nnmeq7taaazgfi3bk627gb6p@rzcauoqabb3k>
 <CA+hUKG+P80oR7yy5-67uHqwBWr9rux69BkQF9fwz=f0LuDn3Rw@mail.gmail.com>
 <CA+hUKGKgVhLVS4Pne8cTR0f=tE8GtF4-CvY2HMDOJEm7w0iXfA@mail.gmail.com>
 <20260328095204.5tsq5bldugeumrtf@erthalion>
 <CA+hUKGLUZ51zkxUoBvANcLdvz8DjS_85=6985HE8gLuNtBxX7Q@mail.gmail.com>
 <jp7lhn2wng5mlhn3jvn4t6nmr23l6bnim5af3qslnoguevfqqz@d7cvksm62j4j>
 <CA+hUKGLp_yPztSoQkmRe1kFk1M5JNr7b+07cc+46AVg=VoKC-Q@mail.gmail.com>
 <adaz4sd2x64uoei2nmd4pinvothusd7uwllnilq6msfp4wytrw@npedbcvg3llt>
In-Reply-To: <adaz4sd2x64uoei2nmd4pinvothusd7uwllnilq6msfp4wytrw@npedbcvg3llt>
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 8 Apr 2026 11:18:51 +1200
Message-ID: 
 <CA+hUKGJi6L2ftuTiKw2W4AAah6Uivs4JGiq58A7MjePbXpc4BA@mail.gmail.com>
Subject: Re: Automatically sizing the IO worker pool
To: Andres Freund <andres@anarazel.de>
Cc: Dmitry Dolgov <9erthalion6@gmail.com>,
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BhUKGJi6L2ftuTiKw2W4AAah6Uivs4JGiq58A7MjePbXpc4BA%40mail.gmail.com>
Precedence: bulk

On Wed, Apr 8, 2026 at 7:01=E2=80=AFAM Andres Freund <andres@anarazel.de> w=
rote:
> The if (worker =3D=3D -1) is done for every to be submitted IO.  If there=
 are no
> idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM =
it
> should just be:
>
>                 for (int i =3D 0; i < num_staged_ios; ++i)
>                 {
>                         Assert(!pgaio_worker_needs_synchronous_execution(=
staged_ios[i]));
>                         if (!pgaio_worker_submission_queue_insert(staged_=
ios[i]))
>                         {
>                                 /*
>                                  * Do the rest synchronously. If the queu=
e is full, give up
>                                  * and do the rest synchronously. We're h=
olding an exclusive
>                                  * lock on the queue so nothing can consu=
me entries.
>                                  */
>                                 synchronous_ios =3D &staged_ios[i];
>                                 nsync =3D (num_staged_ios - i);
>
>                                 break;
>                         }
>                 }
>
>                 /* Choose one worker to wake for this batch. */
>                 if (worker =3D=3D -1)
>                         worker =3D pgaio_worker_choose_idle(-1);

Well I didn't want to wake a worker if we'd failed to enqueue
anything.  Ahh, I could put it there and test nsync.  Or I guess I
could just do it anyway.  Considering that.

> > > I think both 'wakeups" and "ios" are a bit too generically named. Bas=
ed on the
> > > names I have no idea what this heuristic might be.
> >
> > I have struggled to name them.  Does wakeup_count and io_count help?
>
> hist_wakeups, hist_ios?

Thanks, that's a good name.

> > No, we only set it if it isn't already set (like a latch), and only
> > send a pmsignal when we set it (like a latch), and the postmaster only
> > clears it if it can start a worker (unlike a latch).  That applies in
> > general, not just when we hit the cap of io_max_workers: while the
> > postmaster is waiting for launch interval to expire, it will leave the
> > flag set, suppressed for 100ms or whatever, and the in the special
> > case of io_max_workers, for as long as the count remains that high.
>
> I'm quite certain that's not how it actually ended up working with the pr=
ior
> version and the benchmark I showed, there indeed were a lot of requests t=
o
> postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the =
old
> name already) very frequently clears the flag, just for it to be immediat=
ely
> set again.
>
>
> Yep, still happens, does require the max to be smaller than 32 though.
>
> While a lot of IO is happening, no new connections being started, and wit=
h
> 1781562 being postmaster's pid:
>
> perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
>
>  Performance counter stats for process id '1781562':
>
>              2,790      raw_syscalls:sys_enter
>
>        1.001872667 seconds time elapsed
>
>              2,814      raw_syscalls:sys_enter
>
>        1.001983049 seconds time elapsed
>
>              3,036      raw_syscalls:sys_enter
>
>        1.001705850 seconds time elapsed
>
>              2,982      raw_syscalls:sys_enter
>
>        1.001881364 seconds time elapsed
>
>
> I think it may need a timestamp in the shared state to not allow another
> postmaster wake until some time has elapsed, or something.

Hnng.  Studying...

> > I should have made it clearer that that's a secondary condition.  The
> > primary condition is: a worker wanted to wake another worker, but
> > found that none were idle.  Unfortunately the whole system is a bit
> > too asynchronous for that to be a reliable cue on its own.  So, I also
> > check if the queue appears to be (1) obviously growing: that's clearly
> > too long and must be introducing latency, or (2) varying "too much".
> > Which I detect in exactly the same way.
> >
> > Imagine a histogram that look like this:
> >
> > LOG:  depth 00: 7898
> > LOG:  depth 01: 1630
> > LOG:  depth 02: 308
> > LOG:  depth 03: 93
> > LOG:  depth 04: 40
> > LOG:  depth 05: 19
> > LOG:  depth 06: 6
> > LOG:  depth 07: 4
> > LOG:  depth 08: 0
> > LOG:  depth 09: 1
> > LOG:  depth 10: 1
> > LOG:  depth 11: 0
> > LOG:  depth 12: 0
> > LOG:  depth 13: 0
> >
> > If you're failing to find idle workers to wake up AND our managic
> > threshold is hit by something in that long tail, then it'll call for
> > backup.  Of course I'm totally sidestepping a lot of queueing theory
> > maths and just saying "I'd better be able to find an idle worker when
> > I want to" and if not, "there had better not be any outliers that
> > reach this far".
> >
> > I've written a longer explanation in a long comment.  Including a
> > little challenge for someone to do better with real science and maths.
> > I hope it's a bit clearer at least.
>
> Definitely good to have that comment.  Have to ponder it for a bit.

Let me try again.

Our goal is simple: process every IO immediately.  We have immediate
feedback that is simple: there's an IO in the queue and there is no
idle worker.  The only action we can take is simple: add one more
worker.  So we don't need to suffer through the maths required to
figure out the ideal k for our M/G/k queue system (I think that's what
we have?) or any of the inputs that would require*.  The problem is
that on its own, the test triggered far too easily because a worker
that is not marked idle might in fact be just about to pick up that IO
on the one the one hand, and because there might be rare
spikes/clustering on the other, so I cooled it off a bit by
additionally testing if the queue appears to be growing or spiking
beyond some threshold.  I think it's OK to let the queue grow a bit
before we are triggered anyway, so the precise value used doesn't seem
too critical.  Someone might be able to come up with a more defensible
value, but in the end I just wanted a value that isn't triggered by
the outliers I see in real systems that are keeping up.  We could tune
it lower and overshoot more, but this setting seems to work pretty
well.  It doesn't seem likely that a real system could achieve a
steady state that is introducing latency but isn't increasing over
time, and pool size adjustments are bound to lag anyway.

* It's probably quite hard for call centres to figure out the number
of agents required to make you wait for a certain length of time, but
it's easy to know if you had to wait and you wish they had more!

> I've not again looked through the details, but based on a relatively shor=
t
> experiment, the one problematic thing I see is the frequent postmaster
> requests.

Looking into that...