From: Thomas Munro
Date: Wed, 8 Apr 2026 11:18:51 +1200
Subject: Re: Automatically sizing the IO worker pool
To: Andres Freund
Cc: Dmitry Dolgov <9erthalion6@gmail.com>, PostgreSQL Hackers
References: <20260328095204.5tsq5bldugeumrtf@erthalion>

On Wed, Apr 8, 2026 at 7:01 AM Andres Freund wrote:
> The if (worker == -1) is done for every to be submitted IO.  If there are no
> idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM it
> should just be:
>
>     for (int i = 0; i < num_staged_ios; ++i)
>     {
>         Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
>         if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
>         {
>             /*
>              * Do the rest synchronously. If the queue is full, give up
>              * and do the rest synchronously. We're holding an exclusive
>              * lock on the queue so nothing can consume entries.
>              */
>             synchronous_ios = &staged_ios[i];
>             nsync = (num_staged_ios - i);
>
>             break;
>         }
>     }
>
>     /* Choose one worker to wake for this batch. */
>     if (worker == -1)
>         worker = pgaio_worker_choose_idle(-1);

Well, I didn't want to wake a worker if we'd failed to enqueue
anything.  Ahh, I could put it there and test nsync.  Or I guess I
could just do it anyway.  Considering that.

> > > I think both "wakeups" and "ios" are a bit too generically named.  Based on the
> > > names I have no idea what this heuristic might be.
> >
> > I have struggled to name them.  Does wakeup_count and io_count help?
>
> hist_wakeups, hist_ios?

Thanks, those are good names.

> > No, we only set it if it isn't already set (like a latch), and only
> > send a pmsignal when we set it (like a latch), and the postmaster only
> > clears it if it can start a worker (unlike a latch).  That applies in
> > general, not just when we hit the cap of io_max_workers: while the
> > postmaster is waiting for launch interval to expire, it will leave the
> > flag set, suppressed for 100ms or whatever, and then in the special
> > case of io_max_workers, for as long as the count remains that high.
>
> I'm quite certain that's not how it actually ended up working with the prior
> version and the benchmark I showed, there indeed were a lot of requests to
> postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the old
> name already) very frequently clears the flag, just for it to be immediately
> set again.
>
>
> Yep, still happens, does require the max to be smaller than 32 though.
>
> While a lot of IO is happening, no new connections being started, and with
> 1781562 being postmaster's pid:
>
> perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
>
>  Performance counter stats for process id '1781562':
>
>              2,790      raw_syscalls:sys_enter
>
>        1.001872667 seconds time elapsed
>
>              2,814      raw_syscalls:sys_enter
>
>        1.001983049 seconds time elapsed
>
>              3,036      raw_syscalls:sys_enter
>
>        1.001705850 seconds time elapsed
>
>              2,982      raw_syscalls:sys_enter
>
>        1.001881364 seconds time elapsed
>
>
> I think it may need a timestamp in the shared state to not allow another
> postmaster wake until some time has elapsed, or something.

Hnng.  Studying...

> > I should have made it clearer that that's a secondary condition.  The
> > primary condition is: a worker wanted to wake another worker, but
> > found that none were idle.  Unfortunately the whole system is a bit
> > too asynchronous for that to be a reliable cue on its own.  So, I also
> > check if the queue appears to be (1) obviously growing: that's clearly
> > too long and must be introducing latency, or (2) varying "too much".
> > Which I detect in exactly the same way.
> >
> > Imagine a histogram that looks like this:
> >
> > LOG:  depth 00: 7898
> > LOG:  depth 01: 1630
> > LOG:  depth 02: 308
> > LOG:  depth 03: 93
> > LOG:  depth 04: 40
> > LOG:  depth 05: 19
> > LOG:  depth 06: 6
> > LOG:  depth 07: 4
> > LOG:  depth 08: 0
> > LOG:  depth 09: 1
> > LOG:  depth 10: 1
> > LOG:  depth 11: 0
> > LOG:  depth 12: 0
> > LOG:  depth 13: 0
> >
> > If you're failing to find idle workers to wake up AND our magic
> > threshold is hit by something in that long tail, then it'll call for
> > backup.  Of course I'm totally sidestepping a lot of queueing theory
> > maths and just saying "I'd better be able to find an idle worker when
> > I want to" and if not, "there had better not be any outliers that
> > reach this far".
> >
> > I've written a longer explanation in a long comment.  Including a
> > little challenge for someone to do better with real science and maths.
> > I hope it's a bit clearer at least.
>
> Definitely good to have that comment. Have to ponder it for a bit.

Let me try again.  Our goal is simple: process every IO immediately.
We have immediate feedback that is simple: there's an IO in the queue
and there is no idle worker.  The only action we can take is simple:
add one more worker.  So we don't need to suffer through the maths
required to figure out the ideal k for our M/G/k queue system (I think
that's what we have?) or any of the inputs that would require*.

The problem is that, on its own, the test triggered far too easily:
on the one hand, a worker that is not marked idle might in fact be
just about to pick up that IO, and on the other, there might be rare
spikes/clustering.  So I cooled it off a bit by additionally testing
whether the queue appears to be growing or spiking beyond some
threshold.  I think it's OK to let the queue grow a bit before we are
triggered anyway, so the precise value used doesn't seem too critical.
Someone might be able to come up with a more defensible value, but in
the end I just wanted a value that isn't triggered by the outliers I
see in real systems that are keeping up.  We could tune it lower and
overshoot more, but this setting seems to work pretty well.  It
doesn't seem likely that a real system could achieve a steady state
that is introducing latency but isn't increasing over time, and pool
size adjustments are bound to lag anyway.

* It's probably quite hard for call centres to figure out the number
of agents required to make you wait for a certain length of time, but
it's easy to know if you had to wait and you wish they had more!

> I've not again looked through the details, but based on a relatively short
> experiment, the one problematic thing I see is the frequent postmaster
> requests.

Looking into that...
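
To make the two ideas discussed above concrete, here is a minimal
sketch in C.  It is not code from the patch: every name in it
(PoolState, should_request_grow, request_grow, cancel_grow) and both
constants are hypothetical, and the timestamp check is only one
possible reading of the "timestamp in the shared state" suggestion.
It shows (1) the trigger, "no idle worker AND queue depth beyond a
threshold", and (2) a latch-like flag whose signal is additionally
rate-limited, so that clearing and re-setting the flag cannot by
itself cause a stream of postmaster wakeups.  Locking and atomics are
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, chosen only for illustration. */
#define GROW_DEPTH_THRESHOLD 16   /* assumed cutoff beyond the usual tail */
#define GROW_MIN_INTERVAL_MS 100  /* assumed minimum gap between signals */

/* Hypothetical stand-in for shared state; a real version needs locking. */
typedef struct PoolState
{
    bool     grow_requested;  /* latch-like flag: set once, cleared later */
    uint64_t last_signal_ms;  /* when the supervisor was last signalled */
    int      signals_sent;    /* counts signals, for illustration only */
} PoolState;

/*
 * Primary condition: we wanted to wake a worker but none was idle.
 * Secondary condition: the queue is deeper than the threshold.
 */
bool
should_request_grow(bool found_idle_worker, int queue_depth)
{
    assert(queue_depth >= 0);
    if (found_idle_worker)
        return false;
    return queue_depth >= GROW_DEPTH_THRESHOLD;
}

/*
 * Set the flag and "send a signal" only if the flag is not already set
 * and enough time has passed since the last signal.  Returns true if a
 * signal would be sent.
 */
bool
request_grow(PoolState *state, uint64_t now_ms)
{
    if (state->grow_requested)
        return false;           /* already pending, like a set latch */
    if (state->signals_sent > 0 &&
        now_ms - state->last_signal_ms < GROW_MIN_INTERVAL_MS)
        return false;           /* too soon after the previous signal */

    state->grow_requested = true;
    state->last_signal_ms = now_ms;
    state->signals_sent++;
    return true;
}

/* Clearing the flag keeps the timestamp, so churn stays rate-limited. */
void
cancel_grow(PoolState *state)
{
    state->grow_requested = false;
}
```

With these assumed constants, a cancel followed immediately by a new
request within 100ms sends nothing; at most one signal is sent per
interval, however often the flag flips.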