Automatically sizing the IO worker pool


24+ messages / 5 participants

* Automatically sizing the IO worker pool
@ 2025-04-12 16:59 Thomas Munro <[email protected]>
  2025-04-13 17:45 ` Re: Automatically sizing the IO worker pool Jose Luis Tallon <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  0 siblings, 2 replies; 24+ messages in thread

From: Thomas Munro @ 2025-04-12 16:59 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>

It's hard to know how to set io_workers (default 3).  If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO.  If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it.  Here's a patch to replace that GUC with:

      io_min_workers=1
      io_max_workers=8
      io_worker_idle_timeout=60s
      io_worker_launch_interval=500ms
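
For readers who haven't looked at method_worker.c, here is a rough,
standalone sketch of the overflow behaviour described above: a small
power-of-two ring buffer whose insert fails when full, leaving the
submitter to run the IO itself.  The Sketch* names and the tiny queue
size are made up for illustration; the real queue lives in shared memory
and is protected by AioWorkerSubmissionQueueLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_QUEUE_SIZE 4		/* hypothetical; must be a power of two */

typedef struct SketchQueue
{
	uint32_t	head;			/* next slot to write */
	uint32_t	tail;			/* next slot to read */
	uint32_t	sqes[SKETCH_QUEUE_SIZE];
} SketchQueue;

/* Returns false when the queue is full; the caller then does the IO itself. */
static bool
sketch_queue_insert(SketchQueue *queue, uint32_t io_id)
{
	uint32_t	new_head = (queue->head + 1) & (SKETCH_QUEUE_SIZE - 1);

	if (new_head == queue->tail)
		return false;			/* full: one slot is always left unused */

	queue->sqes[queue->head] = io_id;
	queue->head = new_head;
	return true;
}

/* Submit a batch; returns how many IOs overflowed and must run synchronously. */
static int
sketch_submit(SketchQueue *queue, const uint32_t *io_ids, int nios)
{
	int			nsync = 0;

	assert(nios >= 0);
	for (int i = 0; i < nios; ++i)
	{
		if (!sketch_queue_insert(queue, io_ids[i]))
			nsync++;
	}
	return nsync;
}
```

With a size of 4 the queue holds at most three entries, so submitting
five IOs to an empty queue leaves two to be performed synchronously.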

The patch grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out.  IO jobs were already
concentrated into the lowest numbered workers, partly because that
seemed to have marginally better latency than anything else tried so
far due to latch collapsing with lucky timing, and partly in
anticipation of this.
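
As a simplified, standalone model of the growth logic in the attached
0004 patch: a submitter that finds no idle worker flags a backlog when
the queue depth has reached the current pool size, and the launcher
starts at most one extra worker per launch interval, up to the maximum.
The Sketch* names and the explicit now_ms argument are assumptions for
illustration only; the patch itself keeps this state in shared memory
and in the postmaster, and the idle timeout is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct SketchPool
{
	int			min_workers;
	int			max_workers;
	int			nworkers;
	int64_t		launch_interval_ms;
	int64_t		launch_delay_until_ms;	/* 0 means no delay in force */
	bool		new_worker_needed;
} SketchPool;

/* Submitter side: called when no idle worker was available to wake. */
static void
sketch_consider_new_worker(SketchPool *pool, uint32_t queue_depth)
{
	assert(pool->nworkers >= 0);
	if (!pool->new_worker_needed &&
		queue_depth >= (uint32_t) pool->nworkers)
		pool->new_worker_needed = true;
}

/* Launcher side: returns true if a worker was started at time now_ms. */
static bool
sketch_maybe_launch(SketchPool *pool, int64_t now_ms)
{
	/* Cancel the launch delay once it has expired. */
	if (pool->launch_delay_until_ms != 0 &&
		pool->launch_delay_until_ms <= now_ms)
		pool->launch_delay_until_ms = 0;

	if (pool->launch_delay_until_ms != 0)
		return false;			/* still rate limited */
	if (pool->nworkers >= pool->max_workers)
		return false;			/* pool already at its ceiling */

	if (pool->nworkers >= pool->min_workers)
	{
		/* Above the minimum, only grow on demand, and rate limit it. */
		if (!pool->new_worker_needed)
			return false;
		pool->new_worker_needed = false;
		pool->launch_delay_until_ms = now_ms + pool->launch_interval_ms;
	}

	pool->nworkers++;
	return true;
}
```

Note that the minimum pool is built without any delay, while growth
beyond it is paced by the launch interval.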

The patch also reduces bogus wakeups somewhat by being more
cautious about fanout.  That could probably be improved a lot more and
needs more research.  It's quite tricky to figure out how to suppress
wakeups without throwing potential concurrency away.
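
To make the wakeup rule concrete, here is a small standalone sketch of
how an idle worker might be chosen from a bitmask: backends take the
lowest numbered idle worker, while a worker only considers peers
numbered at or above itself, which in practice means higher, since a
busy worker is not in the idle set.  It mirrors the idea in the attached
patch, but the sketch_* names are hypothetical, the bit scan is a plain
loop rather than pg_rightmost_one_pos64(), and locking is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Position of the lowest set bit; the set must not be empty. */
static int
sketch_lowest_bit(uint64_t set)
{
	int			pos = 0;

	assert(set != 0);
	while ((set & 1) == 0)
	{
		set >>= 1;
		pos++;
	}
	return pos;
}

/*
 * Pick an idle worker and clear its bit, or return -1 if none qualifies.
 * my_worker_id is -1 for a regular backend, else the caller's worker number
 * (assumed to be less than 64).
 */
static int
sketch_choose_idle(uint64_t *idle_mask, int my_worker_id)
{
	uint64_t	candidates = *idle_mask;
	int			worker;

	/* Workers ignore peers with lower numbers than their own. */
	if (my_worker_id >= 0)
		candidates &= ~((UINT64_C(1) << my_worker_id) - 1);

	if (candidates == 0)
		return -1;

	worker = sketch_lowest_bit(candidates);
	*idle_mask &= ~(UINT64_C(1) << worker);
	return worker;
}
```

For example, with workers 1, 2 and 4 idle, a backend would pick worker
1, whereas worker 3 would skip worker 2 and pick worker 4.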

The first couple of patches are independent of this topic, and might
be potential cleanups/fixes for master/v18.  The last is a simple
latency test.

Ideas, testing, flames etc welcome.


Attachments:

  [text/x-patch] 0001-aio-Regularize-io_method-worker-naming-conventions.patch (6.3K, 2-0001-aio-Regularize-io_method-worker-naming-conventions.patch)
From 1dbba36f67df5d3d34a990613d6d68d15caf1b17 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 29 Mar 2025 13:25:27 +1300
Subject: [PATCH 1/5] aio: Regularize io_method=worker naming conventions.

method_worker.c didn't keep up with the pattern of PgAioXXX for type
names in the pgaio module.  Add the missing "Pg" prefix used elsewhere.

Likewise for pgaio_choose_idle_worker() which alone failed to use a
pgaio_worker_XXX() name reflecting its submodule.  Rename.

Standardize on parameter names num_staged_ios, staged_ios for the
internal submission function.

Rename the array of handle IDs in PgAioSubmissionQueue to sqes,
since that's a term of art seen in many of these types of systems.
---
 src/backend/storage/aio/method_worker.c | 54 ++++++++++++-------------
 src/tools/pgindent/typedefs.list        |  6 +--
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 8ad17ec1ef7..ba5bc5e44ba 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -51,26 +51,26 @@
 #define IO_WORKER_WAKEUP_FANOUT 2
 
 
-typedef struct AioWorkerSubmissionQueue
+typedef struct PgAioWorkerSubmissionQueue
 {
 	uint32		size;
 	uint32		mask;
 	uint32		head;
 	uint32		tail;
-	uint32		ios[FLEXIBLE_ARRAY_MEMBER];
-} AioWorkerSubmissionQueue;
+	uint32		sqes[FLEXIBLE_ARRAY_MEMBER];
+} PgAioWorkerSubmissionQueue;
 
-typedef struct AioWorkerSlot
+typedef struct PgAioWorkerSlot
 {
 	Latch	   *latch;
 	bool		in_use;
-} AioWorkerSlot;
+} PgAioWorkerSlot;
 
-typedef struct AioWorkerControl
+typedef struct PgAioWorkerControl
 {
 	uint64		idle_worker_mask;
-	AioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
-} AioWorkerControl;
+	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
+} PgAioWorkerControl;
 
 
 static size_t pgaio_worker_shmem_size(void);
@@ -95,8 +95,8 @@ int			io_workers = 3;
 
 static int	io_worker_queue_size = 64;
 static int	MyIoWorkerId;
-static AioWorkerSubmissionQueue *io_worker_submission_queue;
-static AioWorkerControl *io_worker_control;
+static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
+static PgAioWorkerControl *io_worker_control;
 
 
 static size_t
@@ -105,15 +105,15 @@ pgaio_worker_queue_shmem_size(int *queue_size)
 	/* Round size up to next power of two so we can make a mask. */
 	*queue_size = pg_nextpower2_32(io_worker_queue_size);
 
-	return offsetof(AioWorkerSubmissionQueue, ios) +
+	return offsetof(PgAioWorkerSubmissionQueue, sqes) +
 		sizeof(uint32) * *queue_size;
 }
 
 static size_t
 pgaio_worker_control_shmem_size(void)
 {
-	return offsetof(AioWorkerControl, workers) +
-		sizeof(AioWorkerSlot) * MAX_IO_WORKERS;
+	return offsetof(PgAioWorkerControl, workers) +
+		sizeof(PgAioWorkerSlot) * MAX_IO_WORKERS;
 }
 
 static size_t
@@ -161,7 +161,7 @@ pgaio_worker_shmem_init(bool first_time)
 }
 
 static int
-pgaio_choose_idle_worker(void)
+pgaio_worker_choose_idle(void)
 {
 	int			worker;
 
@@ -178,7 +178,7 @@ pgaio_choose_idle_worker(void)
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
-	AioWorkerSubmissionQueue *queue;
+	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
 	queue = io_worker_submission_queue;
@@ -190,7 +190,7 @@ pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 		return false;			/* full */
 	}
 
-	queue->ios[queue->head] = pgaio_io_get_id(ioh);
+	queue->sqes[queue->head] = pgaio_io_get_id(ioh);
 	queue->head = new_head;
 
 	return true;
@@ -199,14 +199,14 @@ pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 static uint32
 pgaio_worker_submission_queue_consume(void)
 {
-	AioWorkerSubmissionQueue *queue;
+	PgAioWorkerSubmissionQueue *queue;
 	uint32		result;
 
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return UINT32_MAX;		/* empty */
 
-	result = queue->ios[queue->tail];
+	result = queue->sqes[queue->tail];
 	queue->tail = (queue->tail + 1) & (queue->size - 1);
 
 	return result;
@@ -239,37 +239,37 @@ pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
 }
 
 static void
-pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
+pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
 	int			nsync = 0;
 	Latch	   *wakeup = NULL;
 	int			worker;
 
-	Assert(nios <= PGAIO_SUBMIT_BATCH_SIZE);
+	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
 	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	for (int i = 0; i < nios; ++i)
+	for (int i = 0; i < num_staged_ios; ++i)
 	{
-		Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
-		if (!pgaio_worker_submission_queue_insert(ios[i]))
+		Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
+		if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
 		{
 			/*
 			 * We'll do it synchronously, but only after we've sent as many as
 			 * we can to workers, to maximize concurrency.
 			 */
-			synchronous_ios[nsync++] = ios[i];
+			synchronous_ios[nsync++] = staged_ios[i];
 			continue;
 		}
 
 		if (wakeup == NULL)
 		{
 			/* Choose an idle worker to wake up if we haven't already. */
-			worker = pgaio_choose_idle_worker();
+			worker = pgaio_worker_choose_idle();
 			if (worker >= 0)
 				wakeup = io_worker_control->workers[worker].latch;
 
-			pgaio_debug_io(DEBUG4, ios[i],
+			pgaio_debug_io(DEBUG4, staged_ios[i],
 						   "choosing worker %d",
 						   worker);
 		}
@@ -482,7 +482,7 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 						   IO_WORKER_WAKEUP_FANOUT);
 			for (int i = 0; i < nwakeups; ++i)
 			{
-				if ((worker = pgaio_choose_idle_worker()) < 0)
+				if ((worker = pgaio_worker_choose_idle()) < 0)
 					break;
 				latches[nlatches++] = io_worker_control->workers[worker].latch;
 			}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d16bc208654..9946cfcec41 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -55,9 +55,6 @@ AggStrategy
 AggTransInfo
 Aggref
 AggregateInstrumentation
-AioWorkerControl
-AioWorkerSlot
-AioWorkerSubmissionQueue
 AlenState
 Alias
 AllocBlock
@@ -2175,6 +2172,9 @@ PgAioTargetID
 PgAioTargetInfo
 PgAioUringContext
 PgAioWaitRef
+PgAioWorkerControl
+PgAioWorkerSlot
+PgAioWorkerSubmissionQueue
 PgArchData
 PgBackendGSSStatus
 PgBackendSSLStatus
-- 
2.39.5



  [text/x-patch] 0002-aio-Remove-IO-worker-ID-references-from-postmaster.c.patch (2.5K, 3-0002-aio-Remove-IO-worker-ID-references-from-postmaster.c.patch)
From 99c9a303d37d7e2232d3c28ee091aed82fe5b8eb Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Fri, 11 Apr 2025 23:10:10 +1200
Subject: [PATCH 2/5] aio: Remove IO worker ID references from postmaster.c.

An ancient ancestor of this code had the postmaster assign IDs to IO
workers.  Now it tracks them in an unordered array, and it might be
confusing to readers that it refers to their indexes as IDs in various
places.  Fix.
---
 src/backend/postmaster/postmaster.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 17fed96fe20..0e8623dea18 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4337,15 +4337,15 @@ maybe_start_bgworkers(void)
 static bool
 maybe_reap_io_worker(int pid)
 {
-	for (int id = 0; id < MAX_IO_WORKERS; ++id)
+	for (int i = 0; i < MAX_IO_WORKERS; ++i)
 	{
-		if (io_worker_children[id] &&
-			io_worker_children[id]->pid == pid)
+		if (io_worker_children[i] &&
+			io_worker_children[i]->pid == pid)
 		{
-			ReleasePostmasterChildSlot(io_worker_children[id]);
+			ReleasePostmasterChildSlot(io_worker_children[i]);
 
 			--io_worker_count;
-			io_worker_children[id] = NULL;
+			io_worker_children[i] = NULL;
 			return true;
 		}
 	}
@@ -4389,22 +4389,22 @@ maybe_adjust_io_workers(void)
 	while (io_worker_count < io_workers)
 	{
 		PMChild    *child;
-		int			id;
+		int			i;
 
 		/* find unused entry in io_worker_children array */
-		for (id = 0; id < MAX_IO_WORKERS; ++id)
+		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
-			if (io_worker_children[id] == NULL)
+			if (io_worker_children[i] == NULL)
 				break;
 		}
-		if (id == MAX_IO_WORKERS)
-			elog(ERROR, "could not find a free IO worker ID");
+		if (i == MAX_IO_WORKERS)
+			elog(ERROR, "could not find a free IO worker slot");
 
 		/* Try to launch one. */
 		child = StartChildProcess(B_IO_WORKER);
 		if (child != NULL)
 		{
-			io_worker_children[id] = child;
+			io_worker_children[i] = child;
 			++io_worker_count;
 		}
 		else
@@ -4415,11 +4415,11 @@ maybe_adjust_io_workers(void)
 	if (io_worker_count > io_workers)
 	{
 		/* ask the IO worker in the highest slot to exit */
-		for (int id = MAX_IO_WORKERS - 1; id >= 0; --id)
+		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[id] != NULL)
+			if (io_worker_children[i] != NULL)
 			{
-				kill(io_worker_children[id]->pid, SIGUSR2);
+				kill(io_worker_children[i]->pid, SIGUSR2);
 				break;
 			}
 		}
-- 
2.39.5



  [text/x-patch] 0003-aio-Try-repeatedly-to-give-batched-IOs-to-workers.patch (1.8K, 4-0003-aio-Try-repeatedly-to-give-batched-IOs-to-workers.patch)
From a90a692725eedd692f934bf3ed56a2e3a7f3fc2c Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Fri, 11 Apr 2025 21:17:26 +1200
Subject: [PATCH 3/5] aio: Try repeatedly to give batched IOs to workers.

Previously, if the first of a batch of IOs didn't fit in the queue we'd
run all of them synchronously.  Andres rightly pointed out that we
should really try again between synchronous IOs, since the workers might
have made progress.

Suggested-by: Andres Freund <[email protected]>
---
 src/backend/storage/aio/method_worker.c | 30 ++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index ba5bc5e44ba..c20d6d0f18b 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -280,12 +280,36 @@ pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 		SetLatch(wakeup);
 
 	/* Run whatever is left synchronously. */
-	if (nsync > 0)
+	for (int i = 0; i < nsync; ++i)
 	{
-		for (int i = 0; i < nsync; ++i)
+		wakeup = NULL;
+
+		/*
+		 * Between synchronous IO operations, try again to enqueue as many as
+		 * we can.
+		 */
+		if (i > 0)
 		{
-			pgaio_io_perform_synchronously(synchronous_ios[i]);
+			wakeup = NULL;
+
+			LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+			while (i < nsync &&
+				   pgaio_worker_submission_queue_insert(synchronous_ios[i]))
+			{
+				if (wakeup == NULL && (worker = pgaio_worker_choose_idle()) >= 0)
+					wakeup = io_worker_control->workers[worker].latch;
+				i++;
+			}
+			LWLockRelease(AioWorkerSubmissionQueueLock);
+
+			if (wakeup)
+				SetLatch(wakeup);
+
+			if (i == nsync)
+				break;
 		}
+
+		pgaio_io_perform_synchronously(synchronous_ios[i]);
 	}
 }
 
-- 
2.39.5



  [text/x-patch] 0004-aio-Adjust-IO-worker-pool-size-automatically.patch (33.5K, 5-0004-aio-Adjust-IO-worker-pool-size-automatically.patch)
From 02325442bea440e65b5f3817c3fb8bd4681bbd25 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH 4/5] aio: Adjust IO worker pool size automatically.

Replace the simple io_workers setting with:

  io_min_workers=1
  io_max_workers=8
  io_worker_idle_timeout=60s
  io_worker_launch_interval=500ms

The pool is automatically sized within the configured range according
to demand.

XXX WIP
---
 doc/src/sgml/config.sgml                      |  70 ++-
 src/backend/postmaster/postmaster.c           |  64 ++-
 src/backend/storage/aio/method_worker.c       | 450 ++++++++++++++----
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_tables.c           |  46 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +-
 src/include/storage/io_worker.h               |   9 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 10 files changed, 541 insertions(+), 121 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1674c22cb2..9f2e7ae6785 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2769,16 +2769,76 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>int</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>int</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes to use. The default is
+         1. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>int</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes to use. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>int</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which idle I/O worker processes will exit, reducing the
+         size of the I/O worker pool towards the minimum.  The default
+         is 1 minute.
+         This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>int</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time between launching new I/O workers.  This can be used to avoid
+         sudden bursts of new I/O workers.  The default is 500ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0e8623dea18..b3f68897194 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -408,6 +408,7 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_delay_until = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -1569,6 +1570,15 @@ DetermineSleepTime(void)
 	if (StartWorkerNeeded)
 		return 0;
 
+	/* If we need a new IO worker, defer until launch delay expires. */
+	if (pgaio_worker_test_new_worker_needed() &&
+		io_worker_count < io_max_workers)
+	{
+		if (io_worker_launch_delay_until == 0)
+			return 0;
+		next_wakeup = io_worker_launch_delay_until;
+	}
+
 	if (HaveCrashedWorker)
 	{
 		dlist_mutable_iter iter;
@@ -3750,6 +3760,15 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_CHANGE))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_adjust_io_workers() will run.
+		 */
+	}
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4355,8 +4374,9 @@ maybe_reap_io_worker(int pid)
 /*
  * Start or stop IO workers, to close the gap between the number of running
  * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
+ * the io_{min,max}_workers GUCs (by increasing and decreasing the number of
+ * workers) and requests to start a new one due to submission queue backlog,
+ * as well as workers terminating in response to errors (by starting
  * "replacement" workers).
  */
 static void
@@ -4385,8 +4405,16 @@ maybe_adjust_io_workers(void)
 
 	Assert(pmState < PM_WAIT_IO_WORKERS);
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	/* Cancel the launch delay when it expires to minimize clock access. */
+	if (io_worker_launch_delay_until != 0 &&
+		io_worker_launch_delay_until <= GetCurrentTimestamp())
+		io_worker_launch_delay_until = 0;
+
+	/* Not enough workers running? */
+	while (io_worker_launch_delay_until == 0 &&
+		   io_worker_count < io_max_workers &&
+		   ((io_worker_count < io_min_workers ||
+			 pgaio_worker_clear_new_worker_needed())))
 	{
 		PMChild    *child;
 		int			i;
@@ -4400,6 +4428,16 @@ maybe_adjust_io_workers(void)
 		if (i == MAX_IO_WORKERS)
 			elog(ERROR, "could not find a free IO worker slot");
 
+		/*
+		 * Apply launch delay even for failures to avoid retrying too fast on
+		 * fork() failure, but not while we're still building the minimum pool
+		 * size.
+		 */
+		if (io_worker_count >= io_min_workers)
+			io_worker_launch_delay_until =
+				TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+											io_worker_launch_interval);
+
 		/* Try to launch one. */
 		child = StartChildProcess(B_IO_WORKER);
 		if (child != NULL)
@@ -4411,19 +4449,11 @@ maybe_adjust_io_workers(void)
 			break;				/* try again next time */
 	}
 
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
-		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
-		}
-	}
+	/*
+	 * If there are too many running because io_max_workers changed, that will
+	 * be handled by the IO workers themselves so they can shut down in
+	 * preferred order.
+	 */
 }
 
 
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index c20d6d0f18b..78817bb4196 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,10 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * When a batch of IOs is submitted, the lowest numbered idle worker is woken
+ * up.  If it sees more work in the queue it wakes a peer to help, and so on
+ * in a chain.  When a backlog is detected, the pool size is increased.  The
+ * highest numbered worker times out after a period of inactivity.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -40,16 +41,16 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/memdebug.h"
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
-
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Saturation for stats counters used to estimate wakeup:work ratio. */
+#define PGAIO_WORKER_STATS_MAX 64
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -62,17 +63,25 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
 typedef struct PgAioWorkerControl
 {
+	/* Seen by postmaster */
+	volatile bool new_worker_needed;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
 	uint64		idle_worker_mask;
+
+	/* Protected by AioWorkerControlLock. */
+	uint64		worker_set;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
-
 static size_t pgaio_worker_shmem_size(void);
 static void pgaio_worker_shmem_init(bool first_time);
 
@@ -90,11 +99,14 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 1;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 500;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
@@ -151,36 +163,171 @@ pgaio_worker_shmem_init(bool first_time)
 						&found);
 	if (!found)
 	{
-		io_worker_control->idle_worker_mask = 0;
+		io_worker_control->new_worker_needed = false;
+		io_worker_control->worker_set = 0;
 		for (int i = 0; i < MAX_IO_WORKERS; ++i)
-		{
-			io_worker_control->workers[i].latch = NULL;
-			io_worker_control->workers[i].in_use = false;
-		}
+			io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+	}
+}
+
+static void
+pgaio_worker_consider_new_worker(uint32 queue_depth)
+{
+	/*
+	 * This is called from sites that don't hold AioWorkerControlLock, but it
+	 * changes infrequently and an up to date value is not required for this
+	 * heuristic purpose.
+	 */
+	if (!io_worker_control->new_worker_needed &&
+		queue_depth >= io_worker_control->nworkers)
+	{
+		io_worker_control->new_worker_needed = true;
+		SendPostmasterSignal(PMSIGNAL_IO_WORKER_CHANGE);
 	}
 }
 
+/*
+ * Called by a worker when the queue is empty, to try to prevent a delayed
+ * reaction to a brief burst.  This races against the postmaster acting on the
+ * old value if it was recently set to true, but that's OK, the ordering would
+ * be indeterminate anyway even if we could use locks in the postmaster.
+ */
+static void
+pgaio_worker_cancel_new_worker(void)
+{
+	io_worker_control->new_worker_needed = false;
+}
+
+/*
+ * Called by the postmaster to check if a new worker is needed.
+ */
+bool
+pgaio_worker_test_new_worker_needed(void)
+{
+	return io_worker_control->new_worker_needed;
+}
+
+/*
+ * Called by the postmaster to check if a new worker is needed when it's ready
+ * to launch one, and clear the flag.
+ */
+bool
+pgaio_worker_clear_new_worker_needed(void)
+{
+	bool		result;
+
+	result = io_worker_control->new_worker_needed;
+	if (result)
+		io_worker_control->new_worker_needed = false;
+
+	return result;
+}
+
+static uint64
+pgaio_worker_mask(int worker)
+{
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_worker_add(uint64 *set, int worker)
+{
+	*set |= pgaio_worker_mask(worker);
+}
+
+static void
+pgaio_worker_remove(uint64 *set, int worker)
+{
+	*set &= ~pgaio_worker_mask(worker);
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_worker_in(uint64 set, int worker)
+{
+	return (set & pgaio_worker_mask(worker)) != 0;
+}
+#endif
+
+static uint64
+pgaio_worker_highest(uint64 set)
+{
+	return pg_leftmost_one_pos64(set);
+}
+
+static uint64
+pgaio_worker_lowest(uint64 set)
+{
+	return pg_rightmost_one_pos64(set);
+}
+
+static int
+pgaio_worker_pop(uint64 *set)
+{
+	int			worker;
+
+	Assert(*set != 0);
+	worker = pgaio_worker_lowest(*set);
+	pgaio_worker_remove(set, worker);
+	return worker;
+}
+
 static int
 pgaio_worker_choose_idle(void)
 {
+	uint64		idle_worker_mask;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	/*
+	 * Workers only wake higher numbered workers, to try to encourage an
+	 * ordering of wakeup:work ratios, reducing spurious wakeups in lower
+	 * numbered workers.
+	 */
+	idle_worker_mask = io_worker_control->idle_worker_mask;
+	if (MyIoWorkerId != -1)
+		idle_worker_mask &= ~(pgaio_worker_mask(MyIoWorkerId) - 1);
+
+	if (idle_worker_mask == 0)
 		return -1;
 
 	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
+	worker = pgaio_worker_lowest(idle_worker_mask);
+	pgaio_worker_remove(&io_worker_control->idle_worker_mask, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_worker_mask. That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -202,6 +349,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return UINT32_MAX;		/* empty */
@@ -218,6 +367,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -242,9 +393,9 @@ static void
 pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+	uint32		queue_depth;
+	int			worker = -1;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -259,51 +410,48 @@ pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 			 * we can to workers, to maximize concurrency.
 			 */
 			synchronous_ios[nsync++] = staged_ios[i];
-			continue;
 		}
-
-		if (wakeup == NULL)
+		else if (worker == -1)
 		{
 			/* Choose an idle worker to wake up if we haven't already. */
 			worker = pgaio_worker_choose_idle();
-			if (worker >= 0)
-				wakeup = io_worker_control->workers[worker].latch;
 
 			pgaio_debug_io(DEBUG4, staged_ios[i],
 						   "choosing worker %d",
 						   worker);
 		}
 	}
+	queue_depth = pgaio_worker_submission_queue_depth();
 	LWLockRelease(AioWorkerSubmissionQueueLock);
 
-	if (wakeup)
-		SetLatch(wakeup);
+	if (worker != -1)
+		pgaio_worker_wake(worker);
+	else
+		pgaio_worker_consider_new_worker(queue_depth);
 
 	/* Run whatever is left synchronously. */
 	for (int i = 0; i < nsync; ++i)
 	{
-		wakeup = NULL;
-
 		/*
 		 * Between synchronous IO operations, try again to enqueue as many as
 		 * we can.
 		 */
 		if (i > 0)
 		{
-			wakeup = NULL;
+			worker = -1;
 
 			LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 			while (i < nsync &&
 				   pgaio_worker_submission_queue_insert(synchronous_ios[i]))
 			{
-				if (wakeup == NULL && (worker = pgaio_worker_choose_idle()) >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
+				if (worker == -1)
+					worker = pgaio_worker_choose_idle();
 				i++;
 			}
 			LWLockRelease(AioWorkerSubmissionQueueLock);
 
-			if (wakeup)
-				SetLatch(wakeup);
+			if (worker != -1)
+				pgaio_worker_wake(worker);
 
 			if (i == nsync)
 				break;
@@ -335,13 +483,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	uint64		notify_set;
 
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_worker_remove(&io_worker_control->idle_worker_mask, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_worker_in(io_worker_control->worker_set, MyIoWorkerId));
+	pgaio_worker_remove(&io_worker_control->worker_set, MyIoWorkerId);
+	notify_set = io_worker_control->worker_set;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pg_popcount64(io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
+	while (notify_set != 0)
+		pgaio_worker_wake(pgaio_worker_pop(&notify_set));
 }
 
 /*
@@ -351,33 +513,37 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
-	MyIoWorkerId = -1;
+	uint64		worker_set_inverted;
+	uint64		old_worker_set;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	MyIoWorkerId = -1;
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	worker_set_inverted = ~io_worker_control->worker_set;
+	if (worker_set_inverted != 0)
 	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
+		MyIoWorkerId = pgaio_worker_lowest(worker_set_inverted);
+		if (MyIoWorkerId >= MAX_IO_WORKERS)
+			MyIoWorkerId = -1;
 	}
-
 	if (MyIoWorkerId == -1)
 		elog(ERROR, "couldn't find a free worker slot");
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
+
+	old_worker_set = io_worker_control->worker_set;
+	Assert(!pgaio_worker_in(old_worker_set, MyIoWorkerId));
+	pgaio_worker_add(&io_worker_control->worker_set, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(pg_popcount64(io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
+	while (old_worker_set != 0)
+		pgaio_worker_wake(pgaio_worker_pop(&old_worker_set));
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -403,14 +569,47 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	uint64		worker_set;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	worker_set = io_worker_control->worker_set;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_worker_highest(worker_set))
+		return false;
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			ios = 0;
+	int			wakeups = 0;
 
 	MyBackendType = B_IO_WORKER;
 	AuxiliaryProcessMainCommon();
@@ -479,47 +678,53 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		uint32		queue_depth;
+		int			worker = -1;
 
 		/* Try to get a job to do. */
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-		if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+		io_index = pgaio_worker_submission_queue_consume();
+		queue_depth = pgaio_worker_submission_queue_depth();
+		if (io_index == UINT32_MAX)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_worker_add(&io_worker_control->idle_worker_mask,
+							 MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_worker_remove(&io_worker_control->idle_worker_mask,
+								MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
-			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
-			}
+			/*
+			 * See if we should wake up a peer.  Only do this if this worker
+			 * is not experiencing spurious wakeups itself, to end a chain of
+			 * wasted scheduling.
+			 */
+			if (queue_depth > 0 && wakeups <= ios)
+				worker = pgaio_worker_choose_idle();
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
+		else if (wakeups <= ios)
+			pgaio_worker_consider_new_worker(queue_depth);
 
 		if (io_index != UINT32_MAX)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++ios == PGAIO_WORKER_STATS_MAX)
+			{
+				ios /= 2;
+				wakeups /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -585,12 +790,83 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker if pending. */
+			pgaio_worker_cancel_new_worker();
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* On first sleep, compute absolute timeout. */
+				if (idle_timeout_abs == 0)
+				{
+					idle_timeout_abs =
+						TimestampTzPlusMilliseconds(now,
+													io_worker_idle_timeout);
+					timeout_guc_used = io_worker_idle_timeout;
+				}
+
+				/*
+				 * All workers maintain the absolute timeout value, but only
+				 * the highest worker can actually time out and only if
+				 * io_min_workers is exceeded.  All others wait only for
+				 * explicit wakeups caused by queue insertion, wakeup
+				 * propagation, change of pool size (possibly making them
+				 * highest), or GUC reload.
+				 */
+				if (pgaio_worker_can_timeout())
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now,
+														idle_timeout_abs);
+				else
+					timeout_ms = -1;
+			}
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++wakeups == PGAIO_WORKER_STATS_MAX)
+				{
+					ios /= 2;
+					wakeups /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
 		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+		{
+			ConfigReloadPending = false;
+			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
+		}
 	}
 
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 930321905f1..067a3a1bb21 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -353,6 +353,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..bbb8855b12d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3306,14 +3306,52 @@ struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"io_workers",
+		{"io_max_workers",
 			PGC_SIGHUP,
 			RESOURCES_IO,
-			gettext_noop("Number of IO worker processes, for io_method=worker."),
+			gettext_noop("Maximum number of IO worker processes, for io_method=worker."),
 			NULL,
 		},
-		&io_workers,
-		3, 1, MAX_IO_WORKERS,
+		&io_max_workers,
+		8, 1, MAX_IO_WORKERS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"io_min_workers",
+			PGC_SIGHUP,
+			RESOURCES_IO,
+			gettext_noop("Minimum number of IO worker processes, for io_method=worker."),
+			NULL,
+		},
+		&io_min_workers,
+		1, 1, MAX_IO_WORKERS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"io_worker_idle_timeout",
+			PGC_SIGHUP,
+			RESOURCES_IO,
+			gettext_noop("Maximum idle time before IO workers exit, for io_method=worker."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&io_worker_idle_timeout,
+		60 * 1000, -1, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"io_worker_launch_interval",
+			PGC_SIGHUP,
+			RESOURCES_IO,
+			gettext_noop("Minimum time between launching IO workers, for io_method=worker."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&io_worker_launch_interval,
+		500, 0, INT_MAX,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34826d01380..4370f673821 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -214,7 +214,10 @@
 					# can execute simultaneously
 					# -1 sets based on shared_buffers
 					# (change requires restart)
-#io_workers = 3				# 1-32;
+#io_min_workers = 1			# 1-32;
+#io_max_workers = 8			# 1-32;
+#io_worker_idle_timeout = 60s		# -1 means never time out
+#io_worker_launch_interval = 500ms	# min 0ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index 7bde7e89c8a..de9c80109e0 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,13 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_test_new_worker_needed(void);
+extern bool pgaio_worker_clear_new_worker_needed(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..c1801d08833 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, AioWorkerControl)
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 67fa9ac06e1..10a967f6739 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_CHANGE,	/* IO worker pool change */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index af5fae15ea7..a0252857798 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,23 +65,23 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		ok( $stderr =~
-			  /$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			  /$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating number of io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating number of io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
-- 
2.39.5



  [text/x-patch] 0005-XXX-read_buffer_loop.patch (3.0K, 6-0005-XXX-read_buffer_loop.patch)
From 43fea48f5f6e9b3301a0216f0402b2558862d632 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 5 Apr 2025 11:14:26 +1300
Subject: [PATCH 5/5] XXX read_buffer_loop

select read_buffer_loop(n) with different values of n in each
session to test latency of reading one block.
---
 src/test/modules/test_aio/test_aio--1.0.sql |  4 ++
 src/test/modules/test_aio/test_aio.c        | 59 +++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/src/test/modules/test_aio/test_aio--1.0.sql b/src/test/modules/test_aio/test_aio--1.0.sql
index e495481c41e..c37b38afcb0 100644
--- a/src/test/modules/test_aio/test_aio--1.0.sql
+++ b/src/test/modules/test_aio/test_aio--1.0.sql
@@ -106,3 +106,7 @@ AS 'MODULE_PATHNAME' LANGUAGE C;
 CREATE FUNCTION inj_io_reopen_detach()
 RETURNS pg_catalog.void STRICT
 AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION read_buffer_loop(block int)
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c
index 1d776010ef4..2654302a13c 100644
--- a/src/test/modules/test_aio/test_aio.c
+++ b/src/test/modules/test_aio/test_aio.c
@@ -18,6 +18,8 @@
 
 #include "postgres.h"
 
+#include <math.h>
+
 #include "access/relation.h"
 #include "fmgr.h"
 #include "storage/aio.h"
@@ -27,6 +29,7 @@
 #include "storage/checksum.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/read_stream.h"
 #include "utils/builtins.h"
 #include "utils/injection_point.h"
 #include "utils/rel.h"
@@ -806,3 +809,59 @@ inj_io_reopen_detach(PG_FUNCTION_ARGS)
 #endif
 	PG_RETURN_VOID();
 }
+
+static BlockNumber
+zero_callback(ReadStream *stream, void *user_data, void *pbd)
+{
+	return *(BlockNumber *) user_data;
+}
+
+PG_FUNCTION_INFO_V1(read_buffer_loop);
+Datum
+read_buffer_loop(PG_FUNCTION_ARGS)
+{
+	BlockNumber block = PG_GETARG_UINT32(0);
+	Relation	rel;
+	ReadStream *stream;
+	Buffer		buffer;
+	TimestampTz start;
+
+	rel = relation_open(TypeRelationId, AccessShareLock);
+	stream = read_stream_begin_relation(0, NULL, rel, MAIN_FORKNUM, zero_callback, &block, 0);
+	for (int loop = 0; loop < 10; loop++)
+	{
+		double		samples[25000];
+		double		avg = 0;
+		double		sum = 0;
+		double		var = 0;
+		double		dev;
+		double		stddev;
+
+		for (int i = 0; i < lengthof(samples); ++i)
+		{
+			bool flushed;
+
+			start = GetCurrentTimestamp();
+			buffer = read_stream_next_buffer(stream, NULL);
+			samples[i] = GetCurrentTimestamp() - start;
+			sum += samples[i];
+
+			ReleaseBuffer(buffer);
+			read_stream_reset(stream);
+			EvictUnpinnedBuffer(buffer, &flushed);
+		}
+		avg = sum / lengthof(samples);
+		for (int i = 0; i < lengthof(samples); i++)
+		{
+			dev = samples[i] - avg;
+			var += dev * dev;
+		}
+		stddev = sqrt(var / lengthof(samples));
+
+		elog(NOTICE, "n = %zu, avg = %.1fus, stddev = %.1fus", lengthof(samples), avg, stddev);
+	}
+	read_stream_end(stream);
+	relation_close(rel, AccessShareLock);
+
+	PG_RETURN_VOID();
+}
-- 
2.39.5



^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2025-04-13 17:45 ` Jose Luis Tallon <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: Jose Luis Tallon @ 2025-04-13 17:45 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>

On 12/4/25 18:59, Thomas Munro wrote:
> It's hard to know how to set io_workers=3.

Hmm... how about enabling the behaviour below when "io_workers=auto"
(the default)?

Sometimes being able to set this kind of parameter manually helps
tremendously with specific workloads... :S

> [snip]
> Here's a patch to replace that GUC with:
>
>        io_min_workers=1
>        io_max_workers=8
>        io_worker_idle_timeout=60s
>        io_worker_launch_interval=500ms

Great as defaults / backwards compat with io_workers=auto. Sounds more 
user-friendly to me, at least....

> [snip]
>
> Ideas, testing, flames etc welcome.

Logic seems sound, if a bit daunting for inexperienced users --- well, 
maybe just a bit more than it is now, but ISTM evolution should try and 
flatten novices' learning curve, right?

Just .02€, though.

Thanks,

-- 
Parkinson's Law: Work expands to fill the time allotted to it.


* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2025-05-24 19:20 ` Dmitry Dolgov <[email protected]>
  2025-05-26 02:17   ` Re: Automatically sizing the IO worker pool wenhui qiu <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  1 sibling, 2 replies; 24+ messages in thread

From: Dmitry Dolgov @ 2025-05-24 19:20 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

> On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> It's hard to know how to set io_workers=3.  If it's too small,
> io_method=worker's small submission queue overflows and it silently
> falls back to synchronous IO.  If it's too high, it generates a lot of
> pointless wakeups and scheduling overhead, which might be considered
> an independent problem or not, but having the right size pool
> certainly mitigates it.  Here's a patch to replace that GUC with:
>
>       io_min_workers=1
>       io_max_workers=8
>       io_worker_idle_timeout=60s
>       io_worker_launch_interval=500ms
>
> It grows the pool when a backlog is detected (better ideas for this
> logic welcome), and lets idle workers time out.

I like the idea. In fact, I've been pondering something like a
"smart" configuration for quite some time, and I'm convinced that a
similar approach needs to be applied to many performance-related GUCs.

Idle timeout and launch interval serving as a measure of sensitivity
makes sense to me, but growing the pool when a backlog (queue_depth >
nworkers, so even the slightest backlog?) is detected seems somewhat
arbitrary. From what I understand, the pool's growth velocity is
constant and does not depend on the worker demand (i.e. queue_depth)?
It may sound fancy, but I've got the impression it should be possible
to apply what's called a "low-pass filter" in control theory (sort of a
transfer function with an exponential decay) to smooth out the demand
and adjust the worker pool based on that.
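
To make the low-pass idea concrete, here is a minimal sketch of an
exponentially decaying average over queue_depth samples, written with
scaled fixed-point arithmetic so that it needs only shifts and adds.
Everything in it (DemandFilter, demand_filter_update, the two shift
constants and the choice of alpha = 1/8) is hypothetical and not taken
from the patch; it only illustrates the technique.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants, not from the patch. */
#define EMA_SCALE_SHIFT 8		/* fixed point: stored value = depth * 256 */
#define EMA_ALPHA_SHIFT 3		/* smoothing factor alpha = 1/8 */

typedef struct DemandFilter
{
	uint32_t	smoothed;		/* scaled by 1 << EMA_SCALE_SHIFT */
} DemandFilter;

/*
 * Feed one queue-depth sample into the filter and return the smoothed
 * depth, rounded down.  Computes
 *     smoothed += alpha * (sample - smoothed)
 * rearranged so that every intermediate value stays non-negative.
 */
static uint32_t
demand_filter_update(DemandFilter *f, uint32_t queue_depth)
{
	uint32_t	sample;

	/* The scaled sample must fit in 32 bits. */
	assert(queue_depth < (UINT32_C(1) << (32 - EMA_SCALE_SHIFT)));
	sample = queue_depth << EMA_SCALE_SHIFT;

	f->smoothed = f->smoothed
		- (f->smoothed >> EMA_ALPHA_SHIFT)
		+ (sample >> EMA_ALPHA_SHIFT);

	return f->smoothed >> EMA_SCALE_SHIFT;
}
```

With these assumed constants a single burst that fills a 64-entry queue
moves the smoothed value to only 8, and it decays again if the burst
does not repeat, so a pool sized from the smoothed value would react to
sustained demand rather than to isolated spikes.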

As a side note, it might be far-fetched, but there are instruments in
queueing theory to figure out how many workers are needed to guarantee a
certain low queueing probability, but for that one needs to have an
average arrival rate (in our case, average number of IO operations
dispatched to workers) and an average service rate (average number of IO
operations performed by workers).
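
As a rough illustration of such an instrument, the sketch below uses
the textbook Erlang C formula for an M/M/c queue to find the smallest
pool size whose queueing probability stays under a target.  It assumes
Poisson arrivals and exponentially distributed service times, which may
well not hold for I/O submissions, and the function names are
hypothetical, not part of the patch.

```c
#include <assert.h>

/*
 * Probability that an arriving job has to queue, for nworkers servers
 * and offered_load = arrival_rate / service_rate.  Uses the standard
 * Erlang B recurrence and then converts the result to Erlang C.
 */
static double
erlang_c(int nworkers, double offered_load)
{
	double		b = 1.0;		/* Erlang B with zero servers */

	assert(nworkers > 0 && offered_load >= 0.0);

	/* Unstable system: the queue grows without bound. */
	if (offered_load >= nworkers)
		return 1.0;

	for (int n = 1; n <= nworkers; n++)
		b = offered_load * b / (n + offered_load * b);

	return nworkers * b / (nworkers - offered_load * (1.0 - b));
}

/*
 * Smallest number of workers whose queueing probability is at most
 * target, or -1 if max_workers is not enough.
 */
static int
min_workers_for_target(double arrival_rate, double service_rate,
					   double target, int max_workers)
{
	double		offered_load;

	assert(service_rate > 0.0);
	offered_load = arrival_rate / service_rate;

	for (int n = 1; n <= max_workers; n++)
	{
		if (erlang_c(n, offered_load) <= target)
			return n;
	}
	return -1;
}
```

For example, with an offered load of 1 (arrival rate equal to the
per-worker service rate), two workers still leave a one-in-three chance
of queueing, and four are needed to get below 5%.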


* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
@ 2025-05-26 02:17   ` wenhui qiu <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: wenhui qiu @ 2025-05-26 02:17 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,
> On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> It's hard to know how to set io_workers=3.  If it's too small,
> io_method=worker's small submission queue overflows and it silently
> falls back to synchronous IO.  If it's too high, it generates a lot of
> pointless wakeups and scheduling overhead, which might be considered
> an independent problem or not, but having the right size pool
> certainly mitigates it.  Here's a patch to replace that GUC with:
>
>       io_min_workers=1
>       io_max_workers=8
>       io_worker_idle_timeout=60s
>       io_worker_launch_interval=500ms
>
> It grows the pool when a backlog is detected (better ideas for this
> logic welcome), and lets idle workers time out.
I also like the idea. Could we have settings along these lines?
io_workers = 3
io_max_workers = cpu/4
io_workers_oversubscribe = 3 (range 1-8)
io_workers * io_workers_oversubscribe <= io_max_workers

On Sun, May 25, 2025 at 3:20 AM Dmitry Dolgov <[email protected]> wrote:

> > On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3.  If it's too small,
> > io_method=worker's small submission queue overflows and it silently
> > falls back to synchronous IO.  If it's too high, it generates a lot of
> > pointless wakeups and scheduling overhead, which might be considered
> > an independent problem or not, but having the right size pool
> > certainly mitigates it.  Here's a patch to replace that GUC with:
> >
> >       io_min_workers=1
> >       io_max_workers=8
> >       io_worker_idle_timeout=60s
> >       io_worker_launch_interval=500ms
> >
> > It grows the pool when a backlog is detected (better ideas for this
> > logic welcome), and lets idle workers time out.
>
> I like the idea. In fact, I've been pondering something like a
> "smart" configuration for quite some time, and I'm convinced that a
> similar approach needs to be applied to many performance-related GUCs.
>
> Idle timeout and launch interval serving as a measure of sensitivity
> makes sense to me, but growing the pool when a backlog (queue_depth >
> nworkers, so even the slightest backlog?) is detected seems somewhat
> arbitrary. From what I understand, the pool's growth velocity is
> constant and does not depend on the worker demand (i.e. queue_depth)?
> It may sound fancy, but I've got the impression it should be possible
> to apply what's called a "low-pass filter" in control theory (sort of a
> transfer function with an exponential decay) to smooth out the demand
> and adjust the worker pool based on that.
>
> As a side note, it might be far-fetched, but there are instruments in
> queueing theory to figure out how many workers are needed to guarantee a
> certain low queueing probability, but for that one needs to have an
> average arrival rate (in our case, average number of IO operations
> dispatched to workers) and an average service rate (average number of IO
> operations performed by workers).
>
>
>



* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
@ 2025-05-26 06:00   ` Thomas Munro <[email protected]>
  2025-05-26 22:54     ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  1 sibling, 2 replies; 24+ messages in thread

From: Thomas Munro @ 2025-05-26 06:00 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Sun, May 25, 2025 at 7:20 AM Dmitry Dolgov <[email protected]> wrote:
> > On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
> > It's hard to know how to set io_workers=3.  If it's too small,
> > io_method=worker's small submission queue overflows and it silently
> > falls back to synchronous IO.  If it's too high, it generates a lot of
> > pointless wakeups and scheduling overhead, which might be considered
> > an independent problem or not, but having the right size pool
> > certainly mitigates it.  Here's a patch to replace that GUC with:
> >
> >       io_min_workers=1
> >       io_max_workers=8
> >       io_worker_idle_timeout=60s
> >       io_worker_launch_interval=500ms
> >
> > It grows the pool when a backlog is detected (better ideas for this
> > logic welcome), and lets idle workers time out.
>
> I like the idea. In fact, I've been pondering something like a
> "smart" configuration for quite some time, and I'm convinced that a
> similar approach needs to be applied to many performance-related GUCs.
>
> Idle timeout and launch interval serving as a measure of sensitivity
> makes sense to me, but growing the pool when a backlog (queue_depth >
> nworkers, so even the slightest backlog?) is detected seems somewhat
> arbitrary. From what I understand, the pool's growth velocity is
> constant and does not depend on the worker demand (i.e. queue_depth)?
> It may sound fancy, but I've got the impression it should be possible
> to apply what's called a "low-pass filter" in control theory (sort of a
> transfer function with an exponential decay) to smooth out the demand
> and adjust the worker pool based on that.
>
> As a side note, it might be far-fetched, but there are instruments in
> queueing theory to figure out how many workers are needed to guarantee a
> certain low queueing probability, but for that one needs to have an
> average arrival rate (in our case, average number of IO operations
> dispatched to workers) and an average service rate (average number of IO
> operations performed by workers).

Hi Dmitry,

Thanks for looking, and yeah these are definitely the right sort of
questions.  I will be both unsurprised and delighted if someone can
bring some more science to this problem.  I did read up on Erlang's
formula C ("This formula is used to determine the number of agents or
customer service representatives needed to staff a call centre, for a
specified desired probability of queuing" according to Wikipedia) and
a bunch of related textbook stuff.  And yeah I had a bunch of
exponential moving averages of various values using scaled fixed point
arithmetic (just a bunch of shifts and adds) to smooth inputs, in
various attempts.  But ... I'm not even sure if we can say that our
I/O arrivals have a Poisson distribution, since they are not all
independent.  I tried more things too, while I was still unsure what I
should even be optimising for.  My current answer to that is: low
latency with low variance, as seen with io_uring.

In this version I went back to basics and built something that looks
more like the controls of a classic process/thread pool (think Apache)
or connection pool (think JDBC), with a couple of additions based on
intuition: (1) a launch interval, which acts as a bit of damping
against overshooting on brief bursts that are too far apart, and (2)
the queue length > workers * k as a simple way to determine that
latency is being introduced by not having enough workers.  Perhaps
there is a good way to compute an adaptive value for k with some fancy
theories, but k=1 seems to have *some* basis: that's the lowest value
at which the pool is too small and *certainly* introducing latency, and
any lower constant is harder to defend because we don't know how many
workers are already awake and about to consume tasks.  Something from
queuing theory might provide an adaptive value, but in the end, I
figured we really just want to know if the queue is growing, i.e. in
danger of overflowing (note: the queue is small!  64, and not
currently changeable, more on that later, and the overflow behaviour
is synchronous I/O as back-pressure).  You seem to be suggesting that
k=1 sounds too low, not too high, but there is that separate
time-based defence against overshoot in response to rare bursts.
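
The two additions described above can be sketched as a single
predicate.  This is only an illustration under stated assumptions: the
PoolState struct, pool_should_grow() and the millisecond clock values
are hypothetical and are not the patch's
pgaio_worker_consider_new_worker() logic, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the pool; not a structure from the patch. */
typedef struct PoolState
{
	int			nworkers;			/* workers currently running */
	int			max_workers;		/* cf. io_max_workers */
	int64_t		last_launch_ms;		/* time of the previous launch */
	int64_t		launch_interval_ms; /* cf. io_worker_launch_interval */
} PoolState;

/*
 * Decide whether to ask for one more worker.  Growth requires all of:
 * room below the maximum, a backlog of more than nworkers * k queued
 * IOs, and at least launch_interval_ms since the previous launch.
 */
static bool
pool_should_grow(const PoolState *pool, uint32_t queue_depth, int k,
				 int64_t now_ms)
{
	assert(k > 0 && pool->nworkers >= 0);

	/* Never exceed the configured maximum. */
	if (pool->nworkers >= pool->max_workers)
		return false;

	/* (2) Backlog test: queue length > workers * k. */
	if ((uint64_t) queue_depth <= (uint64_t) pool->nworkers * (uint64_t) k)
		return false;

	/* (1) Damping: do not react to bursts faster than the interval. */
	if (now_ms - pool->last_launch_ms < pool->launch_interval_ms)
		return false;

	return true;
}
```

With k=1 and two workers, a queue depth of 2 does not trigger growth
but a depth of 3 does, provided the launch interval has elapsed.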

You could get more certainty about jobs already about to be consumed
by a worker that is about to dequeue, by doing a lot more
bookkeeping: assigning them to workers on submission (separate states,
separate queues, various other ideas I guess).  But everything I tried
like that caused latency or latency variance to go up, because it
missed out on the chance for another worker to pick it up sooner
opportunistically.  This arrangement has the most stable and
predictable pool size and lowest avg latency and stddev(latency) I
have managed to come up with so far.  That said, we have plenty of
time to experiment with better ideas if you want to give it a shot or
propose concrete ideas, given that I missed v18 :-)
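
For readers who haven't looked at the code, the single shared queue
that any awake worker can pull from could be sketched roughly like
this.  It is a simplified stand-in with made-up names (SketchQueue,
sketch_queue_*), it ignores locking entirely, and it only mirrors the
power-of-two head/tail arithmetic; a false return from insert is the
point where the submitter would fall back to synchronous I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_QUEUE_SIZE 64	/* must be a power of two */

/* Hypothetical shared queue; not the struct used by the patch. */
typedef struct SketchQueue
{
	uint32_t	head;			/* next slot to write */
	uint32_t	tail;			/* next slot to read */
	uint32_t	ios[SKETCH_QUEUE_SIZE];
} SketchQueue;

/* Returns false if the queue is full; the caller then does the IO itself. */
static bool
sketch_queue_insert(SketchQueue *queue, uint32_t io)
{
	uint32_t	new_head;

	assert((SKETCH_QUEUE_SIZE & (SKETCH_QUEUE_SIZE - 1)) == 0);

	new_head = (queue->head + 1) & (SKETCH_QUEUE_SIZE - 1);
	if (new_head == queue->tail)
		return false;			/* full: one slot is always left unused */
	queue->ios[queue->head] = io;
	queue->head = new_head;
	return true;
}

/* Any worker may call this; returns false if there is nothing to do. */
static bool
sketch_queue_consume(SketchQueue *queue, uint32_t *io)
{
	if (queue->tail == queue->head)
		return false;			/* empty */
	*io = queue->ios[queue->tail];
	queue->tail = (queue->tail + 1) & (SKETCH_QUEUE_SIZE - 1);
	return true;
}

/* Number of queued IOs, valid across wraparound. */
static uint32_t
sketch_queue_depth(const SketchQueue *queue)
{
	return (queue->head - queue->tail) & (SKETCH_QUEUE_SIZE - 1);
}
```

Because nothing is assigned to a particular worker at insert time,
whichever worker gets to sketch_queue_consume() first takes the job,
which is the opportunistic pickup mentioned above.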

About control theory... yeah.  That's an interesting bag of tricks.
FWIW Melanie and (more recently) I have looked into textbook control
algorithms at a higher level of the I/O stack (and Melanie gave a talk
about other applications in eg VACUUM at pgconf.dev).  In
read_stream.c, where I/O demand is created, we've been trying to set
the desired I/O concurrency level and thus lookahead distance with
adaptive feedback.  We've tried a lot of stuff.  I hope we can share
some concept patches some time soon, well, maybe in this cycle.  Some
interesting recent experiments produced graphs that look a lot like
the ones in the book "Feedback Control for Computer Systems" (an easy
software-person book I found for people without an engineering/control
theory background where the problems match our world more closely, cf
typical texts that are about controlling motors and other mechanical
stuff...).  Experimental goals are: find the smallest concurrent
I/O request level (and thus lookahead distance and thus speculative
work done and buffers pinned) that keeps the I/O stall probability
near zero (and keep adapting, since other queries and applications are
sharing system I/O queues), and if that's not even possible, find the
highest concurrent I/O request level that doesn't cause extra latency
due to queuing in lower levels (I/O workers, kernel, ...,  disks).
That second part is quite hard.  In other words, if higher levels own
that problem and bring the adaptivity, then perhaps io_method=worker
can get away with being quite stupid.  Just a thought...

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2025-05-26 22:54     ` Thomas Munro <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: Thomas Munro @ 2025-05-26 22:54 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

BTW I would like to push 0001 and 0002 to master/18.  They are not
behaviour changes, they just fix up a bunch of inconsistent (0001) and
misleading (0002) variable naming and comments to reflect reality (in
AIO v1 the postmaster used to assign those I/O worker numbers, now the
postmaster has its own array of slots to track them that is *not*
aligned with the ID numbers/slots in shared memory ie those numbers
you see in the ps status, so that's bound to confuse people
maintaining this code).  I just happened to notice that when working
on this dynamic worker pool stuff.  If there are no objections I will
go ahead and do that soon.

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2025-05-27 17:55     ` Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Dmitry Dolgov @ 2025-05-27 17:55 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Mon, May 26, 2025, 8:01 AM Thomas Munro <[email protected]> wrote:

> But ... I'm not even sure if we can say that our
> I/O arrivals have a Poisson distribution, since they are not all
> independent.
>

Yeah, a good point, one has to be careful with assumptions about the
distribution -- from what I've read many processes in computer systems are
better described by a Pareto distribution. But the beauty of queuing theory
is that many results are independent of the distribution (not sure about
dependencies though).

> In this version I went back to basics and built something that looks
> more like the controls of a classic process/thread pool (think Apache)
> or connection pool (think JDBC), with a couple of additions based on
> intuition: (1) a launch interval, which acts as a bit of damping
> against overshooting on brief bursts that are too far apart, and (2)
> the queue length > workers * k as a simple way to determine that
> latency is being introduced by not having enough workers.  Perhaps
> there is a good way to compute an adaptive value for k with some fancy
> theories, but k=1 seems to have *some* basis: that's the lowest number at
> which the pool is too small and *certainly* introducing latency, but
> any lower constant is harder to defend because we don't know how many
> workers are already awake and about to consume tasks.  Something from
> queuing theory might provide an adaptive value, but in the end, I
> figured we really just want to know if the queue is growing ie in
> danger of overflowing (note: the queue is small!  64, and not
> currently changeable, more on that later, and the overflow behaviour
> is synchronous I/O as back-pressure).  You seem to be suggesting that
> k=1 sounds too low, not too high, but there is that separate
> time-based defence against overshoot in response to rare bursts.
>

I probably should have started with a statement that I find the current
approach reasonable, and I'm only curious if there is more to get out of it.
I haven't benchmarked the patch yet (I plan to get to it when I get back),
and can imagine practical considerations significantly impacting any
potential solution.

> About control theory... yeah.  That's an interesting bag of tricks.
> FWIW Melanie and (more recently) I have looked into textbook control
> algorithms at a higher level of the I/O stack (and Melanie gave a talk
> about other applications in eg VACUUM at pgconf.dev).  In
> read_stream.c, where I/O demand is created, we've been trying to set
> the desired I/O concurrency level and thus lookahead distance with
> adaptive feedback.  We've tried a lot of stuff.  I hope we can share
> some concept patches some time soon, well, maybe in this cycle.  Some
> interesting recent experiments produced graphs that look a lot like
> the ones in the book "Feedback Control for Computer Systems" (an easy
> software-person book I found for people without an engineering/control
> theory background where the problems match our world more closely, cf
> typical texts that are about controlling motors and other mechanical
> stuff...).  Experimental goals are: find the smallest concurrent
> I/O request level (and thus lookahead distance and thus speculative
> work done and buffers pinned) that keeps the I/O stall probability
> near zero (and keep adapting, since other queries and applications are
> sharing system I/O queues), and if that's not even possible, find the
> highest concurrent I/O request level that doesn't cause extra latency
> due to queuing in lower levels (I/O workers, kernel, ...,  disks).
> That second part is quite hard.  In other words, if higher levels own
> that problem and bring the adaptivity, then perhaps io_method=worker
> can get away with being quite stupid.  Just a thought...
>

Looking forward to it. And thanks for the reminder about the talk, I've
wanted to watch it for a long time, but somehow haven't managed to yet.

^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
@ 2025-07-12 05:08       ` Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Thomas Munro @ 2025-07-12 05:08 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, May 28, 2025 at 5:55 AM Dmitry Dolgov <[email protected]> wrote:
> I probably should have started with a statement that I find the current approach reasonable, and I'm only curious if there is more to get out of it. I haven't benchmarked the patch yet (I plan to get to it when I get back), and can imagine practical considerations significantly impacting any potential solution.

Here's a rebase.


Attachments:

  [text/x-patch] v2-0001-aio-Try-repeatedly-to-give-batched-IOs-to-workers.patch (1.9K, 2-v2-0001-aio-Try-repeatedly-to-give-batched-IOs-to-workers.patch)
  download | inline diff:
From fa7aac1bc9c0a47fbdbd9459424f08fa61b71ce2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Fri, 11 Apr 2025 21:17:26 +1200
Subject: [PATCH v2 1/2] aio: Try repeatedly to give batched IOs to workers.

Previously, when the submission queue was full we'd run all remaining
IOs in a batched submission synchronously.  Andres rightly pointed out
that we should really try again between synchronous IOs, since the
workers might have made progress in draining the queue.

Suggested-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 src/backend/storage/aio/method_worker.c | 30 ++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index bf8f77e6ff6..9a82d5f847d 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -282,12 +282,36 @@ pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 		SetLatch(wakeup);
 
 	/* Run whatever is left synchronously. */
-	if (nsync > 0)
+	for (int i = 0; i < nsync; ++i)
 	{
-		for (int i = 0; i < nsync; ++i)
+		wakeup = NULL;
+
+		/*
+		 * Between synchronous IO operations, try again to enqueue as many as
+		 * we can.
+		 */
+		if (i > 0)
 		{
-			pgaio_io_perform_synchronously(synchronous_ios[i]);
+			wakeup = NULL;
+
+			LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+			while (i < nsync &&
+				   pgaio_worker_submission_queue_insert(synchronous_ios[i]))
+			{
+				if (wakeup == NULL && (worker = pgaio_worker_choose_idle()) >= 0)
+					wakeup = io_worker_control->workers[worker].latch;
+				i++;
+			}
+			LWLockRelease(AioWorkerSubmissionQueueLock);
+
+			if (wakeup)
+				SetLatch(wakeup);
+
+			if (i == nsync)
+				break;
 		}
+
+		pgaio_io_perform_synchronously(synchronous_ios[i]);
 	}
 }
 
-- 
2.47.2



  [text/x-patch] v2-0002-aio-Adjust-IO-worker-pool-size-automatically.patch (33.9K, 3-v2-0002-aio-Adjust-IO-worker-pool-size-automatically.patch)
  download | inline diff:
From a0a5fff1f1d21c002bf68d36de9aff21bdf61783 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v2 2/2] aio: Adjust IO worker pool size automatically.

Replace the simple io_workers setting with:

  io_min_workers=1
  io_max_workers=8
  io_worker_idle_timeout=60s
  io_worker_launch_interval=500ms

The pool is automatically sized within the configured range according
to demand.

Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  70 ++-
 src/backend/postmaster/postmaster.c           |  64 ++-
 src/backend/storage/aio/method_worker.c       | 445 ++++++++++++++----
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_tables.c           |  46 +-
 src/backend/utils/misc/postgresql.conf.sample |   5 +-
 src/include/storage/io_worker.h               |   9 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 10 files changed, 535 insertions(+), 122 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c7acc0f182f..98532e55041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2787,16 +2787,76 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes to use. The default is
+         1. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes to use. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which idle I/O worker processes will exit, reducing the
+         size of the I/O worker pool towards the minimum.  The default
+         is 1 minute.
+         This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time between launching new I/O workers.  This can be used to avoid
+         sudden bursts of new I/O workers.  The default is 500ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index cca9b946e53..a5438fa079d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -408,6 +408,7 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_delay_until = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -1569,6 +1570,15 @@ DetermineSleepTime(void)
 	if (StartWorkerNeeded)
 		return 0;
 
+	/* If we need a new IO worker, defer until launch delay expires. */
+	if (pgaio_worker_test_new_worker_needed() &&
+		io_worker_count < io_max_workers)
+	{
+		if (io_worker_launch_delay_until == 0)
+			return 0;
+		next_wakeup = io_worker_launch_delay_until;
+	}
+
 	if (HaveCrashedWorker)
 	{
 		dlist_mutable_iter iter;
@@ -3750,6 +3760,15 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_CHANGE))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_adjust_io_workers() will run.
+		 */
+	}
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4355,8 +4374,9 @@ maybe_reap_io_worker(int pid)
 /*
  * Start or stop IO workers, to close the gap between the number of running
  * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
+ * the io_{min,max}_workers GUCs (by increasing and decreasing the number of
+ * workers) and requests to start a new one due to submission queue backlog,
+ * as well as workers terminating in response to errors (by starting
  * "replacement" workers).
  */
 static void
@@ -4385,8 +4405,16 @@ maybe_adjust_io_workers(void)
 
 	Assert(pmState < PM_WAIT_IO_WORKERS);
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	/* Cancel the launch delay when it expires to minimize clock access. */
+	if (io_worker_launch_delay_until != 0 &&
+		io_worker_launch_delay_until <= GetCurrentTimestamp())
+		io_worker_launch_delay_until = 0;
+
+	/* Not enough workers running? */
+	while (io_worker_launch_delay_until == 0 &&
+		   io_worker_count < io_max_workers &&
+		   ((io_worker_count < io_min_workers ||
+			 pgaio_worker_clear_new_worker_needed())))
 	{
 		PMChild    *child;
 		int			i;
@@ -4400,6 +4428,16 @@ maybe_adjust_io_workers(void)
 		if (i == MAX_IO_WORKERS)
 			elog(ERROR, "could not find a free IO worker slot");
 
+		/*
+		 * Apply launch delay even for failures to avoid retrying too fast on
+		 * fork() failure, but not while we're still building the minimum pool
+		 * size.
+		 */
+		if (io_worker_count >= io_min_workers)
+			io_worker_launch_delay_until =
+				TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+											io_worker_launch_interval);
+
 		/* Try to launch one. */
 		child = StartChildProcess(B_IO_WORKER);
 		if (child != NULL)
@@ -4411,19 +4449,11 @@ maybe_adjust_io_workers(void)
 			break;				/* try again next time */
 	}
 
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
-		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
-		}
-	}
+	/*
+	 * If there are too many running because io_max_workers changed, that will
+	 * be handled by the IO workers themselves so they can shut down in
+	 * preferred order.
+	 */
 }
 
 
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 9a82d5f847d..6d3f5289e18 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,10 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * When a batch of IOs is submitted, the lowest numbered idle worker is woken
+ * up.  If it sees more work in the queue it wakes a peer to help, and so on
+ * in a chain.  When a backlog is detected, the pool size is increased.  The
+ * highest numbered worker times out after a period of inactivity.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/injection_point.h"
@@ -47,10 +50,8 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
-
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Saturation for stats counters used to estimate wakeup:work ratio. */
+#define PGAIO_WORKER_STATS_MAX 64
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,17 +64,25 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
 typedef struct PgAioWorkerControl
 {
+	/* Seen by postmaster */
+	volatile bool new_worker_needed;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
 	uint64		idle_worker_mask;
+
+	/* Protected by AioWorkerControlLock. */
+	uint64		worker_set;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
-
 static size_t pgaio_worker_shmem_size(void);
 static void pgaio_worker_shmem_init(bool first_time);
 
@@ -91,11 +100,14 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 1;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 500;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
@@ -152,37 +164,172 @@ pgaio_worker_shmem_init(bool first_time)
 						&found);
 	if (!found)
 	{
+		io_worker_control->new_worker_needed = false;
+		io_worker_control->worker_set = 0;
 		io_worker_control->idle_worker_mask = 0;
 		for (int i = 0; i < MAX_IO_WORKERS; ++i)
-		{
-			io_worker_control->workers[i].latch = NULL;
-			io_worker_control->workers[i].in_use = false;
-		}
+			io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
 	}
 }
 
+static void
+pgaio_worker_consider_new_worker(uint32 queue_depth)
+{
+	/*
+	 * This is called from sites that don't hold AioWorkerControlLock, but
+	 * nworkers changes infrequently and an up to date value is not required
+	 * for this heuristic purpose.
+	 */
+	if (!io_worker_control->new_worker_needed &&
+		queue_depth >= io_worker_control->nworkers)
+	{
+		io_worker_control->new_worker_needed = true;
+		SendPostmasterSignal(PMSIGNAL_IO_WORKER_CHANGE);
+	}
+}
+
+/*
+ * Called by a worker when the queue is empty, to try to prevent a delayed
+ * reaction to a brief burst.  This races against the postmaster acting on the
+ * old value if it was recently set to true, but that's OK, the ordering would
+ * be indeterminate anyway even if we could use locks in the postmaster.
+ */
+static void
+pgaio_worker_cancel_new_worker(void)
+{
+	io_worker_control->new_worker_needed = false;
+}
+
+/*
+ * Called by the postmaster to check if a new worker is needed.
+ */
+bool
+pgaio_worker_test_new_worker_needed(void)
+{
+	return io_worker_control->new_worker_needed;
+}
+
+/*
+ * Called by the postmaster to check if a new worker is needed when it's ready
+ * to launch one, and clear the flag.
+ */
+bool
+pgaio_worker_clear_new_worker_needed(void)
+{
+	bool		result;
+
+	result = io_worker_control->new_worker_needed;
+	if (result)
+		io_worker_control->new_worker_needed = false;
+
+	return result;
+}
+
+static uint64
+pgaio_worker_mask(int worker)
+{
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_worker_add(uint64 *set, int worker)
+{
+	*set |= pgaio_worker_mask(worker);
+}
+
+static void
+pgaio_worker_remove(uint64 *set, int worker)
+{
+	*set &= ~pgaio_worker_mask(worker);
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_worker_in(uint64 set, int worker)
+{
+	return (set & pgaio_worker_mask(worker)) != 0;
+}
+#endif
+
+static uint64
+pgaio_worker_highest(uint64 set)
+{
+	return pg_leftmost_one_pos64(set);
+}
+
+static uint64
+pgaio_worker_lowest(uint64 set)
+{
+	return pg_rightmost_one_pos64(set);
+}
+
+static int
+pgaio_worker_pop(uint64 *set)
+{
+	int			worker;
+
+	Assert(*set != 0);
+	worker = pgaio_worker_lowest(*set);
+	pgaio_worker_remove(set, worker);
+	return worker;
+}
+
 static int
 pgaio_worker_choose_idle(void)
 {
+	uint64		idle_worker_mask;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	/*
+	 * Workers only wake higher numbered workers, to try to encourage an
+	 * ordering of wakeup:work ratios, reducing spurious wakeups in lower
+	 * numbered workers.
+	 */
+	idle_worker_mask = io_worker_control->idle_worker_mask;
+	if (MyIoWorkerId != -1)
+		idle_worker_mask &= ~(pgaio_worker_mask(MyIoWorkerId) - 1);
+
+	if (idle_worker_mask == 0)
 		return -1;
 
 	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	worker = pgaio_worker_lowest(idle_worker_mask);
+	pgaio_worker_remove(&io_worker_control->idle_worker_mask, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_worker_mask. That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -204,6 +351,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return UINT32_MAX;		/* empty */
@@ -220,6 +369,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -244,9 +395,9 @@ static void
 pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle *synchronous_ios[PGAIO_SUBMIT_BATCH_SIZE];
+	uint32		queue_depth;
+	int			worker = -1;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -261,51 +412,48 @@ pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 			 * we can to workers, to maximize concurrency.
 			 */
 			synchronous_ios[nsync++] = staged_ios[i];
-			continue;
 		}
-
-		if (wakeup == NULL)
+		else if (worker == -1)
 		{
 			/* Choose an idle worker to wake up if we haven't already. */
 			worker = pgaio_worker_choose_idle();
-			if (worker >= 0)
-				wakeup = io_worker_control->workers[worker].latch;
 
 			pgaio_debug_io(DEBUG4, staged_ios[i],
 						   "choosing worker %d",
 						   worker);
 		}
 	}
+	queue_depth = pgaio_worker_submission_queue_depth();
 	LWLockRelease(AioWorkerSubmissionQueueLock);
 
-	if (wakeup)
-		SetLatch(wakeup);
+	if (worker != -1)
+		pgaio_worker_wake(worker);
+	else
+		pgaio_worker_consider_new_worker(queue_depth);
 
 	/* Run whatever is left synchronously. */
 	for (int i = 0; i < nsync; ++i)
 	{
-		wakeup = NULL;
-
 		/*
 		 * Between synchronous IO operations, try again to enqueue as many as
 		 * we can.
 		 */
 		if (i > 0)
 		{
-			wakeup = NULL;
+			worker = -1;
 
 			LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 			while (i < nsync &&
 				   pgaio_worker_submission_queue_insert(synchronous_ios[i]))
 			{
-				if (wakeup == NULL && (worker = pgaio_worker_choose_idle()) >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
+				if (worker == -1)
+					worker = pgaio_worker_choose_idle();
 				i++;
 			}
 			LWLockRelease(AioWorkerSubmissionQueueLock);
 
-			if (wakeup)
-				SetLatch(wakeup);
+			if (worker != -1)
+				pgaio_worker_wake(worker);
 
 			if (i == nsync)
 				break;
@@ -337,14 +485,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	uint64		notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_worker_remove(&io_worker_control->idle_worker_mask, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_worker_in(io_worker_control->worker_set, MyIoWorkerId));
+	pgaio_worker_remove(&io_worker_control->worker_set, MyIoWorkerId);
+	notify_set = io_worker_control->worker_set;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pg_popcount64(io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
+	while (notify_set != 0)
+		pgaio_worker_wake(pgaio_worker_pop(&notify_set));
 }
 
 /*
@@ -354,33 +515,37 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
-	MyIoWorkerId = -1;
+	uint64		worker_set_inverted;
+	uint64		old_worker_set;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	MyIoWorkerId = -1;
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	worker_set_inverted = ~io_worker_control->worker_set;
+	if (worker_set_inverted != 0)
 	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
+		MyIoWorkerId = pgaio_worker_lowest(worker_set_inverted);
+		if (MyIoWorkerId >= MAX_IO_WORKERS)
+			MyIoWorkerId = -1;
 	}
-
 	if (MyIoWorkerId == -1)
 		elog(ERROR, "couldn't find a free worker slot");
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
+
+	old_worker_set = io_worker_control->worker_set;
+	Assert(!pgaio_worker_in(old_worker_set, MyIoWorkerId));
+	pgaio_worker_add(&io_worker_control->worker_set, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(pg_popcount64(io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
+	while (old_worker_set != 0)
+		pgaio_worker_wake(pgaio_worker_pop(&old_worker_set));
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -406,14 +571,47 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	uint64		worker_set;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	worker_set = io_worker_control->worker_set;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_worker_highest(worker_set))
+		return false;
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			ios = 0;
+	int			wakeups = 0;
 
 	MyBackendType = B_IO_WORKER;
 	AuxiliaryProcessMainCommon();
@@ -482,10 +680,8 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		uint32		queue_depth;
+		int			worker = -1;
 
 		/*
 		 * Try to get a job to do.
@@ -494,40 +690,48 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		 * to ensure that we don't see an outdated data in the handle.
 		 */
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-		if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
+		io_index = pgaio_worker_submission_queue_consume();
+		queue_depth = pgaio_worker_submission_queue_depth();
+		if (io_index == UINT32_MAX)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_worker_add(&io_worker_control->idle_worker_mask,
+							 MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_worker_remove(&io_worker_control->idle_worker_mask,
+								MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
-			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
-			}
+			/*
+			 * See if we should wake up a peer.  Only do this if this worker
+			 * is not experiencing spurious wakeups itself, to end a chain of
+			 * wasted scheduling.
+			 */
+			if (queue_depth > 0 && wakeups <= ios)
+				worker = pgaio_worker_choose_idle();
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
+		else if (wakeups <= ios)
+			pgaio_worker_consider_new_worker(queue_depth);
 
 		if (io_index != UINT32_MAX)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++ios == PGAIO_WORKER_STATS_MAX)
+			{
+				ios /= 2;
+				wakeups /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -593,8 +797,69 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker if pending. */
+			pgaio_worker_cancel_new_worker();
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* On first sleep, compute absolute timeout. */
+				if (idle_timeout_abs == 0)
+				{
+					idle_timeout_abs =
+						TimestampTzPlusMilliseconds(now,
+													io_worker_idle_timeout);
+					timeout_guc_used = io_worker_idle_timeout;
+				}
+
+				/*
+				 * All workers maintain the absolute timeout value, but only
+				 * the highest worker can actually time out and only if
+				 * io_min_workers is exceeded.  All others wait only for
+				 * explicit wakeups caused by queue insertion, wakeup
+				 * propagation, change of pool size (possibly making them
+				 * highest), or GUC reload.
+				 */
+				if (pgaio_worker_can_timeout())
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now,
+														idle_timeout_abs);
+				else
+					timeout_ms = -1;
+			}
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++wakeups == PGAIO_WORKER_STATS_MAX)
+				{
+					ios /= 2;
+					wakeups /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -604,6 +869,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4da68312b5f..c6c8107fe33 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -352,6 +352,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d14b1678e7f..ecb16facb67 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3306,14 +3306,52 @@ struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"io_workers",
+		{"io_max_workers",
 			PGC_SIGHUP,
 			RESOURCES_IO,
-			gettext_noop("Number of IO worker processes, for io_method=worker."),
+			gettext_noop("Maximum number of IO worker processes, for io_method=worker."),
 			NULL,
 		},
-		&io_workers,
-		3, 1, MAX_IO_WORKERS,
+		&io_max_workers,
+		8, 1, MAX_IO_WORKERS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"io_min_workers",
+			PGC_SIGHUP,
+			RESOURCES_IO,
+			gettext_noop("Minimum number of IO worker processes, for io_method=worker."),
+			NULL,
+		},
+		&io_min_workers,
+		1, 1, MAX_IO_WORKERS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"io_worker_idle_timeout",
+			PGC_SIGHUP,
+			RESOURCES_IO,
+			gettext_noop("Maximum idle time before IO workers exit, for io_method=worker."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&io_worker_idle_timeout,
+		60 * 1000, -1, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"io_worker_launch_interval",
+			PGC_SIGHUP,
+			RESOURCES_IO,
+			gettext_noop("Minimum time between launching IO workers, for io_method=worker."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&io_worker_launch_interval,
+		500, 0, INT_MAX,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a9d8293474a..1da6345ad7a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -210,7 +210,10 @@
 					# can execute simultaneously
 					# -1 sets based on shared_buffers
 					# (change requires restart)
-#io_workers = 3				# 1-32;
+#io_min_workers = 1			# 1-32;
+#io_max_workers = 8			# 1-32;
+#io_worker_idle_timeout = 60s		# min 100ms
+#io_worker_launch_interval = 500ms	# min 0ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index 7bde7e89c8a..de9c80109e0 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,13 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_test_new_worker_needed(void);
+extern bool pgaio_worker_clear_new_worker_needed(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..c1801d08833 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, AioWorkerControl)
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 428aa3fd68a..2859a636349 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_CHANGE,	/* IO worker pool change */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index af5fae15ea7..a0252857798 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,23 +65,23 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		ok( $stderr =~
-			  /$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			  /$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating number of io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating number of io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
-- 
2.47.2
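
As a reading aid for the timeout rule in pgaio_worker_can_timeout()
above, here is a minimal standalone sketch.  It assumes the pool is a
64-bit set with one bit per worker ID; highest_worker() and
can_time_out() are hypothetical stand-ins written for illustration, not
the patch's own helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: highest worker ID present in the set, or -1 if empty. */
static int
highest_worker(uint64_t worker_set)
{
	for (int id = 63; id >= 0; id--)
	{
		if (worker_set & (UINT64_C(1) << id))
			return id;
	}
	return -1;
}

/*
 * Only the highest-numbered worker may time out, and only if its ID is at or
 * above min_workers.  That keeps IDs dense (0..N without gaps) and never
 * shrinks the pool below the minimum.
 */
static bool
can_time_out(uint64_t worker_set, int my_id, int min_workers)
{
	assert(my_id >= 0 && my_id < 64);

	if (my_id != highest_worker(worker_set))
		return false;
	if (my_id < min_workers)
		return false;
	return true;
}
```

With workers 0..2 running and a minimum of 1, only worker 2 gets a
finite sleep; once it exits, worker 1 becomes the highest and inherits
the right to time out.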




* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2025-07-30 10:14         ` Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-11 06:35           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  0 siblings, 2 replies; 24+ messages in thread

From: Dmitry Dolgov @ 2025-07-30 10:14 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

> On Sat, Jul 12, 2025 at 05:08:29PM +1200, Thomas Munro wrote:
> On Wed, May 28, 2025 at 5:55 AM Dmitry Dolgov <[email protected]> wrote:
> > I probably had to start with a statement that I find the current
> > approach reasonable, and I'm only curious if there is more to get
> > out of it. I haven't benchmarked the patch yet (plan getting to it
> > when I'll get back), and can imagine practical considerations
> > significantly impacting any potential solution.
>
> Here's a rebase.

Thanks. I was experimenting with this approach, and realized there aren't
many metrics exposed about workers and the IO queue so far. Since the
worker pool growth is based on the queue size and workers try to share
the load uniformly, it makes sense to have a system view to show those
numbers, let's say a system view for worker handles and a function to
get the current queue size? E.g. worker load in my testing varied quite
a bit, see the "Load distribution between workers" graph, which shows a
quick profiling run including currently running IO workers.

Regarding the worker pool growth approach, it sounds reasonable to me.
With a static number of workers one needs to somehow find a number
suitable for all types of workload, whereas with this patch one needs
only to fiddle with the launch interval to handle possible spikes. It
would be interesting to investigate how this approach would react to
different dynamics of the queue size. I've plotted one "spike" scenario
in the "Worker pool size response to queue depth" graph, where there is
a pretty artificial burst of IO, making the queue size look like a step
function. If I understand the patch implementation correctly, it would
respond linearly over time (green line); one could also think about
applying a first-order Butterworth low-pass filter to respond more
quickly but still smoothly (orange line).
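
To make the filter idea concrete, here is a minimal sketch of a
first-order Butterworth low-pass filter in its usual bilinear-transform
form.  It is only an illustration: the names are made up, the caller is
assumed to precompute k = tan(pi * cutoff / sample_rate), and how the
filtered queue depth would map to a pool size is left open.

```c
#include <assert.h>

/* State for y[n] = b * (x[n] + x[n-1]) + a * y[n-1]. */
typedef struct LowPass
{
	double		b;			/* gain applied to x[n] + x[n-1] */
	double		a;			/* feedback weight on y[n-1] */
	double		prev_in;
	double		prev_out;
} LowPass;

/* k is the prewarped cutoff, tan(pi * cutoff / sample_rate). */
static void
lowpass_init(LowPass *f, double k)
{
	assert(k > 0.0);
	f->b = k / (k + 1.0);
	f->a = (1.0 - k) / (1.0 + k);
	f->prev_in = 0.0;
	f->prev_out = 0.0;
}

/* Feed one sample (e.g. the current queue depth), get the smoothed value. */
static double
lowpass_step(LowPass *f, double in)
{
	double		out = f->b * (in + f->prev_in) + f->a * f->prev_out;

	f->prev_in = in;
	f->prev_out = out;
	return out;
}
```

Fed a step input, the output rises quickly at first and then flattens
as it approaches the input level, without overshooting.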

But in reality the queue size would of course be much more volatile even
on stable workloads, like in "Queue depth over time" (one can see
general oscillation, as well as different modes, e.g. where data is in
the page cache vs where it isn't). Even more, there is a feedback loop
where increasing the number of workers would accelerate the queue size
decrease -- based on [1] the system utilization for M/M/k depends on the
arrival rate, processing rate and number of processors, where pretty
intuitively more processors reduce utilization. But alas, as you've
mentioned, this result holds for the Poisson distribution only.
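
For reference, the standard utilization formula for an M/M/k queue,
with arrival rate \lambda, per-server service rate \mu and k servers
(symbols chosen here for illustration):

```latex
\rho = \frac{\lambda}{k\,\mu}, \qquad \text{stable only if } \rho < 1
```

So for fixed \lambda and \mu, going from k to k+1 workers scales the
utilization by k/(k+1).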

Btw, I assume something similar could be done to other methods as well?
I'm not up to date on io_uring, can one change the ring depth on the
fly?

As a side note, I was trying to experiment with this patch using
dm-mapper's delay feature to introduce an arbitrarily large IO latency
and see how the IO queue grows. But strangely enough, even though the
pure IO latency was high, the queue growth was smaller than e.g. on
real hardware under the same conditions without any artificial delay.
Is there anything obvious I'm missing that could explain that?

[1]: Harchol-Balter, Mor. Performance modeling and design of computer
systems: queueing theory in action. Cambridge University Press, 2013.

Attachments:

  [image/png] load.png (25.9K, 2-load.png)

  [image/png] workers.png (38.9K, 3-workers.png)

  [image/png] queue.png (53.8K, 4-queue.png)


* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
@ 2025-08-04 05:30           ` Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Thomas Munro @ 2025-08-04 05:30 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, Jul 30, 2025 at 10:15 PM Dmitry Dolgov <[email protected]> wrote:
> Thanks. I was experimenting with this approach, and realized there aren't
> many metrics exposed about workers and the IO queue so far. Since the

Hmm.  You can almost infer the depth from the pg_aios view.  All IOs
in use are visible there, and the SUBMITTED ones are all either in the
queue, currently being executed by a worker, or being executed
synchronously by a regular backend because the queue was full and in
that case it just falls back to synchronous execution.  Perhaps we
just need to be able to distinguish those three cases in that view.
For the synchronous-in-submitter overflow case, I think f_sync should
really show 't', and I'll post a patch for that shortly.  For
"currently executing in a worker", I wonder if we could have an "info"
column that queries a new optional callback
pgaio_iomethod_ops->get_info(ioh) where worker mode could return
"worker 3", or something like that.

> worker pool growth is based on the queue size and workers try to share
> the load uniformly, it makes sense to have a system view to show those

Actually it's not uniform: it tries to wake up the lowest numbered
worker that advertises itself as idle, in that little bitmap of idle
workers.  So if you look in htop you'll see that worker 0 is the most
busy, then worker 1, etc.  Only if they are all quite busy does it
become almost uniform, which probably implies you've hit
io_max_workers and should probably set it higher (or without this
patch, you should probably just increase io_workers manually, assuming
your I/O hardware can take more).
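
A minimal sketch of that selection rule, assuming the idle set is a
64-bit bitmap with one bit per worker.  The function names are invented
for illustration and are not the ones in method_worker.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lowest worker ID whose idle bit is set, or -1 if no worker is idle. */
static int
lowest_idle_worker(uint64_t idle_mask)
{
	for (int id = 0; id < 64; id++)
	{
		if (idle_mask & (UINT64_C(1) << id))
			return id;
	}
	return -1;
}

/*
 * Pick the worker to wake and clear its idle bit, so the next caller
 * moves on to the next-lowest idle worker.  Returns -1 if nobody is idle.
 */
static int
choose_worker_to_wake(uint64_t *idle_mask)
{
	int			id;

	assert(idle_mask != NULL);
	id = lowest_idle_worker(*idle_mask);
	if (id >= 0)
		*idle_mask &= ~(UINT64_C(1) << id);
	return id;
}
```

Because a worker that finishes a job sets its bit again, low IDs keep
winning while the load is light, which is why worker 0 looks busiest.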

Originally I made it like that to give higher numbered workers a
chance to time out (anticipating this patch).  Later I found another
reason to do it that way:

When I tried uniform distribution using atomic_fetch_add(&distributor,
1) % nworkers to select the worker to wake up, avg(latency) and
stddev(latency) were both higher for simple tests like the one
attached to the first message, when running several copies of it
concurrently.  The concentrate-into-lowest-numbers design benefits
from latch collapsing and allows the busier workers to avoid going
back to sleep when they could immediately pick up a new job.  I didn't
change that in this patch, though I did tweak the "fan out" logic a
bit, after some experimentation on several machines where I realised
the code in master/18 is a bit overenthusiastic about that and has a
higher spurious wakeup ratio (something this patch actually measures
and tries to reduce).
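
For comparison, the uniform alternative described above can be sketched
with C11 atomics as follows.  This is a reconstruction from the
one-line description, not the code that was actually benchmarked, and
choose_round_robin() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Spread wakeups evenly: every submission bumps a shared counter and
 * wakes worker (counter mod nworkers), whether or not it is idle.
 */
static unsigned
choose_round_robin(atomic_uint *distributor, unsigned nworkers)
{
	assert(nworkers > 0);
	return atomic_fetch_add(distributor, 1) % nworkers;
}
```

Unlike the lowest-idle rule, this never consults the idle bitmap, so a
busy worker can be chosen while an idle one keeps sleeping.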

Here is one of my less successful attempts to do a round-robin system
that tries to adjust the pool size with more engineering, but it was
consistently worse on those latency statistics compared to this
approach, and wasn't even as good at finding a good pool size, so
eventually I realised that it was a dead end and my original
work-concentrating concept was better:

https://github.com/macdice/postgres/tree/io-worker-pool

FWIW the patch in this thread is in this public branch:

https://github.com/macdice/postgres/tree/io-worker-pool-3

> Regarding the worker pool growth approach, it sounds reasonable to me.

Great to hear.  I wonder what other kinds of testing we should do to
validate this, but I am feeling quite confident about this patch and
thinking it should probably go in sooner rather than later.

> With a static number of workers one needs to somehow find a number
> suitable for all types of workload, whereas with this patch one needs
> only to fiddle with the launch interval to handle possible spikes. It
> would be interesting to investigate how this approach would react to
> different dynamics of the queue size. I've plotted one "spike" scenario
> in the "Worker pool size response to queue depth" graph, where there is
> a pretty artificial burst of IO, making the queue size look like a step
> function. If I understand the patch implementation correctly, it would
> respond linearly over time (green line); one could also think about
> applying a first-order Butterworth low-pass filter to respond more
> quickly but still smoothly (orange line).

Interesting.

There is only one kind of smoothing in the patch currently, relating
to the pool size going down.  It models spurious latch wakeups in an
exponentially decaying ratio of wakeups:work.  That's the only way I
could find to deal with the inherent sloppiness of the wakeup
mechanism with a shared queue: when you wake the lowest numbered idle
worker as of some moment in time, it might lose the race against an
even lower numbered worker that finishes its current job and steals
the new job.  When workers steal jobs, latency decreases, which is
good, so instead of preventing it I eventually figured out that we
should measure it, smooth it, and use it to limit wakeup propagation.
I wonder if that naturally produces curves a bit like your butterworth
line when it's going down already, but I'm not sure.
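
The bookkeeping behind that decaying ratio can be sketched like this,
modelled on the counters in the posted patch.  STATS_MAX is a
placeholder value chosen for illustration (the real
PGAIO_WORKER_STATS_MAX may differ), and the struct and function names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define STATS_MAX 64			/* placeholder, not the patch's value */

typedef struct WakeupStats
{
	int			ios;		/* IOs this worker actually executed */
	int			wakeups;	/* latch wakeups this worker received */
} WakeupStats;

/* Halve both counters so that older history gradually loses weight. */
static void
stats_decay(WakeupStats *s)
{
	s->ios /= 2;
	s->wakeups /= 2;
}

static void
note_io(WakeupStats *s)
{
	if (++s->ios == STATS_MAX)
		stats_decay(s);
}

static void
note_wakeup(WakeupStats *s)
{
	if (++s->wakeups == STATS_MAX)
		stats_decay(s);
}

/* Pass wakeups on to a peer only while wakeups have been paying off. */
static bool
may_propagate_wakeup(const WakeupStats *s)
{
	return s->wakeups <= s->ios;
}
```

A worker that keeps losing races sees wakeups grow past ios and stops
waking peers, which ends the chain of wasted scheduling.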

As for the curve on the way up, hmm, I'm not sure.  Yes, it goes up
linearly and is limited by the launch delay, but I was thinking of
that only as the way it grows when the *variation* in workload changes
over a long time frame.  In other words, maybe it's not so important
how exactly it grows, it's more important that it achieves a steady
state that can handle the oscillations and spikes in your workload.
The idle timeout creates that steady state by holding the current pool
size for quite a while, so that it can handle your quieter and busier
moments immediately without having to adjust the pool size.
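
As a rough model of the way up, assuming a backlog that persists and at
most one launch per io_worker_launch_interval (an idealisation, not the
patch's actual control flow; pool_size_after() is a made-up name):

```c
#include <assert.h>

/*
 * Pool size after backlog_ms of sustained backlog, starting from "start"
 * workers: one extra worker per launch interval, capped at max_workers.
 * An interval of 0 means launches are not rate limited at all.
 */
static int
pool_size_after(int start, int max_workers, int launch_interval_ms,
				int backlog_ms)
{
	int			launched;

	assert(start >= 1 && start <= max_workers);
	assert(launch_interval_ms >= 0 && backlog_ms >= 0);

	if (launch_interval_ms == 0)
		return max_workers;

	launched = backlog_ms / launch_interval_ms;
	if (launched > max_workers - start)
		return max_workers;
	return start + launched;
}
```

With the proposed defaults (1 to 8 workers, 500ms interval) this model
needs about 3.5 seconds of sustained backlog to reach the cap, after
which the idle timeout holds the size steady.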

In that other failed attempt I tried to model that more explicitly,
with "active" workers and "spare" workers, with the active set sized
for average demand with uniform wakeups and the spare set sized for
some number of standard deviations that are woken up only when the
queue is high, but I could never really make it work well...

> But in reality the queue size would of course be much more volatile even
> on stable workloads, like in "Queue depth over time" (one can see
> general oscillation, as well as different modes, e.g. where data is in
> the page cache vs where it isn't). Even more, there is a feedback loop
> where increasing the number of workers would accelerate the queue size
> decrease -- based on [1] the system utilization for M/M/k depends on the
> arrival rate, processing rate and number of processors, where pretty
> intuitively more processors reduce utilization. But alas, as you've
> mentioned, this result holds for the Poisson distribution only.

> Btw, I assume something similar could be done to other methods as well?
> I'm not up to date on io_uring, can one change the ring depth on the
> fly?

Each backend's io_uring submission queue is configured at startup and
not changeable later, but it is sized for the maximum possible number
that each backend can submit, io_max_concurrency, which corresponds to
the backend's portion of the array of PgAioHandle objects that is
fixed.  I suppose you could say that each backend's submission queue
can't overflow at that level, because it's perfectly sized and not
shared with other backends, or to put it another way, the equivalent
of overflow is we won't try to submit more IOs than that.

Worker mode has a shared submission queue, but falls back to
synchronous execution if it's full, which is a bit weird as it makes
your IOs jump the queue in a sense, and that is a good reason to want
this patch so that the pool can try to find the size that avoids that
instead of leaving the user in the dark.
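
To illustrate that overflow behaviour, here is a minimal sketch of a
bounded submission queue whose submitter keeps whatever does not fit.
The queue layout, the tiny capacity and the names are assumptions for
the demo, not the real method_worker.c structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 4			/* hypothetical capacity, kept tiny for the demo */

typedef struct SubmissionQueue
{
	uint32_t	head;		/* next slot to write */
	uint32_t	tail;		/* next slot to read */
	uint32_t	ios[QUEUE_SIZE];
} SubmissionQueue;

/* Try to enqueue one IO; returns false if the queue is full. */
static bool
queue_insert(SubmissionQueue *q, uint32_t io)
{
	if (q->head - q->tail == QUEUE_SIZE)
		return false;
	q->ios[q->head % QUEUE_SIZE] = io;
	q->head++;
	return true;
}

/*
 * Enqueue a batch.  IOs that do not fit are counted as "synchronous":
 * the submitter would have to execute them itself, ahead of the IOs
 * that are already waiting in the queue.
 */
static int
submit_batch(SubmissionQueue *q, const uint32_t *ios, int nios)
{
	int			nsync = 0;

	assert(nios >= 0);
	for (int i = 0; i < nios; i++)
	{
		if (!queue_insert(q, ios[i]))
			nsync++;
	}
	return nsync;
}
```

Submitting six IOs into the empty four-slot queue leaves two for the
submitter, which is the silent fallback the pool sizing tries to avoid.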

As for the equivalent of pool sizing inside io_uring (and maybe other
AIO systems in other kernels), hmm.... in the absolute best cases
worker threads can be skipped completely, eg for direct I/O queued
straight to the device, but when used, I guess they have pretty
different economics.  A kernel can start a thread just by allocating a
bit of memory and sticking it in a queue, and can also wake them (move
them to a different scheduler queue) cheaply, but we have to fork a
giant process that has to open all the files and build up its caches
etc.  So I think they just start threads on demand immediately on need
without damping, with some kind of short grace period just to avoid
those smaller costs being repeated.  I'm no expert on those internal
details, but our worker system clearly needs all this damping and
steady state discovery heuristics due to the higher overheads and
sloppy wakeups.

Thinking more about our comparatively heavyweight I/O workers, there
must also be affinity opportunities.  If you somehow tended to use the
same workers for a given database in a cluster with multiple active
databases, then workers might accumulate fewer open file descriptors
and SMgrRelation cache objects.  If you had per-NUMA node pools and
queues then you might be able to reduce contention, and maybe also
cache line ping-pong on buffer headers considering that the submitter
dirties the header, then the worker does (in the completion callback),
and then the submitter accesses it again.  I haven't investigated
that.

> As a side note, I was trying to experiment with this patch using
> dm-mapper's delay feature to introduce an arbitrarily large IO latency
> and see how the IO queue grows. But strangely enough, even though the
> pure IO latency was high, the queue growth was smaller than e.g. on
> real hardware under the same conditions without any artificial delay.
> Is there anything obvious I'm missing that could explain that?

Could it be alternating full and almost empty due to method_worker.c's
fallback to synchronous on overflow, which slows the submission down,
or something like that, and then you're plotting an average depth that
is lower than you expected?  With the patch I'll share shortly to make
pg_aios show a useful f_sync value it might be more obvious...

About dm-mapper delays, I actually found it useful to hack up worker
mode itself to simulate storage behaviours, for example swamped local
disks or cloud storage with deep queues and no back pressure but
artificial IOPS and bandwidth caps, etc.  I was thinking about
developing some proper settings to help with that kind of research:
debug_io_worker_queue_size (changeable at runtime),
debug_io_max_worker_queue_size (allocated at startup),
debug_io_worker_{latency,bandwidth,iops} to introduce calculated
sleeps, and debug_io_worker_overflow_policy=synchronous|wait so that
you can disable the synchronous fallback that confuses matters.
That'd be more convenient, portable and flexible than dm-mapper tricks
I guess.  I'd been imagining that as a tool to investigate higher
level work on feedback control for read_stream.c as mentioned, but
come to think of it, it could also be useful to understand things
about the worker pool itself.  That's vapourware though, for myself I
just used dirty hacks last time I was working on that stuff.  In other
words, patches are most welcome if you're interested in that kind of
thing.  I am a bit tied up with multithreading at the moment and time
grows short.  I will come back to that problem in a little while and
that patch is on my list as part of the infrastructure needed to prove
things about the I/O stream feedback work I hope to share later...


* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-03-28 09:31             ` Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Thomas Munro @ 2026-03-28 09:31 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Here is a rebase.  I would like to push these shortly if there are no
objections.  I propose 8 as the default upper limit.


Attachments:

  [text/x-patch] v3-0001-aio-Simplify-pgaio_worker_submit.patch (1.8K, 2-v3-0001-aio-Simplify-pgaio_worker_submit.patch)
From ea67807f7ba32f02943a78f2a15e20748df4dd14 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Wed, 18 Mar 2026 16:54:34 +1300
Subject: [PATCH v3 1/3] aio: Simplify pgaio_worker_submit().

Rename pgaio_worker_submit_internal() to pgaio_worker_submit().  The
extra wrapper didn't serve any useful purpose.
---
 src/backend/storage/aio/method_worker.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index efe38e9f113..e24357a7a0a 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -239,8 +239,8 @@ pgaio_worker_needs_synchronous_execution(PgAioHandle *ioh)
 		|| !pgaio_io_can_reopen(ioh);
 }
 
-static void
-pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
+static int
+pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
@@ -249,6 +249,9 @@ pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
+	for (int i = 0; i < num_staged_ios; i++)
+		pgaio_io_prepare_submit(staged_ios[i]);
+
 	if (LWLockConditionalAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE))
 	{
 		for (int i = 0; i < num_staged_ios; ++i)
@@ -299,19 +302,6 @@ pgaio_worker_submit_internal(int num_staged_ios, PgAioHandle **staged_ios)
 			pgaio_io_perform_synchronously(synchronous_ios[i]);
 		}
 	}
-}
-
-static int
-pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
-{
-	for (int i = 0; i < num_staged_ios; i++)
-	{
-		PgAioHandle *ioh = staged_ios[i];
-
-		pgaio_io_prepare_submit(ioh);
-	}
-
-	pgaio_worker_submit_internal(num_staged_ios, staged_ios);
 
 	return num_staged_ios;
 }
-- 
2.53.0



  [text/x-patch] v3-0002-aio-Improve-I-O-worker-behavior-on-full-queue.patch (1.8K, 3-v3-0002-aio-Improve-I-O-worker-behavior-on-full-queue.patch)
From 1ad41a1f07343fd676ee7bb9741dbf539889f50f Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Fri, 11 Apr 2025 21:17:26 +1200
Subject: [PATCH v3 2/3] aio: Improve I/O worker behavior on full queue.

Previously, when the submission queue was full we'd run all remaining
IOs in a batch synchronously.  Now we'll try again between synchronous
operations, because the I/O workers might have drained some of the
queue.

Suggested-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 src/backend/storage/aio/method_worker.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index e24357a7a0a..b1b0b6848a0 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -295,11 +295,29 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		SetLatch(wakeup);
 
 	/* Run whatever is left synchronously. */
-	if (nsync > 0)
+	while (nsync > 0)
 	{
-		for (int i = 0; i < nsync; ++i)
+		pgaio_io_perform_synchronously(*synchronous_ios++);
+		nsync--;
+
+		/* Between synchronous operations, try to enqueue again. */
+		if (nsync > 0)
 		{
-			pgaio_io_perform_synchronously(synchronous_ios[i]);
+			wakeup = NULL;
+			if (LWLockConditionalAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE))
+			{
+				while (nsync > 0 &&
+					   pgaio_worker_submission_queue_insert(*synchronous_ios))
+				{
+					synchronous_ios++;
+					nsync--;
+					if (wakeup == NULL && (worker = pgaio_worker_choose_idle()) >= 0)
+						wakeup = io_worker_control->workers[worker].latch;
+				}
+				LWLockRelease(AioWorkerSubmissionQueueLock);
+			}
+			if (wakeup)
+				SetLatch(wakeup);
 		}
 	}
 
-- 
2.53.0
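
The retry behavior described in the 0002 commit message can be illustrated
with a toy model.  This is only a sketch under stated assumptions, not the
code from method_worker.c: ToyQueue, toy_queue_insert(), toy_workers_drain()
and toy_submit() are hypothetical names, the queue tracks nothing but a
count, and "workers draining the queue" is simulated by a fixed number of
freed slots per synchronous operation.  No locking or latches are modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_QUEUE_SIZE 4		/* hypothetical capacity, not the real size */

typedef struct ToyQueue
{
	int			items[TOY_QUEUE_SIZE];
	int			count;
} ToyQueue;

/* Try to enqueue one IO; returns false when the queue is full. */
static bool
toy_queue_insert(ToyQueue *q, int io)
{
	if (q->count == TOY_QUEUE_SIZE)
		return false;
	q->items[q->count++] = io;
	return true;
}

/*
 * Stand-in for I/O workers consuming up to n queued IOs while the submitter
 * is busy.  Only the count matters in this toy model.
 */
static void
toy_workers_drain(ToyQueue *q, int n)
{
	q->count = (n >= q->count) ? 0 : q->count - n;
}

/*
 * Submit nios IOs.  Whatever doesn't fit is run synchronously one at a time,
 * but after each synchronous operation we try to enqueue the rest again.
 * Returns the number of IOs that had to run synchronously.
 */
static int
toy_submit(ToyQueue *q, const int *ios, int nios, int drained_per_sync)
{
	int			next = 0;
	int			nsync = 0;

	assert(nios >= 0 && drained_per_sync >= 0);

	/* First pass: enqueue as many as fit. */
	while (next < nios && toy_queue_insert(q, ios[next]))
		next++;

	while (next < nios)
	{
		/* "Perform" ios[next] synchronously. */
		next++;
		nsync++;

		/* Meanwhile the (simulated) workers free some queue slots. */
		toy_workers_drain(q, drained_per_sync);

		/* Between synchronous operations, try to enqueue again. */
		while (next < nios && toy_queue_insert(q, ios[next]))
			next++;
	}

	return nsync;
}
```

In this model, with a capacity of 4, a batch of 10 IOs and two slots freed
per synchronous operation, only 2 IOs run synchronously.  If nothing drains,
the remaining 6 all run synchronously, which corresponds to the previous
behavior.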



  [text/x-patch] v3-0003-aio-Adjust-I-O-worker-pool-size-automatically.patch (38.5K, 4-v3-0003-aio-Adjust-I-O-worker-pool-size-automatically.patch)
From 0e4619a7243c85a2db95a993cc49f6e70ff2258a Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v3 3/3] aio: Adjust I/O worker pool size automatically.

Replace the simple io_workers setting with:

  io_min_workers=1
  io_max_workers=8 (can be up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent demand.

Reviewed-by: Dmitry Dolgov <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  69 ++-
 src/backend/postmaster/postmaster.c           |  87 ++-
 src/backend/storage/aio/method_worker.c       | 515 ++++++++++++++----
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_parameters.dat     |  34 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +-
 src/include/storage/io_worker.h               |  10 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 608 insertions(+), 132 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 229f41353eb..4c8a133bb4d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2870,16 +2870,75 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes to use. The default is
+         1. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes to use. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which idle I/O worker processes will exit, reducing the
+         size of the pool when demand drops.  The default is 1 minute.  This
+         parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time between launching new I/O workers.  This can be used to avoid
+         creating too many for a short-lived burst of demand.  The default is 100ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3fac46c402b..2ff2e43c504 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -408,6 +408,8 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_next_time = 0;
+static TimestampTz io_worker_launch_last_time = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -1550,10 +1552,9 @@ DetermineSleepTime(void)
 
 	/*
 	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * a shutdown sequence.
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown > NoShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1573,13 +1574,16 @@ DetermineSleepTime(void)
 
 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
+	/* Handle background workers, unless we're shutting down. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
 
+	/* If we need a new IO worker, defer until launch interval expires. */
+	if (pgaio_worker_test_grow() && io_worker_count < io_max_workers)
+		next_wakeup = io_worker_launch_next_time;
+
 	if (HaveCrashedWorker)
 	{
 		dlist_mutable_iter iter;
@@ -3776,6 +3780,15 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_adjust_io_workers() will run.
+		 */
+	}
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4380,8 +4393,9 @@ maybe_reap_io_worker(int pid)
 /*
  * Start or stop IO workers, to close the gap between the number of running
  * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
+ * the io_{min,max}_workers GUCs (by increasing and decreasing the number of
+ * workers) and requests to start a new one due to submission queue backlog,
+ * as well as workers terminating in response to errors (by starting
  * "replacement" workers).
  */
 static void
@@ -4410,12 +4424,47 @@ maybe_adjust_io_workers(void)
 
 	Assert(pmState < PM_WAIT_IO_WORKERS);
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	/* Not enough workers running? */
+	while (io_worker_count < io_max_workers)
 	{
 		PMChild    *child;
 		int			i;
 
+		/* Respect launch interval after minimum pool is reached. */
+		if (io_worker_count >= io_min_workers)
+		{
+			TimestampTz now = GetCurrentTimestamp();
+
+			/*
+			 * Still waiting for launch interval to expire, or no launch
+			 * requested?
+			 */
+			if (now < io_worker_launch_next_time ||
+				!pgaio_worker_test_and_clear_grow())
+				break;
+
+			/*
+			 * Compute next launch time relative to the existing value, so
+			 * that the postmaster's other duties and the advancing clock
+			 * don't produce an inaccurate launch interval.
+			 */
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+											io_worker_launch_interval);
+
+			/*
+			 * If that's already in the past, the interval is either
+			 * impossibly short or we received no requests for new workers for
+			 * a period.  Compute a new future time relative to the last
+			 * actual launch time instead, and proceed to launch a worker.
+			 */
+			if (io_worker_launch_next_time <= now)
+				io_worker_launch_next_time =
+					TimestampTzPlusMilliseconds(io_worker_launch_last_time,
+												io_worker_launch_interval);
+			io_worker_launch_last_time = now;
+		}
+
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4436,19 +4485,11 @@ maybe_adjust_io_workers(void)
 			break;				/* try again next time */
 	}
 
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
-		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
-		}
-	}
+	/*
+	 * If there are too many running because io_max_workers changed, that will
+	 * be handled by the IO workers themselves so they can shut down in
+	 * preferred order.
+	 */
 }
 
 
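The launch-interval throttling in maybe_adjust_io_workers() above can be
sketched in isolation roughly as follows.  This is a simplified model, not
the patch's code: LaunchThrottle, toy_time_ms and launch_allowed() are
hypothetical names, time is a plain millisecond counter, and the grow request
is passed in as a bool instead of being tested and cleared in shared memory.
One deliberate difference: when the schedule has fallen into the past, this
sketch restarts it from the current time, whereas the patch keeps a separate
last-launch timestamp for that purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t toy_time_ms;	/* hypothetical stand-in for TimestampTz */

typedef struct LaunchThrottle
{
	toy_time_ms next_time;		/* earliest time the next launch may happen */
	toy_time_ms interval;		/* minimum spacing between launches */
} LaunchThrottle;

/*
 * Decide whether a new worker may be launched at time 'now'.  On success the
 * schedule is advanced, so repeated calls are spaced at least 'interval'
 * apart.
 */
static bool
launch_allowed(LaunchThrottle *t, toy_time_ms now, bool grow_requested)
{
	assert(t->interval > 0);

	/* Still waiting for the interval to expire, or nothing requested? */
	if (now < t->next_time || !grow_requested)
		return false;

	/*
	 * Advance relative to the existing schedule, so that a late check does
	 * not stretch the effective interval.
	 */
	t->next_time += t->interval;

	/* If that is still in the past, restart the schedule from 'now'. */
	if (t->next_time <= now)
		t->next_time = now + t->interval;

	return true;
}
```

For example, with a 100ms interval and a stale schedule, a request at t=1000
is granted and the next slot becomes t=1100.  A request at t=1050 is
refused, and one granted slightly late at t=1130 schedules the following
slot for t=1200 rather than t=1230.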
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index b1b0b6848a0..e4d3348e98e 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,8 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * The pool tries to stabilize at a size that can handle recently seen
+ * variation in demand, within the configured limits.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -29,6 +28,8 @@
 
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/injection_point.h"
@@ -47,10 +50,11 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
+/* Saturation for stats counters used to estimate wakeup:work ratio. */
+#define PGAIO_WORKER_STATS_MAX 4
 
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Debugging only: show activity and statistics in ps command line. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -62,17 +66,37 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction.  If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+
+#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	volatile bool grow;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_worker_set;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet worker_set;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
-
 static size_t pgaio_worker_shmem_size(void);
 static void pgaio_worker_shmem_init(bool first_time);
 
@@ -90,15 +114,103 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 1;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 100;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
 
+static void
+pgaio_worker_set_initialize(PgAioWorkerSet *set)
+{
+	*set = 0;
+}
+
+static bool
+pgaio_worker_set_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_worker_set_singleton(int worker)
+{
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_worker_set_fill(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
+}
+
+static void
+pgaio_worker_set_subtract(PgAioWorkerSet *set1, const PgAioWorkerSet *set2)
+{
+	*set1 &= ~*set2;
+}
+
+static void
+pgaio_worker_set_insert(PgAioWorkerSet *set, int worker)
+{
+	*set |= pgaio_worker_set_singleton(worker);
+}
+
+static void
+pgaio_worker_set_remove(PgAioWorkerSet *set, int worker)
+{
+	*set &= ~pgaio_worker_set_singleton(worker);
+}
+
+static void
+pgaio_worker_set_remove_less_than(PgAioWorkerSet *set, int worker)
+{
+	*set &= ~(pgaio_worker_set_singleton(worker) - 1);
+}
+
+static int
+pgaio_worker_set_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_worker_set_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
+
+static int
+pgaio_worker_set_get_lowest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_worker_set_is_empty(set));
+	return pg_rightmost_one_pos64(*set);
+}
+
+static int
+pgaio_worker_set_pop_lowest(PgAioWorkerSet *set)
+{
+	int			worker = pgaio_worker_set_get_lowest(set);
+
+	pgaio_worker_set_remove(set, worker);
+	return worker;
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_worker_set_contains(PgAioWorkerSet *set, int worker)
+{
+	return (*set & pgaio_worker_set_singleton(worker)) != 0;
+}
+
+static int
+pgaio_worker_set_count(PgAioWorkerSet *set)
+{
+	return pg_popcount64(*set);
+}
+#endif
+
 static size_t
 pgaio_worker_queue_shmem_size(int *queue_size)
 {
@@ -151,37 +263,113 @@ pgaio_worker_shmem_init(bool first_time)
 						&found);
 	if (!found)
 	{
-		io_worker_control->idle_worker_mask = 0;
+		io_worker_control->grow = false;
+		pgaio_worker_set_initialize(&io_worker_control->worker_set);
+		pgaio_worker_set_initialize(&io_worker_control->idle_worker_set);
 		for (int i = 0; i < MAX_IO_WORKERS; ++i)
+			io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+	}
+}
+
+static void
+pgaio_worker_grow(bool grow)
+{
+	/*
+	 * This is called from sites that don't hold AioWorkerControlLock, but
+	 * these values change infrequently and an up-to-date value is not
+	 * required for this heuristic purpose.
+	 */
+	if (!grow)
+	{
+		/* Avoid dirtying memory if not already set. */
+		if (io_worker_control->grow)
+			io_worker_control->grow = false;
+	}
+	else
+	{
+		/* Do nothing if request already pending. */
+		if (!io_worker_control->grow)
 		{
-			io_worker_control->workers[i].latch = NULL;
-			io_worker_control->workers[i].in_use = false;
+			io_worker_control->grow = true;
+			SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
 		}
 	}
 }
 
+/*
+ * Called by the postmaster to check if a new worker is needed.
+ */
+bool
+pgaio_worker_test_grow(void)
+{
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to check if a new worker is needed when it's ready
+ * to launch one, and clear the flag.
+ */
+bool
+pgaio_worker_test_and_clear_grow(void)
+{
+	bool		result;
+
+	result = io_worker_control->grow;
+	if (result)
+		io_worker_control->grow = false;
+
+	return result;
+}
+
 static int
-pgaio_worker_choose_idle(void)
+pgaio_worker_choose_idle(int minimum_worker)
 {
+	PgAioWorkerSet worker_set;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	worker_set = io_worker_control->idle_worker_set;
+	pgaio_worker_set_remove_less_than(&worker_set, minimum_worker);
+	if (pgaio_worker_set_is_empty(&worker_set))
 		return -1;
 
-	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	/* Find the lowest numbered idle worker and mark it not idle. */
+	worker = pgaio_worker_set_get_lowest(&worker_set);
+	pgaio_worker_set_remove(&io_worker_control->idle_worker_set, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_worker_set.  That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -203,6 +391,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	int			result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return -1;				/* empty */
@@ -219,6 +409,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -244,8 +436,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -269,18 +460,12 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 				break;
 			}
+		}
 
-			if (wakeup == NULL)
-			{
-				/* Choose an idle worker to wake up if we haven't already. */
-				worker = pgaio_worker_choose_idle();
-				if (worker >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
-
-				pgaio_debug_io(DEBUG4, staged_ios[i],
-							   "choosing worker %d",
-							   worker);
-			}
+		if (worker == -1)
+		{
+			/* Choose an idle worker to wake up if we haven't already. */
+			worker = pgaio_worker_choose_idle(0);
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 	}
@@ -291,8 +476,12 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		nsync = num_staged_ios;
 	}
 
-	if (wakeup)
-		SetLatch(wakeup);
+	/*
+	 * If we didn't find a worker to wake up, the existing workers will
+	 * determine whether the pool is too small.
+	 */
+	if (worker != -1)
+		pgaio_worker_wake(worker);
 
 	/* Run whatever is left synchronously. */
 	while (nsync > 0)
@@ -303,7 +492,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		/* Between synchronous operations, try to enqueue again. */
 		if (nsync > 0)
 		{
-			wakeup = NULL;
+			worker = -1;
 			if (LWLockConditionalAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE))
 			{
 				while (nsync > 0 &&
@@ -311,13 +500,13 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 				{
 					synchronous_ios++;
 					nsync--;
-					if (wakeup == NULL && (worker = pgaio_worker_choose_idle()) >= 0)
-						wakeup = io_worker_control->workers[worker].latch;
+					if (worker == -1)
+						worker = pgaio_worker_choose_idle(0);
 				}
 				LWLockRelease(AioWorkerSubmissionQueueLock);
 			}
-			if (wakeup)
-				SetLatch(wakeup);
+			if (worker != -1)
+				pgaio_worker_wake(worker);
 		}
 	}
 
@@ -331,14 +520,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
+	pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
+	notify_set = io_worker_control->worker_set;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
+	while (!pgaio_worker_set_is_empty(&notify_set))
+		pgaio_worker_wake(pgaio_worker_set_pop_lowest(&notify_set));
 }
 
 /*
@@ -348,33 +550,34 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_worker_set;
+	PgAioWorkerSet old_worker_set;
+
 	MyIoWorkerId = -1;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	pgaio_worker_set_fill(&free_worker_set);
+	pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
+	if (!pgaio_worker_set_is_empty(&free_worker_set))
+		MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
-	}
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
 
-	if (MyIoWorkerId == -1)
-		elog(ERROR, "couldn't find a free worker slot");
+	old_worker_set = io_worker_control->worker_set;
+	Assert(!pgaio_worker_set_contains(&old_worker_set, MyIoWorkerId));
+	pgaio_worker_set_insert(&io_worker_control->worker_set, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	/* Notify other workers on pool change. */
+	while (!pgaio_worker_set_is_empty(&old_worker_set))
+		pgaio_worker_wake(pgaio_worker_set_pop_lowest(&old_worker_set));
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -400,14 +603,47 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	PgAioWorkerSet worker_set;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	worker_set = io_worker_control->worker_set;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_worker_set_get_highest(&worker_set))
+		return false;
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			ios = 0;
+	int			wakeups = 0;
 
 	AuxiliaryProcessMainCommon();
 
@@ -475,10 +711,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		grow = false;
 
 		/*
 		 * Try to get a job to do.
@@ -489,38 +724,55 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
+									MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
+									MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do this
+			 * if this worker is itself not receiving spurious wakeups.  This
+			 * heuristic discovers the useful wakeup propagation chain length.
+			 */
+			if (wakeups <= ios)
 			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
+				queue_depth = pgaio_worker_submission_queue_depth();
+				worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);
+
+				/*
+				 * If there were no idle higher numbered peers and there are
+				 * more than enough IOs queued for me and all lower numbered
+				 * peers, then try to start a new worker.
+				 */
+				if (worker == -1 && queue_depth > MyIoWorkerId)
+					grow = true;
 			}
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
+		else if (grow)
+			pgaio_worker_grow(true);
 
 		if (io_index != -1)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++ios == PGAIO_WORKER_STATS_MAX)
+			{
+				ios /= 2;
+				wakeups /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -573,6 +825,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			}
 #endif
 
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			snprintf(cmd, sizeof(cmd), "%d: [%s] %s",
+					 MyIoWorkerId,
+					 pgaio_io_get_op_name(ioh),
+					 pgaio_io_get_target_description(ioh));
+			set_ps_display(cmd);
+#endif
+
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -586,8 +846,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker if pending. */
+			pgaio_worker_grow(false);
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* On first sleep, compute absolute timeout. */
+				if (idle_timeout_abs == 0)
+				{
+					idle_timeout_abs =
+						TimestampTzPlusMilliseconds(now,
+													io_worker_idle_timeout);
+					timeout_guc_used = io_worker_idle_timeout;
+				}
+
+				/*
+				 * All workers maintain the absolute timeout value, but only
+				 * the highest worker can actually time out and only if
+				 * io_min_workers is satisfied.  All others wait only for
+				 * explicit wakeups caused by queue insertion, wakeup
+				 * propagation, change of pool size (possibly promoting one to
+				 * new highest) or GUC reload.
+				 */
+				if (pgaio_worker_can_timeout())
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now,
+														idle_timeout_abs);
+				else
+					timeout_ms = -1;
+			}
+
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: idle, ios:wakeups = %d:%d",
+					MyIoWorkerId, ios, wakeups);
+			set_ps_display(cmd);
+#endif
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++wakeups == PGAIO_WORKER_STATS_MAX)
+				{
+					ios /= 2;
+					wakeups /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -597,6 +924,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 6be80d2daad..5b58f45e18b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -365,6 +365,7 @@ SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> s
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 LogicalDecodingControl	"Waiting to read or update logical decoding status information."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0a862693fcd..9301f7f806f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1381,6 +1381,14 @@
   check_hook => 'check_io_max_concurrency',
 },
 
+{ name => 'io_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_max_workers',
+  boot_val => '8',
+  min => '1',
+  max => 'MAX_IO_WORKERS',
+},
+
 { name => 'io_method', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_IO',
   short_desc => 'Selects the method for executing asynchronous I/O.',
   variable => 'io_method',
@@ -1389,14 +1397,32 @@
   assign_hook => 'assign_io_method',
 },
 
-{ name => 'io_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
-  short_desc => 'Number of IO worker processes, for io_method=worker.',
-  variable => 'io_workers',
-  boot_val => '3',
+{ name => 'io_min_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_min_workers',
+  boot_val => '1',
   min => '1',
   max => 'MAX_IO_WORKERS',
 },
 
+{ name => 'io_worker_idle_timeout', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum time before idle I/O worker processes time out, for io_method=worker.',
+  variable => 'io_worker_idle_timeout',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '60000',
+  min => '0',
+  max => 'INT_MAX',
+},
+
+{ name => 'io_worker_launch_interval', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum time before launching a new I/O worker process, for io_method=worker.',
+  variable => 'io_worker_launch_interval',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX',
+},
+
 # Not for general use --- used by SET SESSION AUTHORIZATION and SET
 # ROLE
 { name => 'is_superuser', type => 'bool', context => 'PGC_INTERNAL', group => 'UNGROUPED',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf15597385b..643dd2866e0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -218,7 +218,11 @@
                                         # can execute simultaneously
                                         # -1 sets based on shared_buffers
                                         # (change requires restart)
-#io_workers = 3                         # 1-32;
+
+#io_min_workers = 1                     # 1-32 (change requires pg_reload_conf())
+#io_max_workers = 8                     # 1-32
+#io_worker_idle_timeout = 60s
+#io_worker_launch_interval = 100ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index f7d5998a138..40559fb2428 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,14 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+/* Public GUCs. */
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_test_grow(void);
+extern bool pgaio_worker_test_and_clear_grow(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 59ee097977d..e39d5a947fa 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -87,6 +87,7 @@ PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
 PG_LWLOCK(55, LogicalDecodingControl)
+PG_LWLOCK(56, AioWorkerControl)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 206fb78f8a5..00e1b426d69 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_GROW,	/* I/O worker pool wants to grow */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index 34bc132ea08..b9775811d4d 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,24 +65,24 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		like(
 			$stderr,
-			qr/$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			qr/$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e3c1007abdf..7e340a9e791 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2250,6 +2250,7 @@ PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgAioWorkerControl
+PgAioWorkerSet
 PgAioWorkerSlot
 PgAioWorkerSubmissionQueue
 PgArchData
-- 
2.53.0




* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-06 15:02               ` Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Thomas Munro @ 2026-04-06 15:02 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Here's an updated patch.  It's mostly just rebased over the recent
firehose, but with lots of comments and a few names (hopefully)
improved.  There is one code change to highlight though:

maybe_start_io_workers() knows when it's not allowed to create new
workers, an interesting case being FatalError while we're still in the
shutdown phase of a crash restart, before we have started up again.
The previous coding of DetermineSleepTime() didn't
know about that, so it could return 0 (don't sleep), and then the
postmaster could busy-wait for restart progress.  Maybe there were
other cases like that, but in general DetermineSleepTime() and
maybe_start_io_workers() really need to be 100% in agreement.  So I
have moved that knowledge into a new function
maybe_start_io_workers_scheduled_at().  Both DetermineSleepTime() and
maybe_start_io_workers() call that so there is a single source of
truth.
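
To illustrate the single-source-of-truth idea, here is a minimal
standalone sketch in C.  It is not the patch's code: PoolState,
launch_scheduled_at(), sleep_ms() and launch_due_workers() are
hypothetical names, and it assumes the v4-0001 convention that 0 means
nothing is scheduled and any time in the past means ASAP.  Because the
sleep computation and the launcher both ask the same function, a zero
sleep should always be paired with an actual launch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for postmaster state, for illustration only. */
typedef int64_t ts_usec;			/* microseconds, in the spirit of TimestampTz */
#define SCHED_NONE	((ts_usec) 0)	/* nothing scheduled */
#define SCHED_ASAP	INT64_MIN		/* any time in the past means ASAP */

typedef struct PoolState
{
	bool		launches_allowed;	/* false e.g. while shutting down */
	int			nworkers;
	int			min_workers;
	int			max_workers;
	bool		grow_requested;
	ts_usec		next_launch_time;
} PoolState;

/* The one place that decides whether and when a launch may happen. */
static ts_usec
launch_scheduled_at(const PoolState *s)
{
	if (!s->launches_allowed)
		return SCHED_NONE;
	if (s->nworkers >= s->max_workers)
		return SCHED_NONE;
	if (s->nworkers < s->min_workers)
		return SCHED_ASAP;
	if (!s->grow_requested)
		return SCHED_NONE;
	return s->next_launch_time;
}

/* The sleep computation consults it... */
static int
sleep_ms(const PoolState *s, ts_usec now, int max_ms)
{
	ts_usec		at = launch_scheduled_at(s);
	int64_t		ms;

	assert(now > 0 && max_ms >= 0);
	if (at == SCHED_NONE)
		return max_ms;
	if (at <= now)
		return 0;
	/* Round up, so we never wake just before the launch is due. */
	ms = (at - now) / 1000 + ((at - now) % 1000 != 0);
	return ms < max_ms ? (int) ms : max_ms;
}

/* ...and so does the launcher.  Returns the number of workers started. */
static int
launch_due_workers(PoolState *s, ts_usec now, ts_usec interval)
{
	int			launched = 0;
	ts_usec		at;

	while ((at = launch_scheduled_at(s)) != SCHED_NONE && at <= now)
	{
		s->nworkers++;
		s->grow_requested = false;
		s->next_launch_time = now + interval;
		launched++;
	}
	return launched;
}
```

In this sketch, clearing launches_allowed makes sleep_ms() return the
full max_ms and launch_due_workers() return 0, so the two can't
disagree in the way described above.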

I think I got confused about that because it's not that obvious why
the existing code doesn't test FatalError.

I thought of a slightly bigger refactoring that might deconfuse
DetermineSleepTime() a bit more.  Probably material for the next
cycle, but basically the idea is to stop using a bunch of different
conditions and different units of time and convert the whole thing to
a simple find-the-lowest-time function.  I kept that separate.
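
As a rough standalone sketch of what "find the lowest time" could look
like (again not the patch's code; ts_usec, earliest() and
sleep_ms_until() are hypothetical names, and the two sentinels only
mimic the idea of "never" and "immediately"): every duty reports one
timestamp in the same unit, the soonest one wins, and the result is
converted to a sleep time clamped to [0, INT_MAX] milliseconds.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t ts_usec;			/* microseconds, one unit for everything */

/* Hypothetical sentinels: sort after / before every real time. */
#define SCHEDULE_NEVER			INT64_MAX
#define SCHEDULE_IMMEDIATELY	INT64_MIN

/* Return the earliest of n scheduled times, or SCHEDULE_NEVER if n == 0. */
static ts_usec
earliest(const ts_usec *times, size_t n)
{
	ts_usec		best = SCHEDULE_NEVER;

	for (size_t i = 0; i < n; i++)
	{
		if (times[i] < best)
			best = times[i];
	}
	return best;
}

/* Convert a wakeup time to a sleep in milliseconds, in [0, INT_MAX]. */
static int
sleep_ms_until(ts_usec now, ts_usec wake_at)
{
	int64_t		ms;

	assert(now >= 0);
	if (wake_at <= now)
		return 0;				/* due or overdue: don't sleep */

	/* Round up any fractional millisecond so we don't wake early. */
	ms = (wake_at - now) / 1000 + ((wake_at - now) % 1000 != 0);
	return ms > INT_MAX ? INT_MAX : (int) ms;
}
```

With that shape, a duty that has nothing to do just reports "never"
and drops out of the minimum, and a duty that is overdue reports
"immediately" and forces a zero sleep, without any special-case
branches in the caller.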

I'll post a new version of the patch that was v3-0002 separately.


Attachments:

  [text/x-patch] v4-0002-Refactor-the-postmaster-s-periodic-job-scheduling.patch (14.9K, 2-v4-0002-Refactor-the-postmaster-s-periodic-job-scheduling.patch)
From ccc5b6fc9cf7d30359b015c953c04f481c66657e Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Mon, 6 Apr 2026 20:54:53 +1200
Subject: [PATCH v4 2/2] Refactor the postmaster's periodic job scheduling.

DetermineSleepTime() considers the following reasons for ServerLoop() to
wake up:

 * bgworker restart delay reached
 * I/O worker launch interval reached
 * SIGKILL timeout reached during immediate shutdown/crash restart
 * periodically checking the lock file
 * periodically touching socket files

To make it easier to follow:

 * move the next-bgworker-wakeup logic out to its own function
 * standardize the unit of timekeeping
 * convert DetermineSleepTime() to just: which is soonest?

As a side effect, the SIGKILL, lock file and socket file duties are now
performed with more accurate timing.
---
 src/backend/postmaster/postmaster.c | 247 ++++++++++++++--------------
 1 file changed, 125 insertions(+), 122 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index c42564500c6..2a6887eb6c2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -361,13 +361,24 @@ static PMState pmState = PM_INIT;
  */
 static bool connsAllowed = true;
 
-/* Start time of SIGKILL timeout during immediate shutdown or child crash */
-/* Zero means timeout is not running */
-static time_t AbortStartTime = 0;
+/* Special values for scheduling Postmaster duties at certain times. */
+#define PM_SCHEDULE_NEVER				TIMESTAMP_INFINITY
+#define PM_SCHEDULE_IMMEDIATELY			TIMESTAMP_MINUS_INFINITY
+
+/* Time of SIGKILL during immediate shutdown or child crash */
+static TimestampTz sigkill_children_scheduled_at = PM_SCHEDULE_NEVER;
 
 /* Length of said timeout */
 #define SIGKILL_CHILDREN_AFTER_SECS		5
 
+/* Time of next lockfile check and socket touch. */
+static TimestampTz lockfile_check_scheduled_at;
+static TimestampTz socket_touch_scheduled_at;
+
+/* Length of said timeouts */
+#define LOCKFILE_CHECK_SECS				60
+#define SOCKET_TOUCH_SECS				(58 * SECS_PER_MINUTE)
+
 static bool ReachedNormalRunning = false;	/* T if we've reached PM_RUN */
 
 bool		ClientAuthInProgress = false;	/* T during new-client
@@ -409,8 +420,8 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
-static TimestampTz io_worker_launch_next_time = 0;
-static TimestampTz io_worker_launch_last_time = 0;
+static TimestampTz io_worker_launch_next_time;
+static TimestampTz io_worker_launch_last_time;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -448,6 +459,7 @@ static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
+static TimestampTz maybe_start_bgworkers_scheduled_at(void);
 static bool maybe_reap_io_worker(int pid);
 static void maybe_start_io_workers(void);
 static TimestampTz maybe_start_io_workers_scheduled_at(void);
@@ -1546,98 +1558,33 @@ checkControlFile(void)
 	FreeFile(fp);
 }
 
+static void
+compute_next_wakeup(TimestampTz *next_wakeup, TimestampTz wakeup)
+{
+	if (*next_wakeup > wakeup)
+		*next_wakeup = wakeup;
+}
+
 /*
  * Determine how long should we let ServerLoop sleep, in milliseconds.
- *
- * In normal conditions we wait at most one minute, to ensure that the other
- * background tasks handled by ServerLoop get done even when no requests are
- * arriving.  However, if there are background workers waiting to be started,
- * we don't actually sleep so that they are quickly serviced.  Other exception
- * cases are as shown in the code.
+ * Returns the time to wait for the next of ServerLoop()'s scheduled duties.
+ * The longest possible wait is one minute (LOCKFILE_CHECK_SECS), but it could
+ * be as low as zero if one of the jobs below is due/overdue now.
  */
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup;
-
-	/*
-	 * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
-	 * ignore everything else and wait for that.
-	 */
-	if (Shutdown >= ImmediateShutdown || FatalError)
-	{
-		if (AbortStartTime != 0)
-		{
-			time_t		curtime = time(NULL);
-			int			seconds;
-
-			/*
-			 * time left to abort; clamp to 0 if it already expired, or if
-			 * time goes backwards
-			 */
-			if (curtime < AbortStartTime ||
-				curtime - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)
-				seconds = 0;
-			else
-				seconds = SIGKILL_CHILDREN_AFTER_SECS -
-					(curtime - AbortStartTime);
-
-			return seconds * 1000;
-		}
-	}
-
-	/* Time of next maybe_start_io_workers() call, or 0 for none. */
-	next_wakeup = maybe_start_io_workers_scheduled_at();
-
-	/* Ignore bgworkers during shutdown. */
-	if (StartWorkerNeeded && Shutdown == NoShutdown)
-		return 0;
-
-	if (HaveCrashedWorker && Shutdown == NoShutdown)
-	{
-		dlist_mutable_iter iter;
-
-		/*
-		 * When there are crashed bgworkers, we sleep just long enough that
-		 * they are restarted when they request to be.  Scan the list to
-		 * determine the minimum of all wakeup times according to most recent
-		 * crash time and requested restart interval.
-		 */
-		dlist_foreach_modify(iter, &BackgroundWorkerList)
-		{
-			RegisteredBgWorker *rw;
-			TimestampTz this_wakeup;
-
-			rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
-
-			if (rw->rw_crashed_at == 0)
-				continue;
-
-			if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
-				|| rw->rw_terminate)
-			{
-				ForgetBackgroundWorker(rw);
-				continue;
-			}
+	TimestampTz next_wakeup = PM_SCHEDULE_NEVER;
 
-			this_wakeup = TimestampTzPlusMilliseconds(rw->rw_crashed_at,
-													  1000L * rw->rw_worker.bgw_restart_time);
-			if (next_wakeup == 0 || this_wakeup < next_wakeup)
-				next_wakeup = this_wakeup;
-		}
-	}
+	/* Find the time of the next scheduled ServerLoop() duty. */
+	compute_next_wakeup(&next_wakeup, sigkill_children_scheduled_at);
+	compute_next_wakeup(&next_wakeup, lockfile_check_scheduled_at);
+	compute_next_wakeup(&next_wakeup, socket_touch_scheduled_at);
+	compute_next_wakeup(&next_wakeup, maybe_start_io_workers_scheduled_at());
+	compute_next_wakeup(&next_wakeup, maybe_start_bgworkers_scheduled_at());
 
-	if (next_wakeup != 0)
-	{
-		int			ms;
-
-		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
-		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-												   next_wakeup);
-		return Min(60 * 1000, ms);
-	}
-
-	return 60 * 1000;
+	/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+	return TimestampDifferenceMilliseconds(GetCurrentTimestamp(), next_wakeup);
 }
 
 /*
@@ -1675,17 +1622,17 @@ ConfigurePostmasterWaitSet(bool accept_connections)
 static int
 ServerLoop(void)
 {
-	time_t		last_lockfile_recheck_time,
-				last_touch_time;
 	WaitEvent	events[MAXLISTEN];
 	int			nevents;
 
 	ConfigurePostmasterWaitSet(true);
-	last_lockfile_recheck_time = last_touch_time = time(NULL);
+
+	lockfile_check_scheduled_at = GetCurrentTimestamp();
+	socket_touch_scheduled_at = GetCurrentTimestamp();
 
 	for (;;)
 	{
-		time_t		now;
+		TimestampTz now;
 
 		nevents = WaitEventSetWait(pm_wait_set,
 								   DetermineSleepTime(),
@@ -1760,12 +1707,9 @@ ServerLoop(void)
 		/*
 		 * Lastly, check to see if it's time to do some things that we don't
 		 * want to do every single time through the loop, because they're a
-		 * bit expensive.  Note that there's up to a minute of slop in when
-		 * these tasks will be performed, since DetermineSleepTime() will let
-		 * us sleep at most that long; except for SIGKILL timeout which has
-		 * special-case logic there.
+		 * bit expensive.
 		 */
-		now = time(NULL);
+		now = GetCurrentTimestamp();
 
 		/*
 		 * If we already sent SIGQUIT to children and they are slow to shut
@@ -1776,10 +1720,10 @@ ServerLoop(void)
 		 *
 		 * Note we also do this during recovery from a process crash.
 		 */
-		if ((Shutdown >= ImmediateShutdown || FatalError) &&
-			AbortStartTime != 0 &&
-			(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
+		if (now >= sigkill_children_scheduled_at)
 		{
+			Assert(Shutdown >= ImmediateShutdown || FatalError);
+
 			/* We were gentle with them before. Not anymore */
 			ereport(LOG,
 			/* translator: %s is SIGKILL or SIGABRT */
@@ -1787,7 +1731,7 @@ ServerLoop(void)
 							send_abort_for_kill ? "SIGABRT" : "SIGKILL")));
 			TerminateChildren(send_abort_for_kill ? SIGABRT : SIGKILL);
 			/* reset flag so we don't SIGKILL again */
-			AbortStartTime = 0;
+			sigkill_children_scheduled_at = PM_SCHEDULE_NEVER;
 		}
 
 		/*
@@ -1800,7 +1744,7 @@ ServerLoop(void)
 		 * starting a new postmaster.  Data corruption is likely to ensue from
 		 * that anyway, but we can minimize the damage by aborting ASAP.
 		 */
-		if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
+		if (now >= lockfile_check_scheduled_at)
 		{
 			if (!RecheckDataDirLockFile())
 			{
@@ -1808,7 +1752,9 @@ ServerLoop(void)
 						(errmsg("performing immediate shutdown because data directory lock file is invalid")));
 				kill(MyProcPid, SIGQUIT);
 			}
-			last_lockfile_recheck_time = now;
+			lockfile_check_scheduled_at =
+				TimestampTzPlusSeconds(lockfile_check_scheduled_at,
+									   LOCKFILE_CHECK_SECS);
 		}
 
 		/*
@@ -1816,11 +1762,13 @@ ServerLoop(void)
 		 * they are not removed by overzealous /tmp-cleaning tasks.  We assume
 		 * no one runs cleaners with cutoff times of less than an hour ...
 		 */
-		if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
+		if (now >= socket_touch_scheduled_at)
 		{
 			TouchSocketFiles();
 			TouchSocketLockFiles();
-			last_touch_time = now;
+			socket_touch_scheduled_at =
+				TimestampTzPlusSeconds(socket_touch_scheduled_at,
+									   SOCKET_TOUCH_SECS);
 		}
 	}
 }
@@ -2231,7 +2179,9 @@ process_pm_shutdown_request(void)
 			UpdatePMState(PM_WAIT_BACKENDS);
 
 			/* set stopwatch for them to die */
-			AbortStartTime = time(NULL);
+			sigkill_children_scheduled_at =
+				TimestampTzPlusSeconds(GetCurrentTimestamp(),
+									   SIGKILL_CHILDREN_AFTER_SECS);
 
 			/*
 			 * Now wait for backends to exit.  If there are none,
@@ -2354,7 +2304,7 @@ process_pm_child_exit(void)
 			 */
 			StartupStatus = STARTUP_NOT_RUNNING;
 			FatalError = false;
-			AbortStartTime = 0;
+			sigkill_children_scheduled_at = PM_SCHEDULE_NEVER;
 			ReachedNormalRunning = true;
 			UpdatePMState(PM_RUN);
 			connsAllowed = true;
@@ -2815,8 +2765,10 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
 	 * for us to kill them without mercy.
 	 */
-	if (AbortStartTime == 0)
-		AbortStartTime = time(NULL);
+	if (sigkill_children_scheduled_at == PM_SCHEDULE_NEVER)
+		sigkill_children_scheduled_at =
+			TimestampTzPlusSeconds(GetCurrentTimestamp(),
+								   SIGKILL_CHILDREN_AFTER_SECS);
 }
 
 /*
@@ -3282,7 +3234,7 @@ PostmasterStateMachine(void)
 		Assert(StartupPMChild != NULL);
 		StartupStatus = STARTUP_RUNNING;
 		/* crash recovery started, reset SIGKILL flag */
-		AbortStartTime = 0;
+		sigkill_children_scheduled_at = PM_SCHEDULE_NEVER;
 
 		/* start accepting server socket connection events again */
 		ConfigurePostmasterWaitSet(true);
@@ -3759,7 +3711,7 @@ process_pm_pmsignal(void)
 	{
 		/* WAL redo has started. We're out of reinitialization. */
 		FatalError = false;
-		AbortStartTime = 0;
+		sigkill_children_scheduled_at = PM_SCHEDULE_NEVER;
 		reachedConsistency = false;
 
 		/*
@@ -4278,6 +4230,56 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	return false;
 }
 
+static TimestampTz
+maybe_start_bgworkers_scheduled_at(void)
+{
+	TimestampTz next_wakeup;
+
+	/* Background workers are ignored during shutdown. */
+	if (Shutdown != NoShutdown)
+		return PM_SCHEDULE_NEVER;
+
+	/* Do we need a worker right now? */
+	if (StartWorkerNeeded)
+		return PM_SCHEDULE_IMMEDIATELY;
+
+	next_wakeup = PM_SCHEDULE_NEVER;
+	if (HaveCrashedWorker)
+	{
+		dlist_mutable_iter iter;
+
+		/*
+		 * When there are crashed bgworkers, we sleep just long enough that
+		 * they are restarted when they request to be.  Scan the list to
+		 * determine the minimum of all wakeup times according to most recent
+		 * crash time and requested restart interval.
+		 */
+		dlist_foreach_modify(iter, &BackgroundWorkerList)
+		{
+			RegisteredBgWorker *rw;
+			TimestampTz this_wakeup;
+
+			rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
+
+			if (rw->rw_crashed_at == 0)
+				continue;
+
+			if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
+				|| rw->rw_terminate)
+			{
+				ForgetBackgroundWorker(rw);
+				continue;
+			}
+
+			this_wakeup = TimestampTzPlusMilliseconds(rw->rw_crashed_at,
+													  1000L * rw->rw_worker.bgw_restart_time);
+			compute_next_wakeup(&next_wakeup, this_wakeup);
+		}
+	}
+
+	return next_wakeup;
+}
+
 /*
  * If the time is right, start background worker(s).
  *
@@ -4423,8 +4425,8 @@ maybe_reap_io_worker(int pid)
 
 /*
  * Returns the next time at which maybe_start_io_workers() would start one or
- * more I/O workers.  Any time in the past means ASAP, and 0 means no worker
- * is currently scheduled.
+ * more I/O workers, or one of the special values PM_SCHEDULE_IMMEDIATELY and
+ * PM_SCHEDULE_NEVER.
  *
  * This is called by DetermineSleepTime() and also maybe_start_io_workers()
  * itself, to make sure that they agree.
@@ -4433,25 +4435,25 @@ static TimestampTz
 maybe_start_io_workers_scheduled_at(void)
 {
 	if (!pgaio_workers_enabled())
-		return 0;
+		return PM_SCHEDULE_NEVER;
 
 	/*
 	 * If we're in final shutting down state, then we're just waiting for all
 	 * processes to exit.
 	 */
 	if (pmState >= PM_WAIT_IO_WORKERS)
-		return 0;
+		return PM_SCHEDULE_NEVER;
 
 	/* Don't start new workers during an immediate shutdown either. */
 	if (Shutdown >= ImmediateShutdown)
-		return 0;
+		return PM_SCHEDULE_NEVER;
 
 	/*
 	 * Don't start new workers if we're in the shutdown phase of a crash
 	 * restart. But we *do* need to start if we're already starting up again.
 	 */
 	if (FatalError && pmState >= PM_STOP_BACKENDS)
-		return 0;
+		return PM_SCHEDULE_NEVER;
 
 	/*
 	 * Don't start a worker if we're at or above the maximum.  (Excess workers
@@ -4459,15 +4461,15 @@ maybe_start_io_workers_scheduled_at(void)
 	 * until they are reaped.)
 	 */
 	if (io_worker_count >= io_max_workers)
-		return 0;
+		return PM_SCHEDULE_NEVER;
 
 	/* If we're under the minimum, start a worker as soon as possible. */
 	if (io_worker_count < io_min_workers)
-		return TIMESTAMP_MINUS_INFINITY;	/* start worker ASAP */
+		return PM_SCHEDULE_IMMEDIATELY;
 
 	/* Only proceed if a "grow" request is pending from existing workers. */
 	if (!pgaio_worker_test_grow())
-		return 0;
+		return PM_SCHEDULE_NEVER;
 
 	/*
 	 * maybe_start_io_workers() should start a new I/O worker after this time,
@@ -4487,7 +4489,8 @@ maybe_start_io_workers(void)
 {
 	TimestampTz scheduled_at;
 
-	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) !=
+		   PM_SCHEDULE_NEVER)
 	{
 		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
-- 
2.47.3



  [text/x-patch] v4-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch (42.1K, 3-v4-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch)
From 6c5d16a15add62c68bb7f9c7b6a1e3bde1f406d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v4 1/2] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3.  It
was hard to know how to tune it effectively.  It is now replaced with:

  io_min_workers=1
  io_max_workers=8 (up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent variation in demand.  It grows when existing workers detect a
backlog, and shrinks when the highest numbered worker is idle for too
long.  Work was already concentrated into low-numbered workers in
anticipation of this logic.

The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.

Reviewed-by: Dmitry Dolgov <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  69 ++-
 src/backend/postmaster/postmaster.c           | 161 ++++--
 src/backend/storage/aio/method_worker.c       | 505 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_parameters.dat     |  34 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +-
 src/include/storage/io_worker.h               |  10 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 659 insertions(+), 145 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b44231a362d..94eec85bd96 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2870,16 +2870,75 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes. The default is
+         1. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which entirely idle I/O worker processes exit, reducing the
+         size of the pool to match demand.  The default is 1 minute.  This
+         parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time between successive I/O worker launches.  This avoids
+         creating too many workers for a short burst of activity.  The default is 100ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f13e8f40a0..c42564500c6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,6 +409,8 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_next_time = 0;
+static TimestampTz io_worker_launch_last_time = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -447,7 +449,8 @@ static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
 static bool maybe_reap_io_worker(int pid);
-static void maybe_adjust_io_workers(void);
+static void maybe_start_io_workers(void);
+static TimestampTz maybe_start_io_workers_scheduled_at(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
@@ -1391,7 +1394,7 @@ PostmasterMain(int argc, char *argv[])
 	UpdatePMState(PM_STARTUP);
 
 	/* Make sure we can perform I/O while starting up. */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/* Start bgwriter and checkpointer so they can help with recovery */
 	if (CheckpointerPMChild == NULL)
@@ -1555,14 +1558,13 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	TimestampTz next_wakeup;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
+	 * ignore everything else and wait for that.
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown >= ImmediateShutdown || FatalError)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
 
 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
+	/* Time of next maybe_start_io_workers() call, or 0 for none. */
+	next_wakeup = maybe_start_io_workers_scheduled_at();
+
+	/* Ignore bgworkers during shutdown. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
 
-	if (HaveCrashedWorker)
+	if (HaveCrashedWorker && Shutdown == NoShutdown)
 	{
 		dlist_mutable_iter iter;
 
@@ -2542,7 +2546,17 @@ process_pm_child_exit(void)
 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 				HandleChildCrash(pid, exitstatus, _("io worker"));
 
-			maybe_adjust_io_workers();
+			/*
+			 * A worker that exited with an error might have brought the pool
+			 * size below io_min_workers, or allowed the queue to grow to the
+			 * point where another worker called for growth.
+			 *
+			 * In the common case that a worker timed out due to idleness, no
+			 * replacement needs to be started.  maybe_start_io_workers() will
+			 * figure that out.
+			 */
+			maybe_start_io_workers();
+
 			continue;
 		}
 
@@ -3262,7 +3276,7 @@ PostmasterStateMachine(void)
 		UpdatePMState(PM_STARTUP);
 
 		/* Make sure we can perform I/O while starting up. */
-		maybe_adjust_io_workers();
+		maybe_start_io_workers();
 
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
@@ -3336,7 +3350,7 @@ LaunchMissingBackgroundProcesses(void)
 	 * A config file change will always lead to this function being called, so
 	 * we always will process the config change in a timely manner.
 	 */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/*
 	 * The checkpointer and the background writer are active from the start,
@@ -3797,6 +3811,15 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_start_io_workers() will run.
+		 */
+	}
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4399,44 +4422,106 @@ maybe_reap_io_worker(int pid)
 }
 
 /*
- * Start or stop IO workers, to close the gap between the number of running
- * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
- * "replacement" workers).
+ * Returns the next time at which maybe_start_io_workers() would start one or
+ * more I/O workers.  Any time in the past means ASAP, and 0 means no worker
+ * is currently scheduled.
+ *
+ * This is called by DetermineSleepTime() and also maybe_start_io_workers()
+ * itself, to make sure that they agree.
  */
-static void
-maybe_adjust_io_workers(void)
+static TimestampTz
+maybe_start_io_workers_scheduled_at(void)
 {
 	if (!pgaio_workers_enabled())
-		return;
+		return 0;
 
 	/*
 	 * If we're in final shutting down state, then we're just waiting for all
 	 * processes to exit.
 	 */
 	if (pmState >= PM_WAIT_IO_WORKERS)
-		return;
+		return 0;
 
 	/* Don't start new workers during an immediate shutdown either. */
 	if (Shutdown >= ImmediateShutdown)
-		return;
+		return 0;
 
 	/*
 	 * Don't start new workers if we're in the shutdown phase of a crash
 	 * restart. But we *do* need to start if we're already starting up again.
 	 */
 	if (FatalError && pmState >= PM_STOP_BACKENDS)
-		return;
+		return 0;
+
+	/*
+	 * Don't start a worker if we're at or above the maximum.  (Excess workers
+	 * exit when the GUC is lowered, but the count can be temporarily too high
+	 * until they are reaped.)
+	 */
+	if (io_worker_count >= io_max_workers)
+		return 0;
+
+	/* If we're under the minimum, start a worker as soon as possible. */
+	if (io_worker_count < io_min_workers)
+		return TIMESTAMP_MINUS_INFINITY;	/* start worker ASAP */
+
+	/* Only proceed if a "grow" request is pending from existing workers. */
+	if (!pgaio_worker_test_grow())
+		return 0;
 
-	Assert(pmState < PM_WAIT_IO_WORKERS);
+	/*
+	 * maybe_start_io_workers() should start a new I/O worker after this time,
+	 * or as soon as possible if it is already in the past.
+	 */
+	return io_worker_launch_next_time;
+}
+
+/*
+ * Start I/O workers if required.  Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+	TimestampTz scheduled_at;
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
 	{
+		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
 		int			i;
 
+		Assert(pmState < PM_WAIT_IO_WORKERS);
+
+		/* Still waiting for the scheduled time? */
+		if (scheduled_at > now)
+			break;
+
+		/* Clear the grow request flag if it is set. */
+		pgaio_worker_clear_grow();
+
+		/*
+		 * Compute next launch time relative to the previous value, so that
+		 * time spent on the postmaster's other duties doesn't result in an
+		 * inaccurate launch interval.
+		 */
+		io_worker_launch_next_time =
+			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+										io_worker_launch_interval);
+
+		/*
+		 * If that's already in the past, the interval is either impossibly
+		 * short or we received no requests for new workers for a period.
+		 * Compute a new future time relative to the last launch time instead.
+		 */
+		if (io_worker_launch_next_time <= now)
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(io_worker_launch_last_time,
+											io_worker_launch_interval);
+		io_worker_launch_last_time = now;
+
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4454,20 +4539,14 @@ maybe_adjust_io_workers(void)
 			++io_worker_count;
 		}
 		else
-			break;				/* try again next time */
-	}
-
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
+			/*
+			 * Fork failure: we'll try again after the launch interval
+			 * expires, or be called again without delay if we don't yet have
+			 * io_min_workers.  Don't loop here though, the postmaster has
+			 * other duties.
+			 */
+			break;
 		}
 	}
 }
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index eb686cede1a..863c7dc0104 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,8 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * The pool tries to stabilize at a size that can handle recently seen
+ * variation in demand, within the configured limits.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -29,6 +28,8 @@
 
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
 #include "tcop/tcopprot.h"
@@ -48,10 +51,11 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
+/* Saturation for counters used to estimate wakeup:work ratio. */
+#define PGAIO_WORKER_STATS_MAX 4
 
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,13 +67,34 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
+/*
+ * Sets of worker IDs are held in a simple 64-bit bitmap, accessed through
+ * functions that provide a more readable abstraction.  If we wanted to support
+ * more workers than that, contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+
+#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	volatile bool grow;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_worker_set;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet worker_set;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
@@ -91,15 +116,103 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 1;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 100;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
 
+static void
+pgaio_worker_set_initialize(PgAioWorkerSet *set)
+{
+	*set = 0;
+}
+
+static bool
+pgaio_worker_set_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_worker_set_singleton(int worker)
+{
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_worker_set_fill(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
+}
+
+static void
+pgaio_worker_set_subtract(PgAioWorkerSet *set1, const PgAioWorkerSet *set2)
+{
+	*set1 &= ~*set2;
+}
+
+static void
+pgaio_worker_set_insert(PgAioWorkerSet *set, int worker)
+{
+	*set |= pgaio_worker_set_singleton(worker);
+}
+
+static void
+pgaio_worker_set_remove(PgAioWorkerSet *set, int worker)
+{
+	*set &= ~pgaio_worker_set_singleton(worker);
+}
+
+static void
+pgaio_worker_set_remove_less_than(PgAioWorkerSet *set, int worker)
+{
+	*set &= ~(pgaio_worker_set_singleton(worker) - 1);
+}
+
+static int
+pgaio_worker_set_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_worker_set_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
+
+static int
+pgaio_worker_set_get_lowest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_worker_set_is_empty(set));
+	return pg_rightmost_one_pos64(*set);
+}
+
+static int
+pgaio_worker_set_pop_lowest(PgAioWorkerSet *set)
+{
+	int			worker = pgaio_worker_set_get_lowest(set);
+
+	pgaio_worker_set_remove(set, worker);
+	return worker;
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_worker_set_contains(PgAioWorkerSet *set, int worker)
+{
+	return (*set & pgaio_worker_set_singleton(worker)) != 0;
+}
+
+static int
+pgaio_worker_set_count(PgAioWorkerSet *set)
+{
+	return pg_popcount64(*set);
+}
+#endif
+
 static void
 pgaio_worker_shmem_request(void *arg)
 {
@@ -133,37 +246,107 @@ pgaio_worker_shmem_init(void *arg)
 	io_worker_submission_queue->size = queue_size;
 	io_worker_submission_queue->head = 0;
 	io_worker_submission_queue->tail = 0;
+	io_worker_control->grow = false;
+	pgaio_worker_set_initialize(&io_worker_control->worker_set);
+	pgaio_worker_set_initialize(&io_worker_control->idle_worker_set);
 
-	io_worker_control->idle_worker_mask = 0;
 	for (int i = 0; i < MAX_IO_WORKERS; ++i)
+		io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+}
+
+static void
+pgaio_worker_grow(bool grow)
+{
+	/*
+	 * This is called from sites that don't hold AioWorkerControlLock, but
+	 * these values change infrequently and an up-to-date value is not
+	 * required for this heuristic purpose.
+	 */
+	if (!grow)
+	{
+		/* Avoid dirtying memory if not already set. */
+		if (io_worker_control->grow)
+			io_worker_control->grow = false;
+	}
+	else
 	{
-		io_worker_control->workers[i].latch = NULL;
-		io_worker_control->workers[i].in_use = false;
+		/* Do nothing if request already pending. */
+		if (!io_worker_control->grow)
+		{
+			io_worker_control->grow = true;
+			SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
+		}
 	}
 }
 
+/*
+ * Called by the postmaster to check if a new worker is needed.
+ */
+bool
+pgaio_worker_test_grow(void)
+{
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the grow flag.
+ */
+void
+pgaio_worker_clear_grow(void)
+{
+	if (io_worker_control)
+		io_worker_control->grow = false;
+}
+
 static int
-pgaio_worker_choose_idle(void)
+pgaio_worker_choose_idle(int minimum_worker)
 {
+	PgAioWorkerSet worker_set;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	worker_set = io_worker_control->idle_worker_set;
+	pgaio_worker_set_remove_less_than(&worker_set, minimum_worker);
+	if (pgaio_worker_set_is_empty(&worker_set))
 		return -1;
 
-	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	/* Find the lowest numbered idle worker and mark it not idle. */
+	worker = pgaio_worker_set_get_lowest(&worker_set);
+	pgaio_worker_set_remove(&io_worker_control->idle_worker_set, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_worker_set.  That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -185,6 +368,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	int			result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return -1;				/* empty */
@@ -201,6 +386,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -226,8 +413,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 				break;
 			}
 
-			if (wakeup == NULL)
-			{
-				/* Choose an idle worker to wake up if we haven't already. */
-				worker = pgaio_worker_choose_idle();
-				if (worker >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
-
-				pgaio_debug_io(DEBUG4, staged_ios[i],
-							   "choosing worker %d",
-							   worker);
-			}
+			/* Choose one worker to wake for this batch. */
+			if (worker == -1)
+				worker = pgaio_worker_choose_idle(0);
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
+
+		/* Wake up chosen worker.  It will wake peers if necessary. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
 	}
 	else
 	{
@@ -273,9 +455,6 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		nsync = num_staged_ios;
 	}
 
-	if (wakeup)
-		SetLatch(wakeup);
-
 	/* Run whatever is left synchronously. */
 	if (nsync > 0)
 	{
@@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
+	pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
+	notify_set = io_worker_control->worker_set;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
+	while (!pgaio_worker_set_is_empty(&notify_set))
+		pgaio_worker_wake(pgaio_worker_set_pop_lowest(&notify_set));
 }
 
 /*
@@ -312,33 +504,34 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_worker_set;
+	PgAioWorkerSet old_worker_set;
+
 	MyIoWorkerId = -1;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	pgaio_worker_set_fill(&free_worker_set);
+	pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
+	if (!pgaio_worker_set_is_empty(&free_worker_set))
+		MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
-	}
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
 
-	if (MyIoWorkerId == -1)
-		elog(ERROR, "couldn't find a free worker slot");
+	old_worker_set = io_worker_control->worker_set;
+	Assert(!pgaio_worker_set_contains(&old_worker_set, MyIoWorkerId));
+	pgaio_worker_set_insert(&io_worker_control->worker_set, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	/* Notify other workers on pool change. */
+	while (!pgaio_worker_set_is_empty(&old_worker_set))
+		pgaio_worker_wake(pgaio_worker_set_pop_lowest(&old_worker_set));
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -364,14 +557,48 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes, making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	PgAioWorkerSet worker_set;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	worker_set = io_worker_control->worker_set;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_worker_set_get_highest(&worker_set))
+		return false;
+
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			ios = 0;
+	int			wakeups = 0;
 
 	AuxiliaryProcessMainCommon();
 
@@ -439,10 +666,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		grow = false;
 
 		/*
 		 * Try to get a job to do.
@@ -453,38 +679,64 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
+									MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
+									MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do that
+			 * if this worker is not receiving spurious wakeups itself.
+			 *
+			 * This heuristic tries to discover the useful wakeup propagation
+			 * chain length when IOs are very fast and workers wake up to find
+			 * that all IOs have already been taken.
+			 *
+			 * If we chose not to wake a worker when we ideally should have,
+			 * the ratio will soon be corrected.
+			 */
+			if (wakeups <= ios)
 			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
+				queue_depth = pgaio_worker_submission_queue_depth();
+				if (queue_depth > 0)
+				{
+					worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);
+
+					/*
+					 * If there were no idle higher numbered peers and there
+					 * are more than enough IOs queued for me and all lower
+					 * numbered peers, then try to start a new worker.
+					 */
+					if (worker == -1 && queue_depth > MyIoWorkerId)
+						grow = true;
+				}
 			}
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
+		else if (grow)
+			pgaio_worker_grow(true);
 
 		if (io_index != -1)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++ios == PGAIO_WORKER_STATS_MAX)
+			{
+				wakeups /= 2;
+				ios /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -537,6 +789,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			}
 #endif
 
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: [%s] %s",
+					MyIoWorkerId,
+					pgaio_io_get_op_name(ioh),
+					pgaio_io_get_target_description(ioh));
+			set_ps_display(cmd);
+#endif
+
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -550,8 +810,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker request if pending. */
+			pgaio_worker_grow(false);
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* On first sleep, compute absolute timeout. */
+				if (idle_timeout_abs == 0)
+				{
+					idle_timeout_abs =
+						TimestampTzPlusMilliseconds(now,
+													io_worker_idle_timeout);
+					timeout_guc_used = io_worker_idle_timeout;
+				}
+
+				/*
+				 * All workers maintain the absolute timeout value, but only
+				 * the highest worker can actually time out and only if
+				 * io_min_workers is satisfied.  All others wait only for
+				 * explicit wakeups caused by queue insertion, wakeup
+				 * propagation, change of pool size (possibly promoting one to
+				 * new highest) or GUC reload.
+				 */
+				if (pgaio_worker_can_timeout())
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now,
+														idle_timeout_abs);
+				else
+					timeout_ms = -1;
+			}
+
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: idle, wakeups:ios = %d:%d",
+					MyIoWorkerId, wakeups, ios);
+			set_ps_display(cmd);
+#endif
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++wakeups == PGAIO_WORKER_STATS_MAX)
+				{
+					wakeups /= 2;
+					ios /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -561,6 +888,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0a6d16f8154..4f9e88f1402 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -368,6 +368,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 LogicalDecodingControl	"Waiting to read or update logical decoding status information."
 DataChecksumsWorker	"Waiting for data checksums worker."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7a8a5d0764c..4b27856ea44 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1382,6 +1382,14 @@
   check_hook => 'check_io_max_concurrency',
 },
 
+{ name => 'io_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_max_workers',
+  boot_val => '8',
+  min => '1',
+  max => 'MAX_IO_WORKERS',
+},
+
 { name => 'io_method', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_IO',
   short_desc => 'Selects the method for executing asynchronous I/O.',
   variable => 'io_method',
@@ -1390,14 +1398,32 @@
   assign_hook => 'assign_io_method',
 },
 
-{ name => 'io_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
-  short_desc => 'Number of IO worker processes, for io_method=worker.',
-  variable => 'io_workers',
-  boot_val => '3',
+{ name => 'io_min_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_min_workers',
+  boot_val => '1',
   min => '1',
   max => 'MAX_IO_WORKERS',
 },
 
+{ name => 'io_worker_idle_timeout', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum time before idle I/O worker processes time out, for io_method=worker.',
+  variable => 'io_worker_idle_timeout',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '60000',
+  min => '0',
+  max => 'INT_MAX',
+},
+
+{ name => 'io_worker_launch_interval', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum time before launching a new I/O worker process, for io_method=worker.',
+  variable => 'io_worker_launch_interval',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX',
+},
+
 # Not for general use --- used by SET SESSION AUTHORIZATION and SET
 # ROLE
 { name => 'is_superuser', type => 'bool', context => 'PGC_INTERNAL', group => 'UNGROUPED',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 10a281dfd4b..4d6321029b3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -218,7 +218,11 @@
                                         # can execute simultaneously
                                         # -1 sets based on shared_buffers
                                         # (change requires restart)
-#io_workers = 3                         # 1-32;
+
+#io_min_workers = 1                     # 1-32 (change requires pg_reload_conf())
+#io_max_workers = 8                     # 1-32
+#io_worker_idle_timeout = 60s
+#io_worker_launch_interval = 100ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index f7d5998a138..78f49d6ccf0 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,14 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+/* Public GUCs. */
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_test_grow(void);
+extern void pgaio_worker_clear_grow(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index af8553bcb6c..d7eb648bd27 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -88,6 +88,7 @@ PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
 PG_LWLOCK(55, LogicalDecodingControl)
 PG_LWLOCK(56, DataChecksumsWorker)
+PG_LWLOCK(57, AioWorkerControl)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 001e6eea61c..bcce4011790 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_GROW,	/* I/O worker pool wants to grow */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index 34bc132ea08..b9775811d4d 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,24 +65,24 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		like(
 			$stderr,
-			qr/$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			qr/$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating number of io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e9430e07b36..a0955420d35 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2265,6 +2265,7 @@ PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgAioWorkerControl
+PgAioWorkerSet
 PgAioWorkerSlot
 PgAioWorkerSubmissionQueue
 PgArchData
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-06 18:14                 ` Andres Freund <[email protected]>
  2026-04-07 10:39                   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Andres Freund @ 2026-04-06 18:14 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,

On 2026-04-07 03:02:52 +1200, Thomas Munro wrote:
> Here's an updated patch.  It's mostly just rebased over the recent
> firehose, but with lots of comments and a few names (hopefully)
> improved.  There is one code change to highlight though:
>
> maybe_start_io_workers() knows when it's not allowed to create new
> workers, an interesting case being FatalError before we have started
> the new world.

*worker, I assume?


> The previous coding of DetermineSleepTime() didn't
> know about that, so it could return 0 (don't sleep), and then the
> postmaster could busy-wait for restart progress.

In master or the prior version of your patch?


> Maybe there were
> other cases like that, but in general DetermineSleepTime() and
> maybe_start_io_workers() really need to be 100% in agreement.  So I
> have moved that knowledge into a new function
> maybe_start_io_workers_scheduled_at().  Both DetermineSleepTime() and
> maybe_start_io_workers() call that so there is a single source of
> truth.
>
> I think I got confused about that because it's not that obvious why
> the existing code doesn't test FatalError.
>
> I thought of a slightly bigger refactoring that might deconfuse
> DetermineSleepTime() a bit more.  Probably material for the next
> cycle, but basically the idea is to stop using a bunch of different
> conditions and different units of time and convert the whole thing to
> a simple find-the-lowest-time function.  I kept that separate.
>
> I'll post a new version of the patch that was v3-0002 separately.


> From 6c5d16a15add62c68bb7f9c7b6a1e3bde1f406d8 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <[email protected]>
> Date: Sat, 22 Mar 2025 00:36:49 +1300
> Subject: [PATCH v4 1/2] aio: Adjust I/O worker pool size automatically.
>
> The size of the I/O worker pool used to implement io_method=worker was
> previously controlled by the io_workers setting, defaulting to 3.  It
> was hard to know how to tune it effectively.  It is now replaced with:
>
>   io_min_workers=1
>   io_max_workers=8 (up to 32)
>   io_worker_idle_timeout=60s
>   io_worker_launch_interval=100ms

I'm a bit concerned about defaulting to io_min_workers=1. That means in an
intermittent workload, there will be no IO concurrency for short running but
IO intensive queries, while having the dispatch overhead to the worker.  It
can still be a win if the query is CPU intensive, but far from all are.

I'd therefore argue that the minimum ought to be at least 2.


> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> index 6f13e8f40a0..c42564500c6 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c


> @@ -1555,14 +1558,13 @@ checkControlFile(void)
>  static int
>  DetermineSleepTime(void)
>  {
> -	TimestampTz next_wakeup = 0;
> +	TimestampTz next_wakeup;
>
>  	/*
> -	 * Normal case: either there are no background workers at all, or we're in
> -	 * a shutdown sequence (during which we ignore bgworkers altogether).
> +	 * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
> +	 * ignore everything else and wait for that.
>  	 */
> -	if (Shutdown > NoShutdown ||
> -		(!StartWorkerNeeded && !HaveCrashedWorker))
> +	if (Shutdown >= ImmediateShutdown || FatalError)
>  	{
>  		if (AbortStartTime != 0)
>  		{
> @@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
>
>  			return seconds * 1000;
>  		}
> -		else
> -			return 60 * 1000;
>  	}
>
> -	if (StartWorkerNeeded)
> +	/* Time of next maybe_start_io_workers() call, or 0 for none. */
> +	next_wakeup = maybe_start_io_workers_scheduled_at();
> +
> +	/* Ignore bgworkers during shutdown. */
> +	if (StartWorkerNeeded && Shutdown == NoShutdown)
>  		return 0;

Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
here?

> -	if (HaveCrashedWorker)
> +	if (HaveCrashedWorker && Shutdown == NoShutdown)
>  	{
>  		dlist_mutable_iter iter;
>


> @@ -3797,6 +3811,15 @@ process_pm_pmsignal(void)
>  		StartWorkerNeeded = true;
>  	}
>
> +	/* Process IO worker start requests. */
> +	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
> +	{
> +		/*
> +		 * No local flag, as the state is exposed through pgaio_worker_*()
> +		 * functions.  This signal is received on potentially actionable level
> +		 * changes, so that maybe_start_io_workers() will run.
> +		 */
> +	}
>  	/* Process background worker state changes. */
>  	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
>  	{

Absolute nitpick - the different blocks so far have been separated by an empty
line.



> +	/* Only proceed if a "grow" request is pending from existing workers. */
> +	if (!pgaio_worker_test_grow())
> +		return 0;

So this accesses shared memory from postmaster.  I think this amount of access
is safe enough that that's ok. You'd have to somehow have corrupted
postmaster's copy of io_worker_control, or unmapped the shared memory it is
pointed to, for that to cause a crash.  The first shouldn't be an issue, the
latter would be quite the confusion of the state machine.


> +/*
> + * Start I/O workers if required.  Used at startup, to respond to change of
> + * the io_min_workers GUC, when asked to start a new one due to submission
> + * queue backlog, and after workers terminate in response to errors (by
> + * starting "replacement" workers).
> + */
> +static void
> +maybe_start_io_workers(void)
> +{
> +	TimestampTz scheduled_at;
>
> -	/* Not enough running? */
> -	while (io_worker_count < io_workers)
> +	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
>  	{
> +		TimestampTz now = GetCurrentTimestamp();
>  		PMChild    *child;
>  		int			i;
>
> +		Assert(pmState < PM_WAIT_IO_WORKERS);
> +
> +		/* Still waiting for the scheduled time? */
> +		if (scheduled_at > now)
> +			break;
> +
> +		/* Clear the grow request flag if it is set. */
> +		pgaio_worker_clear_grow();
> +
> +		/*
> +		 * Compute next launch time relative to the previous value, so that
> +		 * time spent on the postmaster's other duties don't result in an
> +		 * inaccurate launch interval.
> +		 */
> +		io_worker_launch_next_time =
> +			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
> +										io_worker_launch_interval);
> +
> +		/*
> +		 * If that's already in the past, the interval is either impossibly
> +		 * short or we received no requests for new workers for a period.
> +		 * Compute a new future time relative to the last launch time instead.
> +		 */
> +		if (io_worker_launch_next_time <= now)
> +			io_worker_launch_next_time =
> +				TimestampTzPlusMilliseconds(io_worker_launch_last_time,
> +											io_worker_launch_interval);

Did you intend to use TimestampTzPlusMilliseconds(now, ...) here?  Or did you
want to have this if after the next line:

> +		io_worker_launch_last_time = now;
> +

Because otherwise I don't understand how this is intended to work.
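
For illustration, the first reading (falling back to "now" rather than to the
last launch time) might look like the minimal sketch below.  It is not the
patch's code: ts_usec, ts_plus_ms() and next_launch_time() are hypothetical
stand-ins for TimestampTz, TimestampTzPlusMilliseconds() and the inline logic
in maybe_start_io_workers(), assuming timestamps are microsecond counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for TimestampTz: a count of microseconds. */
typedef int64_t ts_usec;

/* Add a millisecond interval to a microsecond timestamp. */
static ts_usec
ts_plus_ms(ts_usec t, int ms)
{
	return t + (ts_usec) ms * 1000;
}

/*
 * Advance the schedule by one interval relative to the previously scheduled
 * time, so that time spent elsewhere doesn't stretch the interval.  If the
 * result is already in the past, start a fresh schedule relative to "now".
 */
static ts_usec
next_launch_time(ts_usec prev_next, ts_usec now, int interval_ms)
{
	ts_usec		next = ts_plus_ms(prev_next, interval_ms);

	if (next <= now)
		next = ts_plus_ms(now, interval_ms);
	return next;
}
```

With this variant the result is never earlier than now, which the expression
based on io_worker_launch_last_time cannot guarantee while that variable still
holds the previous launch time.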


>  		/* find unused entry in io_worker_children array */
>  		for (i = 0; i < MAX_IO_WORKERS; ++i)
>  		{
> @@ -4454,20 +4539,14 @@ maybe_adjust_io_workers(void)
>  			++io_worker_count;
>  		}
>  		else
> -			break;				/* try again next time */
> -	}
> -
> -	/* Too many running? */
> -	if (io_worker_count > io_workers)
> -	{
> -		/* ask the IO worker in the highest slot to exit */
> -		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
>  		{
> -			if (io_worker_children[i] != NULL)
> -			{
> -				kill(io_worker_children[i]->pid, SIGUSR2);
> -				break;
> -			}
> +			/*
> +			 * Fork failure: we'll try again after the launch interval
> +			 * expires, or be called again without delay if we don't yet have
> +			 * io_min_workers.  Don't loop here though, the postmaster has
> +			 * other duties.
> +			 */
> +			break;
>  		}
>  	}
>  }

Reading just this part of the diff I am wondering what is responsible for
reducing the number of workers below the max after a config change.  I assume
it's done in the workers, but it might be worth putting a comment here noting
that.

> +/* Debugging support: show current IO and wakeups:ios statistics in ps. */
> +/* #define PGAIO_WORKER_SHOW_PS_INFO */
>
>  typedef struct PgAioWorkerSubmissionQueue
>  {
> @@ -63,13 +67,34 @@ typedef struct PgAioWorkerSubmissionQueue
>
>  typedef struct PgAioWorkerSlot
>  {
> -	Latch	   *latch;
> -	bool		in_use;
> +	ProcNumber	proc_number;
>  } PgAioWorkerSlot;
>
> +/*
> + * Sets of worker IDs are held in a simple bitmap, accessed through functions
> + * that provide a more readable abstraction.  If we wanted to support more
> + * workers than that, the contention on the single queue would surely get too
> + * high, so we might want to consider multiple pools instead of widening this.
> + */
> +typedef uint64 PgAioWorkerSet;

> +#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
> +
> +static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
> +
>  typedef struct PgAioWorkerControl
>  {
> -	uint64		idle_worker_mask;
> +	/* Seen by postmaster */
> +	volatile bool grow;

What's that volatile intending to do here? It avoids the need for some
compiler barriers, but it's not clear to me those would be needed here anyway.
And it doesn't imply memory ordering, which I'm not sure is entirely wise
here.  I'd probably just plop a full memory barrier in the few relevant
places, easier to reason about that way, and it can't matter given the
infrequency of access.  I'd say we should just use a proper atomic, but right
now I don't think we do that in postmaster.
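
As a rough sketch of what explicit ordering could look like, the following
uses C11 <stdatomic.h> purely as a stand-in; it is not PostgreSQL's
pg_memory_barrier() or pg_atomic_* API, and grow_flag, request_grow(),
test_grow() and clear_grow() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared flag; zero-initialized, i.e. false. */
static atomic_bool grow_flag;

/* Worker side: ask for a new worker, with sequentially consistent ordering. */
static void
request_grow(void)
{
	atomic_store_explicit(&grow_flag, true, memory_order_seq_cst);
}

/* Postmaster side: check whether a request is pending. */
static bool
test_grow(void)
{
	return atomic_load_explicit(&grow_flag, memory_order_seq_cst);
}

/* Postmaster side: clear the request, returning whether one was pending. */
static bool
clear_grow(void)
{
	return atomic_exchange_explicit(&grow_flag, false, memory_order_seq_cst);
}
```

Given how rarely the flag is touched, the cost of the stronger ordering should
not matter; the point is only that the ordering is stated rather than implied.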


> +	/* Protected by AioWorkerSubmissionQueueLock. */
> +	PgAioWorkerSet idle_worker_set;
> +
> +	/* Protected by AioWorkerControlLock. */
> +	PgAioWorkerSet worker_set;
> +	int			nworkers;
> +
> +	/* Protected by AioWorkerControlLock. */
>  	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
>  } PgAioWorkerControl;
>
> @@ -91,15 +116,103 @@ const IoMethodOps pgaio_worker_ops = {
>
>
> +static bool
> +pgaio_worker_set_is_empty(PgAioWorkerSet *set)
> +{
> +	return *set == 0;
> +}
> +
> +static PgAioWorkerSet
> +pgaio_worker_set_singleton(int worker)
> +{
> +	return UINT64_C(1) << worker;
> +}

I guess an assert about `worker` being small enough wouldn't hurt.
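
Something along these lines, as a standalone sketch: it uses plain assert()
and uint64_t instead of Assert() and uint64, and assumes MAX_IO_WORKERS is 32
as suggested by the "1-32" range in postgresql.conf.sample.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, taken from the documented 1-32 range. */
#define MAX_IO_WORKERS 32

typedef uint64_t WorkerSet;		/* stand-in for PgAioWorkerSet */

/* Build a set containing only the given worker ID. */
static WorkerSet
worker_set_singleton(int worker)
{
	/* Shifting by a negative or too-large count would be undefined. */
	assert(worker >= 0 && worker < MAX_IO_WORKERS);
	return UINT64_C(1) << worker;
}
```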


> +static void
> +pgaio_worker_set_fill(PgAioWorkerSet *set)
> +{
> +	*set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
> +}

What does "_fill" really mean?  Just that all valid bits are set? Why wouldn't
it be _all() or _full()?


> +static int
> +pgaio_worker_set_get_highest(PgAioWorkerSet *set)
> +{
> +	Assert(!pgaio_worker_set_is_empty(set));
> +	return pg_leftmost_one_pos64(*set);
> +}

"worker_set_get*" reads quite awkwardly.  Maybe just going for
pgaio_workerset_* would help?

Or maybe just name it PgAioWset/pgaio_wset_ or such?


> +static void
> +pgaio_worker_grow(bool grow)
> +{
> +	/*
> +	 * This is called from sites that don't hold AioWorkerControlLock, but
> +	 * these values change infrequently and an up-to-date value is not
> +	 * required for this heuristic purpose.
> +	 */

Is it actually useful to do this while not holding the control lock?  Ah, I
see, this is due to the split of submission and control lock.


> +	if (!grow)
> +	{
> +		/* Avoid dirtying memory if not already set. */
> +		if (io_worker_control->grow)
> +			io_worker_control->grow = false;

Hm. pgaio_worker_grow(grow=false) is a bit odd.  And this is basically a copy
of pgaio_worker_clear_grow() - I realize that's intended for postmaster, but
somehow it's a bit odd.

Maybe just name it pgaio_worker_set_grow()?



> +/*
> + * Called by the postmaster to check if a new worker is needed.
> + */
> +bool
> +pgaio_worker_test_grow(void)
> +{
> +	return io_worker_control && io_worker_control->grow;
> +}
> +
> +/*
> + * Called by the postmaster to clear the grow flag.
> + */
> +void
> +pgaio_worker_clear_grow(void)
> +{
> +	if (io_worker_control)
> +		io_worker_control->grow = false;
> +}

Maybe we should add _pm_ in there to make it clearer that they're not for
general use?


> @@ -226,8 +413,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
>  {
>  	PgAioHandle **synchronous_ios = NULL;
>  	int			nsync = 0;
> -	Latch	   *wakeup = NULL;
> -	int			worker;
> +	int			worker = -1;
>
>  	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
>
> @@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
>  				break;
>  			}
>
> -			if (wakeup == NULL)
> -			{
> -				/* Choose an idle worker to wake up if we haven't already. */
> -				worker = pgaio_worker_choose_idle();
> -				if (worker >= 0)
> -					wakeup = io_worker_control->workers[worker].latch;
> -
> -				pgaio_debug_io(DEBUG4, staged_ios[i],
> -							   "choosing worker %d",
> -							   worker);
> -			}
> +			/* Choose one worker to wake for this batch. */
> +			if (worker == -1)
> +				worker = pgaio_worker_choose_idle(0);
>  		}

If we only want to do this once per "batch", why not just do it outside the
num_staged_ios loop?


> @@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
>  static void
>  pgaio_worker_die(int code, Datum arg)
>  {
> -	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> -	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
> -	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
> +	PgAioWorkerSet notify_set;
>
> -	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
> -	io_worker_control->workers[MyIoWorkerId].in_use = false;
> -	io_worker_control->workers[MyIoWorkerId].latch = NULL;
> +	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> +	pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
>  	LWLockRelease(AioWorkerSubmissionQueueLock);
> +
> +	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
> +	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
> +	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
> +	Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
> +	pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
> +	notify_set = io_worker_control->worker_set;
> +	Assert(io_worker_control->nworkers > 0);
> +	io_worker_control->nworkers--;
> +	Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
> +		   io_worker_control->nworkers);
> +	LWLockRelease(AioWorkerControlLock);
> +
> +	/* Notify other workers on pool change. */

Why are we notifying them on pool changes?


> +	while (!pgaio_worker_set_is_empty(&notify_set))
> +		pgaio_worker_wake(pgaio_worker_set_pop_lowest(&notify_set));

I did already wonder further up if pgaio_worker_wake() should just receive a
worker_set as an argument.


> @@ -312,33 +504,34 @@ pgaio_worker_die(int code, Datum arg)
>  static void
>  pgaio_worker_register(void)
>  {
> +	PgAioWorkerSet free_worker_set;
> +	PgAioWorkerSet old_worker_set;
> +
>  	MyIoWorkerId = -1;
>
> -	/*
> -	 * XXX: This could do with more fine-grained locking. But it's also not
> -	 * very common for the number of workers to change at the moment...
> -	 */
> -	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> +	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);

I guess it could be useful to assert that nworkers is small enough before
doing anything.


> +	pgaio_worker_set_fill(&free_worker_set);
> +	pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
> +	if (!pgaio_worker_set_is_empty(&free_worker_set))
> +		MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
> +	if (MyIoWorkerId == -1)
> +		elog(ERROR, "couldn't find a free worker ID");

I'd probably add a comment saying "/* find lowest unused worker ID */" or
such, that was more immediately obvious in the old code.


> +/*
> + * Check if this backend is allowed to time out, and thus should use a
> + * non-infinite sleep time.  Only the highest-numbered worker is allowed to
> + * time out, and only if the pool is above io_min_workers.  Serializing
> + * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
> + * io_min_workers.

But it's ok if a lower numbered worker errors out, right?  There will be a
temporary gap, but we will start a new worker for it?  Does that happen even
if there's a shrink of the set of required workers at the same time as a lower
numbered worker errors out?


> @@ -439,10 +666,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
>  	while (!ShutdownRequestPending)
>  	{
>  		uint32		io_index;
> -		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
> -		int			nlatches = 0;
> -		int			nwakeups = 0;
> -		int			worker;
> +		int			worker = -1;
> +		int			queue_depth = 0;
> +		bool		grow = false;
>
>  		/*
>  		 * Try to get a job to do.
> @@ -453,38 +679,64 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
>  		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
>  		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
>  		{
> -			/*
> -			 * Nothing to do.  Mark self idle.
> -			 *
> -			 * XXX: Invent some kind of back pressure to reduce useless
> -			 * wakeups?
> -			 */
> -			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
> +			/* Nothing to do.  Mark self idle. */
> +			pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
> +									MyIoWorkerId);
>  		}
>  		else
>  		{
>  			/* Got one.  Clear idle flag. */
> -			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
> +			pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
> +									MyIoWorkerId);

Wonder if we should keep track of whether we marked ourselves idle to avoid
needing to do that.  But that would be a separate optimization really.



> +			/*
> +			 * See if we should wake up a higher numbered peer.  Only do that
> +			 * if this worker is not receiving spurious wakeups itself.

The "not receiving spurious wakeups" condition is wakeups < ios?

I think both "wakeups" and "ios" are a bit too generically named. Based on the
names I have no idea what this heuristic might be.


> +			 * This heuristic tries to discover the useful wakeup propagation
> +			 * chain length when IOs are very fast and workers wake up to find
> +			 * that all IOs have already been taken.
> +			 *
> +			 * If we chose not to wake a worker when we ideally should have,
> +			 * the ratio will soon be corrected.
> +			 */
> +			if (wakeups <= ios)
>  			{
> +				queue_depth = pgaio_worker_submission_queue_depth();
> +				if (queue_depth > 0)
> +				{
> +					worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);

Is it a problem that we are passing an ID that's potentially bigger than the
biggest legal worker ID?  It's probably fine as long as MAX_IO_WORKERS is 32
and the bitmap is a 64-bit integer, but ...
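
A defensive variant could bail out before shifting.  The sketch below is an
illustration only: choose_idle_from() is a hypothetical stand-in for
pgaio_worker_choose_idle(), MAX_IO_WORKERS is assumed to be 32, and a simple
loop replaces whatever bit-scan helper the real code uses.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, taken from the documented 1-32 range. */
#define MAX_IO_WORKERS 32

/*
 * Return the lowest idle worker ID >= min_worker, or -1 if there is none.
 * An out-of-range min_worker is answered with -1 instead of being shifted.
 */
static int
choose_idle_from(uint64_t idle_set, int min_worker)
{
	int			pos = 0;

	assert(min_worker >= 0);
	if (min_worker >= MAX_IO_WORKERS)
		return -1;

	/* Drop all workers numbered below min_worker. */
	idle_set &= ~((UINT64_C(1) << min_worker) - 1);
	if (idle_set == 0)
		return -1;

	/* Find the lowest remaining set bit. */
	while ((idle_set & 1) == 0)
	{
		idle_set >>= 1;
		pos++;
	}
	return pos;
}
```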


> +					/*
> +					 * If there were no idle higher numbered peers and there
> +					 * are more than enough IOs queued for me and all lower
> +					 * numbered peers, then try to start a new worker.
> +					 */
> +					if (worker == -1 && queue_depth > MyIoWorkerId)
> +						grow = true;
> +				}

We probably shouldn't request growth when already at the cap? That could
generate a *lot* of pmsignal traffic, I think?
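
As a sketch of the idea: the first two conditions below mirror the quoted
`worker == -1 && queue_depth > MyIoWorkerId` test, and the third is the cap
check suggested here.  should_request_grow() and its parameters are
hypothetical, and it assumes a possibly stale unlocked read of the pool size
is good enough for a heuristic.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a worker that just consumed an IO should ask the postmaster
 * for another worker.
 */
static bool
should_request_grow(int my_worker_id, int queue_depth, bool found_idle_peer,
					int nworkers, int max_workers)
{
	/* An idle higher-numbered peer can be woken instead. */
	if (found_idle_peer)
		return false;

	/* Not more than enough IOs queued for me and all lower-numbered peers. */
	if (queue_depth <= my_worker_id)
		return false;

	/* Already at the cap: a request could never be satisfied. */
	if (nworkers >= max_workers)
		return false;

	return true;
}
```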



I don't have an immediate intuitive understanding of why the submission queue
depth is a good measure here.

If there are 10 workers that are busy 100% of the time, and the submission
queue is usually 6 deep with not-being-worked-on IOs, why do we not want to
start more workers?

It actually seems to work - but I don't actually understand why.


ninja install-test-files
io_max_workers=32
debug_io_direct=data
effective_io_concurrency=16
shared_buffers=5GB

pgbench -i -q -s 100 --fillfactor=30

CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;

My test is:

SET effective_io_concurrency = 20;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;


We end up with ~24-28 workers, even though we never have more than 20 IOs in
flight. Not entirely sure why. I guess it's just that after doing an IO the
worker needs to mark itself idle etc?



>  		if (io_index != -1)
>  		{
>  			PgAioHandle *ioh = NULL;
>
> +			/* Cancel timeout and update wakeup:work ratio. */
> +			idle_timeout_abs = 0;
> +			if (++ios == PGAIO_WORKER_STATS_MAX)
> +			{
> +				wakeups /= 2;
> +				ios /= 2;
> +			}


/* Saturation for counters used to estimate wakeup:work ratio. */
#define PGAIO_WORKER_STATS_MAX 4

STATS_MAX sounds like it's just about some reporting or such.
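
For readers following along, the counters behave roughly like the standalone
sketch below: when either one reaches the limit, both are halved, so the
ratio is roughly preserved while old history decays.  The struct and function
names are hypothetical; only the halving rule, the limit of 4 and the
`wakeups <= ios` test come from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Saturation point for both counters (4 in the patch). */
#define RATIO_SATURATION 4

typedef struct WakeupRatio
{
	int			wakeups;		/* latch wakeups seen */
	int			ios;			/* IOs actually executed */
} WakeupRatio;

/* Halve both counters, decaying history while roughly keeping the ratio. */
static void
ratio_decay(WakeupRatio *r)
{
	r->wakeups /= 2;
	r->ios /= 2;
}

/* Record one executed IO. */
static void
ratio_count_io(WakeupRatio *r)
{
	if (++r->ios == RATIO_SATURATION)
		ratio_decay(r);
}

/* Record one latch wakeup. */
static void
ratio_count_wakeup(WakeupRatio *r)
{
	if (++r->wakeups == RATIO_SATURATION)
		ratio_decay(r);
}

/* Propagate wakeups to peers only if this worker isn't mostly woken in vain. */
static bool
ratio_allows_propagation(const WakeupRatio *r)
{
	return r->wakeups <= r->ios;
}
```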


>  			ioh = &pgaio_ctl->io_handles[io_index];
>  			error_ioh = ioh;
>  			errcallback.arg = ioh;
> @@ -537,6 +789,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
>  			}
>  #endif
>
> +#ifdef PGAIO_WORKER_SHOW_PS_INFO
> +			sprintf(cmd, "%d: [%s] %s",
> +					MyIoWorkerId,
> +					pgaio_io_get_op_name(ioh),
> +					pgaio_io_get_target_description(ioh));
> +			set_ps_display(cmd);
> +#endif

Note that this leaks memory. See the target_description comment:

/*
 * Return a stringified description of the IO's target.
 *
 * The string is localized and allocated in the current memory context.
 */


>  			/*
>  			 * We don't expect this to ever fail with ERROR or FATAL, no need
>  			 * to keep error_ioh set to the IO.
> @@ -550,8 +810,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
>  		}
>  		else
>  		{
> -			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> -					  WAIT_EVENT_IO_WORKER_MAIN);
> +			int			timeout_ms;
> +
> +			/* Cancel new worker request if pending. */
> +			pgaio_worker_grow(false);

That seems to happen very frequently.


> +			/* Compute the remaining allowed idle time. */
> +			if (io_worker_idle_timeout == -1)
> +			{
> +				/* Never time out. */
> +				timeout_ms = -1;
> +			}
> +			else
> +			{
> +				TimestampTz now = GetCurrentTimestamp();
> +
> +				/* If the GUC changes, reset timer. */
> +				if (idle_timeout_abs != 0 &&
> +					io_worker_idle_timeout != timeout_guc_used)
> +					idle_timeout_abs = 0;
> +
> +				/* On first sleep, compute absolute timeout. */
> +				if (idle_timeout_abs == 0)
> +				{
> +					idle_timeout_abs =
> +						TimestampTzPlusMilliseconds(now,
> +													io_worker_idle_timeout);
> +					timeout_guc_used = io_worker_idle_timeout;
> +				}
> +
> +				/*
> +				 * All workers maintain the absolute timeout value, but only
> +				 * the highest worker can actually time out and only if
> +				 * io_min_workers is satisfied.  All others wait only for
> +				 * explicit wakeups caused by queue insertion, wakeup
> +				 * propagation, change of pool size (possibly promoting one to
> +				 * new highest) or GUC reload.
> +				 */
> +				if (pgaio_worker_can_timeout())
> +					timeout_ms =
> +						TimestampDifferenceMilliseconds(now,
> +														idle_timeout_abs);
> +				else
> +					timeout_ms = -1;


Hm. This way you get very rapid worker pool reductions.  I configured
io_worker_idle_timeout=1s, started a bunch of work and observed the worker
count after the work finished:

Mon 06 Apr 2026 02:08:28 PM EDT (every 1s)

count
32
(1 row)
Mon 06 Apr 2026 02:08:29 PM EDT (every 1s)

count
32
(1 row)
Mon 06 Apr 2026 02:08:30 PM EDT (every 1s)

count
1
(1 row)
Mon 06 Apr 2026 02:08:31 PM EDT (every 1s)

count
1
(1 row)


Of course this is a ridiculously low setting, but it does seem like starting
the timeout even when not the highest numbered worker will lead to a lot of
quick yoyoing.
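
To make the effect concrete, here is a minimal standalone model of the
shrink behaviour (not code from the patch; drain_time_ms() and its
parameters are hypothetical).  It assumes all workers go idle at the same
moment and compares keeping the deadline computed at first sleep with
restarting the idle clock when a worker is promoted to highest:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model, not PostgreSQL code: all workers go idle at time 0 and
 * only the highest-numbered worker may exit once its deadline has passed.
 * Returns the time in ms at which the pool has shrunk to min_workers.
 */
static long
drain_time_ms(int nworkers, int min_workers, long idle_timeout_ms,
			  bool restart_on_promotion)
{
	long		now = 0;
	long		deadline = idle_timeout_ms; /* set by every worker at first sleep */

	while (nworkers > min_workers)
	{
		/* The highest worker sleeps until its deadline, then exits. */
		if (now < deadline)
			now = deadline;
		nworkers--;

		/* The new highest worker keeps its old deadline or starts afresh. */
		if (restart_on_promotion)
			deadline = now + idle_timeout_ms;
	}
	return now;
}
```

With 32 workers, io_min_workers=1 and a 1s timeout, the shared-deadline
variant drains the pool in about one timeout, which matches the 32 -> 1 drop
above, while the restart variant would shed one worker per timeout.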




Greetings,

Andres Freund






* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
@ 2026-04-07 10:39                   ` Thomas Munro <[email protected]>
  2026-04-07 19:01                     ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Thomas Munro @ 2026-04-07 10:39 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Apr 7, 2026 at 6:14 AM Andres Freund <[email protected]> wrote:
> On 2026-04-07 03:02:52 +1200, Thomas Munro wrote:
> > Here's an updated patch.  It's mostly just rebased over the recent
> > firehose, but with lots of comments and a few names (hopefully)
> > improved.  There is one code change to highlight though:
> >
> > maybe_start_io_workers() knows when it's not allowed to create new
> > workers, an interesting case being FatalError before we have started
> > the new world.
>
> *worker, I assume?

Thanks for the review and testing!

I meant the new world when "we're already starting up again", as in
this pre-existing code from master:

    /*
     * Don't start new workers if we're in the shutdown phase of a crash
     * restart. But we *do* need to start if we're already starting up again.
     */
    if (FatalError && pmState >= PM_STOP_BACKENDS)
        return;

> > The previous coding of DetermineSleepTime() didn't
> > know about that, so it could return 0 (don't sleep), and then the
> > postmaster could busy-wait for restart progress.
>
> In master or the prior version of your patch?

master

There is code that checks AbortStartTime and overrides the sleep time.
But it wouldn't be entered if FatalError is true but StartWorkerNeeded
or HaveCrashedWorker also happens to be true.  Maybe that's OK but I
found it odd.

https://github.com/postgres/postgres/blob/a006bc7b1699d952afcb6d786343e8bf0ecc61d6/src/backend/postm...

> > Maybe there were
> > other cases like that, but in general DetermineSleepTime() and
> > maybe_start_io_workers() really need to be 100% in agreement.  So I
> > have moved that knowledge into a new function
> > maybe_start_io_workers_scheduled_at().  Both DetermineSleepTime() and
> > maybe_start_io_workers() call that so there is a single source of
> > truth.
> >
> > I think I got confused about that because it's not that obvious why
> > the existing code doesn't test FatalError.
> >
> > I thought of a slightly bigger refactoring that might deconfuse
> > DetermineSleepTime() a bit more.  Probably material for the next
> > cycle, but basically the idea is to stop using a bunch of different
> > conditions and different units of time and convert the whole thing to
> > a simple find-the-lowest-time function.  I kept that separate.
> >
> > I'll post a new version of the patch that was v3-0002 separately.
>
>
> > From 6c5d16a15add62c68bb7f9c7b6a1e3bde1f406d8 Mon Sep 17 00:00:00 2001
> > From: Thomas Munro <[email protected]>
> > Date: Sat, 22 Mar 2025 00:36:49 +1300
> > Subject: [PATCH v4 1/2] aio: Adjust I/O worker pool size automatically.
> >
> > The size of the I/O worker pool used to implement io_method=worker was
> > previously controlled by the io_workers setting, defaulting to 3.  It
> > was hard to know how to tune it effectively.  It is now replaced with:
> >
> >   io_min_workers=1
> >   io_max_workers=8 (up to 32)
> >   io_worker_idle_timeout=60s
> >   io_worker_launch_interval=100ms
>
> I'm a bit concerned about defaulting to io_min_workers=1. That means in an
> intermittent workload, there will be no IO concurrency for short running but
> IO intensive queries, while having the dispatch overhead to the worker.  It
> can still be a win if the query is CPU intensive, but far from all are.
>
> I'd therefore argue that the minimum ought to be at least 2.

WFM.

> > diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> > index 6f13e8f40a0..c42564500c6 100644
> > --- a/src/backend/postmaster/postmaster.c
> > +++ b/src/backend/postmaster/postmaster.c
>
>
> > @@ -1555,14 +1558,13 @@ checkControlFile(void)
> >  static int
> >  DetermineSleepTime(void)
> >  {
> > -     TimestampTz next_wakeup = 0;
> > +     TimestampTz next_wakeup;
> >
> >       /*
> > -      * Normal case: either there are no background workers at all, or we're in
> > -      * a shutdown sequence (during which we ignore bgworkers altogether).
> > +      * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
> > +      * ignore everything else and wait for that.
> >        */
> > -     if (Shutdown > NoShutdown ||
> > -             (!StartWorkerNeeded && !HaveCrashedWorker))
> > +     if (Shutdown >= ImmediateShutdown || FatalError)
> >       {
> >               if (AbortStartTime != 0)
> >               {
> > @@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
> >
> >                       return seconds * 1000;
> >               }
> > -             else
> > -                     return 60 * 1000;
> >       }
> >
> > -     if (StartWorkerNeeded)
> > +     /* Time of next maybe_start_io_workers() call, or 0 for none. */
> > +     next_wakeup = maybe_start_io_workers_scheduled_at();
> > +
> > +     /* Ignore bgworkers during shutdown. */
> > +     if (StartWorkerNeeded && Shutdown == NoShutdown)
> >               return 0;
>
> Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
> here?

Seems OK?  I mean sure, I would like to make this whole function more
uniform in structure, see my second patch, but...

> > -     if (HaveCrashedWorker)
> > +     if (HaveCrashedWorker && Shutdown == NoShutdown)
> >       {
> >               dlist_mutable_iter iter;
> >
>
>
> > @@ -3797,6 +3811,15 @@ process_pm_pmsignal(void)
> >               StartWorkerNeeded = true;
> >       }
> >
> > +     /* Process IO worker start requests. */
> > +     if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
> > +     {
> > +             /*
> > +              * No local flag, as the state is exposed through pgaio_worker_*()
> > +              * functions.  This signal is received on potentially actionable level
> > +              * changes, so that maybe_start_io_workers() will run.
> > +              */
> > +     }
> >       /* Process background worker state changes. */
> >       if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
> >       {
>
> Absolute nitpick - the different blocks so far have been separated by an empty
> line.

Fixed.

> > +     /* Only proceed if a "grow" request is pending from existing workers. */
> > +     if (!pgaio_worker_test_grow())
> > +             return 0;
>
> So this accesses shared memory from postmaster.  I think this amount of access
> is safe enough that that's ok. You'd have to somehow have corrupted
> postmaster's copy of io_worker_control, or unmapped the shared memory it is
> pointed to, for that to cause a crash.  The first shouldn't be an issue, the
> latter would be quite the confusion of the state machine.

Cool.

> > +/*
> > + * Start I/O workers if required.  Used at startup, to respond to change of
> > + * the io_min_workers GUC, when asked to start a new one due to submission
> > + * queue backlog, and after workers terminate in response to errors (by
> > + * starting "replacement" workers).
> > + */
> > +static void
> > +maybe_start_io_workers(void)
> > +{
> > +     TimestampTz scheduled_at;
> >
> > -     /* Not enough running? */
> > -     while (io_worker_count < io_workers)
> > +     while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
> >       {
> > +             TimestampTz now = GetCurrentTimestamp();
> >               PMChild    *child;
> >               int                     i;
> >
> > +             Assert(pmState < PM_WAIT_IO_WORKERS);
> > +
> > +             /* Still waiting for the scheduled time? */
> > +             if (scheduled_at > now)
> > +                     break;
> > +
> > +             /* Clear the grow request flag if it is set. */
> > +             pgaio_worker_clear_grow();
> > +
> > +             /*
> > +              * Compute next launch time relative to the previous value, so that
> > +              * time spent on the postmaster's other duties don't result in an
> > +              * inaccurate launch interval.
> > +              */
> > +             io_worker_launch_next_time =
> > +                     TimestampTzPlusMilliseconds(io_worker_launch_next_time,
> > +                                                                             io_worker_launch_interval);
> > +
> > +             /*
> > +              * If that's already in the past, the interval is either impossibly
> > +              * short or we received no requests for new workers for a period.
> > +              * Compute a new future time relative to the last launch time instead.
> > +              */
> > +             if (io_worker_launch_next_time <= now)
> > +                     io_worker_launch_next_time =
> > +                             TimestampTzPlusMilliseconds(io_worker_launch_last_time,
> > +                                                                                     io_worker_launch_interval);
>
> Did you intend to use TimestampTzPlusMilliseconds(now, ...) here?  Or did you
> want to have this if after the next line:
>
> > +             io_worker_launch_last_time = now;
> > +
>
> Because otherwise I don't understand how this is intended to work.

I can't remember why I did it like that.  Changed.
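
For reference, a minimal standalone sketch of the launch-time computation
as I read the suggestion, i.e. falling back to "now" when the regular step
has already passed.  This is an assumption about the fix rather than the
committed code; sketch_ts and next_launch_time() are hypothetical stand-ins
for TimestampTz and the postmaster logic:

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t sketch_ts;		/* stand-in for TimestampTz, in milliseconds */

/*
 * Advance the launch schedule by one interval relative to the previous
 * scheduled time, so time spent on other duties doesn't stretch the interval.
 * If that lands in the past (no requests for a while), restart from now.
 */
static sketch_ts
next_launch_time(sketch_ts prev_next, sketch_ts now, int interval_ms)
{
	sketch_ts	next = prev_next + interval_ms;

	if (next <= now)
		next = now + interval_ms;
	return next;
}
```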

> >               /* find unused entry in io_worker_children array */
> >               for (i = 0; i < MAX_IO_WORKERS; ++i)
> >               {
> > @@ -4454,20 +4539,14 @@ maybe_adjust_io_workers(void)
> >                       ++io_worker_count;
> >               }
> >               else
> > -                     break;                          /* try again next time */
> > -     }
> > -
> > -     /* Too many running? */
> > -     if (io_worker_count > io_workers)
> > -     {
> > -             /* ask the IO worker in the highest slot to exit */
> > -             for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
> >               {
> > -                     if (io_worker_children[i] != NULL)
> > -                     {
> > -                             kill(io_worker_children[i]->pid, SIGUSR2);
> > -                             break;
> > -                     }
> > +                     /*
> > +                      * Fork failure: we'll try again after the launch interval
> > +                      * expires, or be called again without delay if we don't yet have
> > +                      * io_min_workers.  Don't loop here though, the postmaster has
> > +                      * other duties.
> > +                      */
> > +                     break;
> >               }
> >       }
> >  }
>
> Reading just this part of the diff I am wondering what is responsible for
> reducing the number of workers below the max after a config change.  I assume
> it's done in the workers, but it might be worth putting a comment here noting
> that.

Done.

> > +/* Debugging support: show current IO and wakeups:ios statistics in ps. */
> > +/* #define PGAIO_WORKER_SHOW_PS_INFO */
> >
> >  typedef struct PgAioWorkerSubmissionQueue
> >  {
> > @@ -63,13 +67,34 @@ typedef struct PgAioWorkerSubmissionQueue
> >
> >  typedef struct PgAioWorkerSlot
> >  {
> > -     Latch      *latch;
> > -     bool            in_use;
> > +     ProcNumber      proc_number;
> >  } PgAioWorkerSlot;
> >
> > +/*
> > + * Sets of worker IDs are held in a simple bitmap, accessed through functions
> > + * that provide a more readable abstraction.  If we wanted to support more
> > + * workers than that, the contention on the single queue would surely get too
> > + * high, so we might want to consider multiple pools instead of widening this.
> > + */
> > +typedef uint64 PgAioWorkerSet;
>
> > +#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
> > +
> > +static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
> > +
> >  typedef struct PgAioWorkerControl
> >  {
> > -     uint64          idle_worker_mask;
> > +     /* Seen by postmaster */
> > +     volatile bool grow;
>
> What's that volatile intending to do here? It avoids the needs for some
> compiler barriers, but it's not clear to me those would be needed here anyway.
> And it doesn't imply memory ordering, which I'm not sure is entirely wise
> here.  I'd probably just plop a full memory barrier in the few relevant
> places, easier to reason about that way, and it can't matter given the
> infrequency of access.  I'd say we should just use a proper atomic, but right
> now I don't think we do that in postmaster.

Changed to full memory barrier.

> > +     /* Protected by AioWorkerSubmissionQueueLock. */
> > +     PgAioWorkerSet idle_worker_set;
> > +
> > +     /* Protected by AioWorkerControlLock. */
> > +     PgAioWorkerSet worker_set;
> > +     int                     nworkers;
> > +
> > +     /* Protected by AioWorkerControlLock. */
> >       PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
> >  } PgAioWorkerControl;
> >
> > @@ -91,15 +116,103 @@ const IoMethodOps pgaio_worker_ops = {
> >
> >
> > +static bool
> > +pgaio_worker_set_is_empty(PgAioWorkerSet *set)
> > +{
> > +     return *set == 0;
> > +}
> > +
> > +static PgAioWorkerSet
> > +pgaio_worker_set_singleton(int worker)
> > +{
> > +     return UINT64_C(1) << worker;
> > +}
>
> I guess an assert about `worker` being small enough wouldn't hurt.

Done.

> > +static void
> > +pgaio_worker_set_fill(PgAioWorkerSet *set)
> > +{
> > +     *set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
> > +}
>
> What does "_fill" really mean?  Just that all valid bits are set? Why wouldn't
> it be _all() or _full()?

I guess I got that from sigset_t...  Trying pgaio_workerset_all().

> > +static int
> > +pgaio_worker_set_get_highest(PgAioWorkerSet *set)
> > +{
> > +     Assert(!pgaio_worker_set_is_empty(set));
> > +     return pg_leftmost_one_pos64(*set);
> > +}
>
> "worker_set_get*" reads quite awkwardly.  Maybe just going for
> pgaio_workerset_* would help?
>
> Or maybe just name it PgAioWset/pgaio_wset_ or such?

OK let's try "workerset".
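
To show the shape of the abstraction under discussion, here is a
self-contained sketch of a bitmap-backed worker set.  The ws_* names, the
WorkerSet type and SKETCH_MAX_WORKERS are made up for illustration, and it
uses plain loops where the patch uses pg_leftmost_one_pos64() and friends:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_WORKERS 32	/* assumed cap, mirroring "up to 32" */

typedef uint64_t WorkerSet;

#define WORKERSET_BITS ((int) (sizeof(WorkerSet) * CHAR_BIT))

static WorkerSet
ws_singleton(int worker)
{
	/* The range check asked for in review. */
	assert(worker >= 0 && worker < SKETCH_MAX_WORKERS);
	return UINT64_C(1) << worker;
}

/* Set containing every valid worker ID. */
static WorkerSet
ws_all(void)
{
	return UINT64_MAX >> (WORKERSET_BITS - SKETCH_MAX_WORKERS);
}

static bool
ws_is_empty(WorkerSet set)
{
	return set == 0;
}

static void
ws_insert(WorkerSet *set, int worker)
{
	*set |= ws_singleton(worker);
}

static void
ws_remove(WorkerSet *set, int worker)
{
	*set &= ~ws_singleton(worker);
}

static int
ws_lowest(WorkerSet set)
{
	int			pos = 0;

	assert(!ws_is_empty(set));
	while ((set & 1) == 0)
	{
		set >>= 1;
		pos++;
	}
	return pos;
}

static int
ws_highest(WorkerSet set)
{
	int			pos = 0;

	assert(!ws_is_empty(set));
	while ((set >>= 1) != 0)
		pos++;
	return pos;
}

/* Remove and return the lowest member, e.g. to walk a notify set. */
static int
ws_pop_lowest(WorkerSet *set)
{
	int			worker = ws_lowest(*set);

	ws_remove(set, worker);
	return worker;
}
```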

> > +static void
> > +pgaio_worker_grow(bool grow)
> > +{
> > +     /*
> > +      * This is called from sites that don't hold AioWorkerControlLock, but
> > +      * these values change infrequently and an up-to-date value is not
> > +      * required for this heuristic purpose.
> > +      */
>
> Is it actually useful to do this while not holding the control lock?  Ah, I
> see, this is due to the split of submission and control lock.

Yeah actually that comment is just confusing.  Removed.  It's pretty
clear that this flag has the usual sort of postmaster request flag
semantics and tolerates a bit of fuzziness.

> > +     if (!grow)
> > +     {
> > +             /* Avoid dirtying memory if not already set. */
> > +             if (io_worker_control->grow)
> > +                     io_worker_control->grow = false;
>
> Hm. pgaio_worker_grow(grow=false) is a bit odd.  And this is basically a copy
> of pgaio_worker_cancel_grow() - I realize that's intended for postmaster, but
> somehow it's a bit odd.

Hmm, right.

> Maybe just name it pgaio_worker_set_grow()?

OK how about:

pgaio_worker_request_grow()
pgaio_worker_cancel_grow()


> > +/*
> > + * Called by the postmaster to check if a new worker is needed.
> > + */
> > +bool
> > +pgaio_worker_test_grow(void)
> > +{
> > +     return io_worker_control && io_worker_control->grow;
> > +}
> > +
> > +/*
> > + * Called by the postmaster to clear the grow flag.
> > + */
> > +void
> > +pgaio_worker_clear_grow(void)
> > +{
> > +     if (io_worker_control)
> > +             io_worker_control->grow = false;
> > +}
>
> Maybe we should add _pm_ in there to make it clearer that they're not for
> general use?

Done.

> > @@ -226,8 +413,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> >  {
> >       PgAioHandle **synchronous_ios = NULL;
> >       int                     nsync = 0;
> > -     Latch      *wakeup = NULL;
> > -     int                     worker;
> > +     int                     worker = -1;
> >
> >       Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
> >
> > @@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> >                               break;
> >                       }
> >
> > -                     if (wakeup == NULL)
> > -                     {
> > -                             /* Choose an idle worker to wake up if we haven't already. */
> > -                             worker = pgaio_worker_choose_idle();
> > -                             if (worker >= 0)
> > -                                     wakeup = io_worker_control->workers[worker].latch;
> > -
> > -                             pgaio_debug_io(DEBUG4, staged_ios[i],
> > -                                                        "choosing worker %d",
> > -                                                        worker);
> > -                     }
> > +                     /* Choose one worker to wake for this batch. */
> > +                     if (worker == -1)
> > +                             worker = pgaio_worker_choose_idle(0);
> >               }
>
> If we only want to do this once per "batch", why not just do it outside the
> num_staged_ios loop?

Two steps: pgaio_worker_choose_idle() must be done while holding the
queue lock (will probably finish up revising this in future work on
removing locks...).  pgaio_worker_wake() is called outside the loop,
after releasing the lock.

> > @@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> >  static void
> >  pgaio_worker_die(int code, Datum arg)
> >  {
> > -     LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> > -     Assert(io_worker_control->workers[MyIoWorkerId].in_use);
> > -     Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
> > +     PgAioWorkerSet notify_set;
> >
> > -     io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
> > -     io_worker_control->workers[MyIoWorkerId].in_use = false;
> > -     io_worker_control->workers[MyIoWorkerId].latch = NULL;
> > +     LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> > +     pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
> >       LWLockRelease(AioWorkerSubmissionQueueLock);
> > +
> > +     LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
> > +     Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
> > +     io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
> > +     Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
> > +     pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
> > +     notify_set = io_worker_control->worker_set;
> > +     Assert(io_worker_control->nworkers > 0);
> > +     io_worker_control->nworkers--;
> > +     Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
> > +                io_worker_control->nworkers);
> > +     LWLockRelease(AioWorkerControlLock);
> > +
> > +     /* Notify other workers on pool change. */
>
> Why are we notifying them on pool changes?

Comments added to explain.  It closes a wakeup-loss race (imagine if
you consumed a wakeup while you were exiting due to timeout; no one
else would wake up, which I fixed with this big hammer).

> > +     while (!pgaio_worker_set_is_empty(&notify_set))
> > +             pgaio_worker_wake(pgaio_worker_set_pop_lowest(&notify_set));
>
> I did already wonder further up if pgaio_worker_wake() should just receive a
> worker_set as an argument.

I have added pgaio_workerset_wake().

> > @@ -312,33 +504,34 @@ pgaio_worker_die(int code, Datum arg)
> >  static void
> >  pgaio_worker_register(void)
> >  {
> > +     PgAioWorkerSet free_worker_set;
> > +     PgAioWorkerSet old_worker_set;
> > +
> >       MyIoWorkerId = -1;
> >
> > -     /*
> > -      * XXX: This could do with more fine-grained locking. But it's also not
> > -      * very common for the number of workers to change at the moment...
> > -      */
> > -     LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> > +     LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
>
> I guess it could be useful to assert that nworkers is small enough before
> doing anything.

OK.

> > +     pgaio_worker_set_fill(&free_worker_set);
> > +     pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
> > +     if (!pgaio_worker_set_is_empty(&free_worker_set))
> > +             MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
> > +     if (MyIoWorkerId == -1)
> > +             elog(ERROR, "couldn't find a free worker ID");
>
> I'd probably add a comment saying "/* find lowest unused worker ID */" or
> such, that was more immediately obvious in the old code.

Done.

> > +/*
> > + * Check if this backend is allowed to time out, and thus should use a
> > + * non-infinite sleep time.  Only the highest-numbered worker is allowed to
> > + * time out, and only if the pool is above io_min_workers.  Serializing
> > + * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
> > + * io_min_workers.
>
> But it's ok if a lower numbered worker errors out, right?  There will be a
> temporary gap, but we will start a new worker for it?

Yes it is OK for there to be gaps.

If any worker errors out, it will be replaced when reaped if we fell
below io_min_workers, and otherwise replaced via the usual means, ie
once the backlog detection and the launch delay allow it.  I did have
a version that always replaced *every* worker with exit code 1
immediately, but I started wondering if we really want persistent
errors to turn into high speed fork() loops.  I'm still not sure TBH.
We don't expect workers to error out, so it means something is already
pretty screwed up and you might appreciate the rate limiting?

I have an always-replace patch somewhere, as I've vacillated on that
point a couple of times.  I will post a separate fixup for
consideration.

> Does that happen even
> if there's a shrink of the set of required workers at the same time as a lower
> numbered worker errors out?

If a worker errors out (exit code 1) and an idle worker timed out
(exit code 0), then it's no different: if the new count dropped below
io_min_workers, we start a worker immediately after reaping the process.
Otherwise we let the normal algorithm decide to start a new worker
if/when required.
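
The replacement policy described here can be sketched as a single
"when should the next worker start?" function.  This is only a model of the
rules stated in this thread, not the patch's
maybe_start_io_workers_scheduled_at(); PoolState, sketch_ts and
launch_scheduled_at() are hypothetical names, and the order of the checks is
an assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sketch_ts;		/* stand-in for TimestampTz; 0 means "none" */

typedef struct PoolState
{
	int			nworkers;		/* currently running workers */
	int			min_workers;	/* io_min_workers */
	int			max_workers;	/* io_max_workers */
	bool		grow_pending;	/* backlog-driven request from a worker */
	sketch_ts	next_launch_time;	/* rate limit for backlog-driven starts */
} PoolState;

/* Time at which a worker should be started, or 0 if none is wanted. */
static sketch_ts
launch_scheduled_at(const PoolState *pool, sketch_ts now)
{
	/* Below the minimum: replace without delay, whatever the exit code. */
	if (pool->nworkers < pool->min_workers)
		return now;

	/* At the cap, or nobody asked: nothing to schedule. */
	if (pool->nworkers >= pool->max_workers || !pool->grow_pending)
		return 0;

	/* Otherwise wait for the launch interval to expire. */
	return pool->next_launch_time;
}
```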

> > @@ -439,10 +666,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> >       while (!ShutdownRequestPending)
> >       {
> >               uint32          io_index;
> > -             Latch      *latches[IO_WORKER_WAKEUP_FANOUT];
> > -             int                     nlatches = 0;
> > -             int                     nwakeups = 0;
> > -             int                     worker;
> > +             int                     worker = -1;
> > +             int                     queue_depth = 0;
> > +             bool            grow = false;
> >
> >               /*
> >                * Try to get a job to do.
> > @@ -453,38 +679,64 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> >               LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
> >               if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
> >               {
> > -                     /*
> > -                      * Nothing to do.  Mark self idle.
> > -                      *
> > -                      * XXX: Invent some kind of back pressure to reduce useless
> > -                      * wakeups?
> > -                      */
> > -                     io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
> > +                     /* Nothing to do.  Mark self idle. */
> > +                     pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
> > +                                                                     MyIoWorkerId);
> >               }
> >               else
> >               {
> >                       /* Got one.  Clear idle flag. */
> > -                     io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
> > +                     pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
> > +                                                                     MyIoWorkerId);
>
> Wonder if we should keep track of whether we marked ourselves idle to avoid
> needing to do that.  But that would be a separate optimization really.

Fair point.  OK.

> > +                     /*
> > +                      * See if we should wake up a higher numbered peer.  Only do that
> > +                      * if this worker is not receiving spurious wakeups itself.
>
> The "not receiving spurious wakeups" condition is wakeups < ios?

Yes, see new comment near PGAIO_WORKER_WAKEUP_RATIO_SATURATE.

> I think both "wakeups" and "ios" are a bit too generically named. Based on the
> names I have no idea what this heuristic might be.

I have struggled to name them.  Does wakeup_count and io_count help?
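
In case it helps the naming discussion, here is a standalone sketch of the
pair of counters and the test they feed.  The halving on the I/O side
follows the quoted hunk; doing the same on the wakeup side is my assumption,
and WakeupStats, note_wakeup(), note_io() and may_propagate_wakeup() are
hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

/* Saturation point for both counters, as in the quoted define. */
#define SKETCH_RATIO_SATURATE 4

typedef struct WakeupStats
{
	int			wakeup_count;	/* times this worker was woken */
	int			io_count;		/* times it actually found an IO to run */
} WakeupStats;

/* Halve both counters so the ratio tracks recent history only. */
static void
decay(WakeupStats *stats)
{
	stats->wakeup_count /= 2;
	stats->io_count /= 2;
}

static void
note_wakeup(WakeupStats *stats)
{
	if (++stats->wakeup_count == SKETCH_RATIO_SATURATE)
		decay(stats);
}

static void
note_io(WakeupStats *stats)
{
	if (++stats->io_count == SKETCH_RATIO_SATURATE)
		decay(stats);
}

/* Only wake a higher-numbered peer if our own wakeups have been useful. */
static bool
may_propagate_wakeup(const WakeupStats *stats)
{
	return stats->wakeup_count <= stats->io_count;
}
```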

> > +                      * This heuristic tries to discover the useful wakeup propagation
> > +                      * chain length when IOs are very fast and workers wake up to find
> > +                      * that all IOs have already been taken.
> > +                      *
> > +                      * If we chose not to wake a worker when we ideally should have,
> > +                      * the ratio will soon be corrected.
> > +                      */
> > +                     if (wakeups <= ios)
> >                       {
> > +                             queue_depth = pgaio_worker_submission_queue_depth();
> > +                             if (queue_depth > 0)
> > +                             {
> > +                                     worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);
>
> Is it a problem that we are passing an ID that's potentially bigger than the
> biggest legal worker ID?  It's probably fine as long as MAX_WORKERS is 32 and
> the bitmap is a 64bit integer, but ...

Oof.  Fixed.

> > +                                     /*
> > +                                      * If there were no idle higher numbered peers and there
> > +                                      * are more than enough IOs queued for me and all lower
> > +                                      * numbered peers, then try to start a new worker.
> > +                                      */
> > +                                     if (worker == -1 && queue_depth > MyIoWorkerId)
> > +                                             grow = true;
> > +                             }
>
> We probably shouldn't request growth when already at the cap? That could
> generate a *lot* of pmsignal traffic, I think?

No, we only set it if it isn't already set (like a latch), and only
send a pmsignal when we set it (like a latch), and the postmaster only
clears it if it can start a worker (unlike a latch).  That applies in
general, not just when we hit the cap of io_max_workers: while the
postmaster is waiting for the launch interval to expire, it will leave the
flag set, suppressed for 100ms or whatever, and in the special
case of io_max_workers, for as long as the count remains that high.
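
A single-threaded model of those flag semantics, to show why the signal
traffic stays bounded.  It leaves out the memory barriers and the real
pmsignal call, and GrowState, request_grow() and pm_try_start_worker() are
hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct GrowState
{
	bool		grow;			/* the shared request flag */
	int			signals_sent;	/* stands in for pmsignal traffic */
} GrowState;

/* Worker side: set the flag and signal only on a clear -> set transition. */
static void
request_grow(GrowState *state)
{
	if (!state->grow)
	{
		state->grow = true;
		state->signals_sent++;
	}
}

/*
 * Postmaster side: clear the flag only if a worker can actually be started
 * (launch interval expired and pool below io_max_workers).  Returns true if
 * a worker should be launched.
 */
static bool
pm_try_start_worker(GrowState *state, bool can_start)
{
	if (!state->grow || !can_start)
		return false;
	state->grow = false;
	return true;
}
```

While the postmaster cannot act, the flag stays set, so repeated requests
from workers are absorbed without further signals.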

> I don't have an immediate intuitive understanding of why the submission queue
> depth is a good measure here.
>
> If there are 10 workers that are busy 100% of the time, and the submission
> queue is usually 6 deep with not-being-worked-on IOs, why do we not want to
> start more workers?
>
> It actually seems to work - but I don't actually understand why.

I should have made it clearer that that's a secondary condition.  The
primary condition is: a worker wanted to wake another worker, but
found that none were idle.  Unfortunately the whole system is a bit
too asynchronous for that to be a reliable cue on its own.  So, I also
check if the queue appears to be (1) obviously growing: that's clearly
too long and must be introducing latency, or (2) varying "too much".
Which I detect in exactly the same way.

Imagine a histogram that looks like this:

LOG:  depth 00: 7898
LOG:  depth 01: 1630
LOG:  depth 02: 308
LOG:  depth 03: 93
LOG:  depth 04: 40
LOG:  depth 05: 19
LOG:  depth 06: 6
LOG:  depth 07: 4
LOG:  depth 08: 0
LOG:  depth 09: 1
LOG:  depth 10: 1
LOG:  depth 11: 0
LOG:  depth 12: 0
LOG:  depth 13: 0

If you're failing to find idle workers to wake up AND our magic
threshold is hit by something in that long tail, then it'll call for
backup.  Of course I'm totally sidestepping a lot of queueing theory
maths and just saying "I'd better be able to find an idle worker when
I want to" and if not, "there had better not be any outliers that
reach this far".
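
As a worked example with the numbers above: the histogram holds 10000
samples, of which 71 (0.71%) have depth 4 or more and only 2 have
depth 9 or more.  The sketch below just does that arithmetic and
combines it with the two-part condition.  The function names and the
threshold parameter are hypothetical; the real threshold logic lives
in the patch's comment and code, not here.

```c
#include <assert.h>
#include <stddef.h>

/* Queue depth samples from the histogram above, indexed by depth. */
static const unsigned depth_histogram[] = {
	7898, 1630, 308, 93, 40, 19, 6, 4, 0, 1, 1, 0, 0, 0
};

#define HISTOGRAM_SIZE (sizeof(depth_histogram) / sizeof(depth_histogram[0]))

/* Count samples whose depth is at least min_depth. */
static unsigned
samples_at_or_above(size_t min_depth)
{
	unsigned	count = 0;

	assert(min_depth <= HISTOGRAM_SIZE);

	for (size_t depth = min_depth; depth < HISTOGRAM_SIZE; depth++)
		count += depth_histogram[depth];
	return count;
}

/*
 * Hypothetical growth test: ask for another worker only when no idle worker
 * could be found AND the observed depth reaches the threshold.
 */
static int
should_request_growth(int found_idle_worker, unsigned depth,
					  unsigned threshold)
{
	return !found_idle_worker && depth >= threshold;
}
```

Either condition alone is not enough in this sketch: a deep queue with
an idle worker available just needs a wakeup, and a missing idle
worker with a shallow queue is tolerated.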

I've written a longer explanation in a long comment, including a
little challenge for someone to do better with real science and maths.
I hope it's a bit clearer at least.

> ninja install-test-files
> io_max_workers=32
> debug_io_direct=data
> effective_io_concurrency=16
> shared_buffers=5GB
>
> pgbench -i -q -s 100 --fillfactor=30
>
> CREATE EXTENSION IF NOT EXISTS test_aio;
> CREATE EXTENSION IF NOT EXISTS pg_buffercache;
> DROP TABLE IF EXISTS pattern_random_pgbench;
> CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;
>
> My test is:
>
> SET effective_io_concurrency = 20;
> SELECT pg_buffercache_evict_relation('pgbench_accounts');
> SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;
>
>
> We end up with ~24-28 workers, even though we never have more than 20 IOs in
> flight. Not entirely sure why. I guess it's just that after doing an IO the
> worker needs to mark itself idle etc?

Yep.  It would be nice to make it a bit more accurate in later cycles.
It tends to overprovision rather than under, since it thinks all other
workers are busy.  That information is a bit racy.  In this version
I've made a small improvement: it uses nworkers directly, under the
big new comment, instead of an unnecessarily complicated
approximation.

> >               if (io_index != -1)
> >               {
> >                       PgAioHandle *ioh = NULL;
> >
> > +                     /* Cancel timeout and update wakeup:work ratio. */
> > +                     idle_timeout_abs = 0;
> > +                     if (++ios == PGAIO_WORKER_STATS_MAX)
> > +                     {
> > +                             wakeups /= 2;
> > +                             ios /= 2;
> > +                     }
>
>
> /* Saturation for counters used to estimate wakeup:work ratio. */
> #define PGAIO_WORKER_STATS_MAX 4
>
> STATS_MAX sounds like it's just about some reporting or such.

I have renamed it to PGAIO_WORKER_RATIO_MAX and written a big comment
at the top to explain what it's for.
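
For readers who want the idea without opening the patch, here is a
minimal sketch of the saturating counters as the new comment describes
them: when either counter reaches the saturation value, both are
halved, which gives a ratio with a very short memory.  The struct and
function names are hypothetical, and the "looks useless" predicate is
an assumption made for illustration, not the patch's actual test.

```c
#include <assert.h>

#define RATIO_SATURATE 4		/* same value as the patch's #define */

typedef struct WakeupRatio
{
	int			wakeups;		/* wakeups received */
	int			ios;			/* IOs processed */
} WakeupRatio;

/* Halve both counters once either one reaches the saturation value. */
static void
ratio_decay_if_saturated(WakeupRatio *r)
{
	assert(r->wakeups >= 0 && r->ios >= 0);

	if (r->wakeups >= RATIO_SATURATE || r->ios >= RATIO_SATURATE)
	{
		r->wakeups /= 2;
		r->ios /= 2;
	}
}

static void
ratio_count_wakeup(WakeupRatio *r)
{
	r->wakeups++;
	ratio_decay_if_saturated(r);
}

static void
ratio_count_io(WakeupRatio *r)
{
	r->ios++;
	ratio_decay_if_saturated(r);
}

/* Assumed predicate: more recent wakeups than recent IOs. */
static int
ratio_wakeups_look_useless(const WakeupRatio *r)
{
	return r->wakeups > r->ios;
}
```

Four wakeups with no IO leave the counters at 2:0, and four IOs after
that bring them to 1:2, so old history fades after a handful of
events.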

> >                       ioh = &pgaio_ctl->io_handles[io_index];
> >                       error_ioh = ioh;
> >                       errcallback.arg = ioh;
> > @@ -537,6 +789,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> >                       }
> >  #endif
> >
> > +#ifdef PGAIO_WORKER_SHOW_PS_INFO
> > +                     sprintf(cmd, "%d: [%s] %s",
> > +                                     MyIoWorkerId,
> > +                                     pgaio_io_get_op_name(ioh),
> > +                                     pgaio_io_get_target_description(ioh));
> > +                     set_ps_display(cmd);
> > +#endif
>
> Note that this leaks memory. See the target_description comment:
>
> /*
>  * Return a stringified description of the IO's target.
>  *
>  * The string is localized and allocated in the current memory context.
>  */

Fixed.

> >                       /*
> >                        * We don't expect this to ever fail with ERROR or FATAL, no need
> >                        * to keep error_ioh set to the IO.
> > @@ -550,8 +810,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
> >               }
> >               else
> >               {
> > -                     WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
> > -                                       WAIT_EVENT_IO_WORKER_MAIN);
> > +                     int                     timeout_ms;
> > +
> > +                     /* Cancel new worker request if pending. */
> > +                     pgaio_worker_grow(false);
>
> That seems to happen very frequently.

Yeah, but it doesn't write to memory after someone else does it.  This
again is part of the strategy for preventing excess workers from being
created, if I've found the queue to be empty.

> > +                             /*
> > +                              * All workers maintain the absolute timeout value, but only
> > +                              * the highest worker can actually time out and only if
> > +                              * io_min_workers is satisfied.  All others wait only for
> > +                              * explicit wakeups caused by queue insertion, wakeup
> > +                              * propagation, change of pool size (possibly promoting one to
> > +                              * new highest) or GUC reload.
> > +                              */
> > +                             if (pgaio_worker_can_timeout())
> > +                                     timeout_ms =
> > +                                             TimestampDifferenceMilliseconds(now,
> > +                                                                                                             idle_timeout_abs);
> > +                             else
> > +                                     timeout_ms = -1;
>
>
> Hm. This way you get very rapid worker pool reductions.  Configured
> io_worker_idle_timeout=1s, started a bunch of work of and observed the worker
> count after the work finishes:
>
> Mon 06 Apr 2026 02:08:28 PM EDT (every 1s)
>
> count
> 32
> (1 row)
> Mon 06 Apr 2026 02:08:29 PM EDT (every 1s)
>
> count
> 32
> (1 row)
> Mon 06 Apr 2026 02:08:30 PM EDT (every 1s)
>
> count
> 1
> (1 row)
> Mon 06 Apr 2026 02:08:31 PM EDT (every 1s)
>
> count
> 1
> (1 row)
>
>
> Of course this is a ridiculuously low setting, but it does seems like starting
> the timeout even when not the highest numbered worker will lead to a lot of
> quick yoyoing.

I have changed it so that after one worker times out, the next one
begins its timeout count from 0.  (This is one of the reasons for that
"notify the whole pool when I exit" thing.)

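The effect of restarting the count can be sketched with a toy timer.
IdleTimer and its helpers are hypothetical names, and time is a plain
millisecond counter rather than TimestampTz; the only point is that a
surviving worker's deadline is pushed out when a peer exits, so the
pool shrinks by at most one worker per timeout period.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t millis_t;

typedef struct IdleTimer
{
	millis_t	deadline;		/* absolute time; 0 means not armed */
} IdleTimer;

/* Arm the timer when a worker goes idle, unless it is already armed. */
static void
idle_timer_arm(IdleTimer *t, millis_t now, millis_t timeout)
{
	assert(timeout > 0);

	if (t->deadline == 0)
		t->deadline = now + timeout;
}

/* Restart the count from zero, e.g. when notified that a peer exited. */
static void
idle_timer_restart(IdleTimer *t, millis_t now, millis_t timeout)
{
	assert(timeout > 0);

	t->deadline = now + timeout;
}

/* Only the highest numbered worker is allowed to time out. */
static int
idle_timer_expired(const IdleTimer *t, millis_t now, int is_highest_worker)
{
	return is_highest_worker && t->deadline != 0 && now >= t->deadline;
}
```

With a 1000ms timeout and two workers idle since time 0, the higher
one exits at 1000, and the survivor, restarted at that moment, cannot
exit before 2000.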

Attachments:

  [text/x-patch] v5-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch (45.3K, 2-v5-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch)
From 537a3df61bf2f2c258f71ea367c27e5550c9092c Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v5] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3.  It
was hard to know how to tune it effectively.  It is now replaced with:

  io_min_workers=2
  io_max_workers=8 (up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent variation in demand.  It grows when existing workers detect a
backlog, and shrinks when the highest numbered worker is idle for too
long.  Work was already concentrated into low-numbered workers in
anticipation of this logic.

The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.

Reviewed-by: Dmitry Dolgov <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  69 +-
 src/backend/postmaster/postmaster.c           | 166 +++--
 src/backend/storage/aio/method_worker.c       | 592 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_parameters.dat     |  34 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +-
 src/include/storage/io_worker.h               |  10 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 751 insertions(+), 145 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3324d2d3c49..86899c4be68 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2870,16 +2870,75 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes. The default is
+         2. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which entirely idle I/O worker processes exit, reducing the
+         size of the pool to match demand.  The default is 1 minute.  This
+         parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time before another I/O worker can be launched.  This avoids
+         creating too many for an unsustained burst of activity.  The default is 100ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f13e8f40a0..cb2ccd9900c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,6 +409,7 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_next_time = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -447,7 +448,8 @@ static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
 static bool maybe_reap_io_worker(int pid);
-static void maybe_adjust_io_workers(void);
+static void maybe_start_io_workers(void);
+static TimestampTz maybe_start_io_workers_scheduled_at(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
@@ -1391,7 +1393,7 @@ PostmasterMain(int argc, char *argv[])
 	UpdatePMState(PM_STARTUP);
 
 	/* Make sure we can perform I/O while starting up. */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/* Start bgwriter and checkpointer so they can help with recovery */
 	if (CheckpointerPMChild == NULL)
@@ -1555,14 +1557,15 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	TimestampTz next_wakeup;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * If in ImmediateShutdown with a SIGKILL timeout, ignore everything else
+	 * and wait for that.
+	 *
+	 * XXX Shouldn't this also test FatalError?
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown >= ImmediateShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1582,14 +1585,16 @@ DetermineSleepTime(void)
 
 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
+	/* Time of next maybe_start_io_workers() call, or 0 for none. */
+	next_wakeup = maybe_start_io_workers_scheduled_at();
+
+	/* Ignore bgworkers during shutdown. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
 
-	if (HaveCrashedWorker)
+	if (HaveCrashedWorker && Shutdown == NoShutdown)
 	{
 		dlist_mutable_iter iter;
 
@@ -2542,7 +2547,17 @@ process_pm_child_exit(void)
 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 				HandleChildCrash(pid, exitstatus, _("io worker"));
 
-			maybe_adjust_io_workers();
+			/*
+			 * A worker that exited with an error might have brought the pool
+			 * size below io_min_workers, or allowed the queue to grow to the
+			 * point where another worker called for growth.
+			 *
+			 * In the common case that a worker timed out due to idleness, no
+			 * replacement needs to be started.  maybe_start_io_workers() will
+			 * figure that out.
+			 */
+			maybe_start_io_workers();
+
 			continue;
 		}
 
@@ -3262,7 +3277,7 @@ PostmasterStateMachine(void)
 		UpdatePMState(PM_STARTUP);
 
 		/* Make sure we can perform I/O while starting up. */
-		maybe_adjust_io_workers();
+		maybe_start_io_workers();
 
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
@@ -3336,7 +3351,7 @@ LaunchMissingBackgroundProcesses(void)
 	 * A config file change will always lead to this function being called, so
 	 * we always will process the config change in a timely manner.
 	 */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/*
 	 * The checkpointer and the background writer are active from the start,
@@ -3797,6 +3812,16 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_start_io_workers() will run.
+		 */
+	}
+
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4399,44 +4424,104 @@ maybe_reap_io_worker(int pid)
 }
 
 /*
- * Start or stop IO workers, to close the gap between the number of running
- * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
- * "replacement" workers).
+ * Returns the next time at which maybe_start_io_workers() would start one or
+ * more I/O workers.  Any time in the past means ASAP, and 0 means no worker
+ * is currently scheduled.
+ *
+ * This is called by DetermineSleepTime() and also maybe_start_io_workers()
+ * itself, to make sure that they agree.
  */
-static void
-maybe_adjust_io_workers(void)
+static TimestampTz
+maybe_start_io_workers_scheduled_at(void)
 {
 	if (!pgaio_workers_enabled())
-		return;
+		return 0;
 
 	/*
 	 * If we're in final shutting down state, then we're just waiting for all
 	 * processes to exit.
 	 */
 	if (pmState >= PM_WAIT_IO_WORKERS)
-		return;
+		return 0;
 
 	/* Don't start new workers during an immediate shutdown either. */
 	if (Shutdown >= ImmediateShutdown)
-		return;
+		return 0;
 
 	/*
 	 * Don't start new workers if we're in the shutdown phase of a crash
 	 * restart. But we *do* need to start if we're already starting up again.
 	 */
 	if (FatalError && pmState >= PM_STOP_BACKENDS)
-		return;
+		return 0;
+
+	/*
+	 * Don't start a worker if we're at or above the maximum.  (Excess workers
+	 * exit when the GUC is lowered, but the count can be temporarily too high
+	 * until they are reaped.)
+	 */
+	if (io_worker_count >= io_max_workers)
+		return 0;
+
+	/* If we're under the minimum, start a worker as soon as possible. */
+	if (io_worker_count < io_min_workers)
+		return TIMESTAMP_MINUS_INFINITY;	/* start worker ASAP */
+
+	/* Only proceed if a "grow" request is pending from existing workers. */
+	if (!pgaio_worker_pm_test_grow())
+		return 0;
 
-	Assert(pmState < PM_WAIT_IO_WORKERS);
+	/*
+	 * maybe_start_io_workers() should start a new I/O worker after this time,
+	 * or as soon as possible if it is already in the past.
+	 */
+	return io_worker_launch_next_time;
+}
+
+/*
+ * Start I/O workers if required.  Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+	TimestampTz scheduled_at;
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
 	{
+		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
 		int			i;
 
+		Assert(pmState < PM_WAIT_IO_WORKERS);
+
+		/* Still waiting for the scheduled time? */
+		if (scheduled_at > now)
+			break;
+
+		/* Clear the grow request flag if it is set. */
+		pgaio_worker_pm_clear_grow();
+
+		/*
+		 * Compute next launch time relative to the previous value, so that
+		 * time spent on the postmaster's other duties doesn't result in an
+		 * inaccurate launch interval.
+		 */
+		io_worker_launch_next_time =
+			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+										io_worker_launch_interval);
+
+		/*
+		 * If that's already in the past, the interval is either impossibly
+		 * short or we received no requests for new workers for a period.
+		 * Compute a new future time relative to the last launch time instead.
+		 */
+		if (io_worker_launch_next_time <= now)
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(now, io_worker_launch_interval);
+
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4454,22 +4539,21 @@ maybe_adjust_io_workers(void)
 			++io_worker_count;
 		}
 		else
-			break;				/* try again next time */
-	}
-
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
+			/*
+			 * Fork failure: we'll try again after the launch interval
+			 * expires, or be called again without delay if we don't yet have
+			 * io_min_workers.  Don't loop here though, the postmaster has
+			 * other duties.
+			 */
+			break;
 		}
 	}
+
+	/*
+	 * Workers decide when to shut down by themselves, according to the
+	 * io_max_workers and io_worker_idle_timeout GUCs.
+	 */
 }
 
 
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index eb686cede1a..fb7dca253c7 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,8 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * The pool tries to stabilize at a size that can handle recently seen
+ * variation in demand, within the configured limits.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -29,6 +28,8 @@
 
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
 #include "tcop/tcopprot.h"
@@ -48,10 +51,22 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
+/*
+ * Saturation for counters used to estimate wakeup:IO ratio.
+ *
+ * We maintain wakeup_count for wakeups received and io_count for IOs
+ * processed by each worker.  When either counter reaches this saturation
+ * value, we divide both by two.  The result is an exponentially decaying
+ * ratio of wakeups to IOs, with a very short memory.
+ *
+ * If a worker is itself experiencing useless wakeups, it assumes that
+ * higher-numbered workers would experience even more, so it should end the
+ * chain.
+ */
+#define PGAIO_WORKER_WAKEUP_RATIO_SATURATE 4
 
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,13 +78,34 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction.  If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+
+#define PGAIO_WORKERSET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKERSET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	bool		grow;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_workerset;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet workerset;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
@@ -91,15 +127,108 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 2;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 100;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
 
+static void
+pgaio_workerset_initialize(PgAioWorkerSet *set)
+{
+	*set = 0;
+}
+
+static bool
+pgaio_workerset_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_workerset_singleton(int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_workerset_all(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKERSET_BITS - MAX_IO_WORKERS);
+}
+
+static void
+pgaio_workerset_subtract(PgAioWorkerSet *set1, const PgAioWorkerSet *set2)
+{
+	*set1 &= ~*set2;
+}
+
+static void
+pgaio_workerset_insert(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set |= pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= ~pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove_lte(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= (~(PgAioWorkerSet) 0) << (worker + 1);
+}
+
+static int
+pgaio_workerset_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_get_lowest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_rightmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_pop_lowest(PgAioWorkerSet *set)
+{
+	int			worker = pgaio_workerset_get_lowest(set);
+
+	pgaio_workerset_remove(set, worker);
+	return worker;
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_workerset_contains(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return (*set & pgaio_workerset_singleton(worker)) != 0;
+}
+
+static int
+pgaio_workerset_count(PgAioWorkerSet *set)
+{
+	return pg_popcount64(*set);
+}
+#endif
+
 static void
 pgaio_worker_shmem_request(void *arg)
 {
@@ -133,37 +262,123 @@ pgaio_worker_shmem_init(void *arg)
 	io_worker_submission_queue->size = queue_size;
 	io_worker_submission_queue->head = 0;
 	io_worker_submission_queue->tail = 0;
+	io_worker_control->grow = false;
+	pgaio_workerset_initialize(&io_worker_control->workerset);
+	pgaio_workerset_initialize(&io_worker_control->idle_workerset);
 
-	io_worker_control->idle_worker_mask = 0;
 	for (int i = 0; i < MAX_IO_WORKERS; ++i)
+		io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+}
+
+/*
+ * Tell postmaster that we think a new worker is needed.
+ */
+static void
+pgaio_worker_request_grow(void)
+{
+	if (!io_worker_control->grow)
+	{
+		io_worker_control->grow = true;
+		pg_memory_barrier();
+		SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
+	}
+}
+
+/*
+ * Cancel any request for a new worker, after observing an empty queue.
+ */
+static void
+pgaio_worker_cancel_grow(void)
+{
+	if (io_worker_control->grow)
 	{
-		io_worker_control->workers[i].latch = NULL;
-		io_worker_control->workers[i].in_use = false;
+		io_worker_control->grow = false;
+		pg_memory_barrier();
 	}
 }
 
+/*
+ * Called by the postmaster to check if a new worker is requested.
+ */
+bool
+pgaio_worker_pm_test_grow(void)
+{
+	pg_memory_barrier();
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the request for a new worker.
+ */
+void
+pgaio_worker_pm_clear_grow(void)
+{
+	if (io_worker_control)
+		io_worker_control->grow = false;
+	pg_memory_barrier();
+}
+
 static int
-pgaio_worker_choose_idle(void)
+pgaio_worker_choose_idle(int only_workers_above)
 {
+	PgAioWorkerSet workerset;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	workerset = io_worker_control->idle_workerset;
+	if (only_workers_above >= 0)
+		pgaio_workerset_remove_lte(&workerset, only_workers_above);
+	if (pgaio_workerset_is_empty(&workerset))
 		return -1;
 
-	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	/* Find the lowest numbered idle worker and mark it not idle. */
+	worker = pgaio_workerset_get_lowest(&workerset);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_workerset.  That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
+/*
+ * Try to wake a set of workers.  Used on pool change, to close races
+ * described in the callers.
+ */
+static void
+pgaio_workerset_wake(PgAioWorkerSet workerset)
+{
+	while (!pgaio_workerset_is_empty(&workerset))
+		pgaio_worker_wake(pgaio_workerset_pop_lowest(&workerset));
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -185,6 +400,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	int			result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return -1;				/* empty */
@@ -201,6 +418,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -226,8 +445,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -252,19 +470,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 				break;
 			}
 
-			if (wakeup == NULL)
-			{
-				/* Choose an idle worker to wake up if we haven't already. */
-				worker = pgaio_worker_choose_idle();
-				if (worker >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
-
-				pgaio_debug_io(DEBUG4, staged_ios[i],
-							   "choosing worker %d",
-							   worker);
-			}
+			/* Choose one worker to wake for this batch. */
+			if (worker == -1)
+				worker = pgaio_worker_choose_idle(-1);
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
+
+		/* Wake up chosen worker.  It will wake peers if necessary. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
 	}
 	else
 	{
@@ -273,9 +487,6 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		nsync = num_staged_ios;
 	}
 
-	if (wakeup)
-		SetLatch(wakeup);
-
 	/* Run whatever is left synchronously. */
 	if (nsync > 0)
 	{
@@ -295,14 +506,30 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_workerset_contains(&io_worker_control->workerset, MyIoWorkerId));
+	pgaio_workerset_remove(&io_worker_control->workerset, MyIoWorkerId);
+	notify_set = io_worker_control->workerset;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/*
+	 * Notify other workers on pool change.  This allows the new highest
+	 * worker to know that it is now the one that can time out, and closes a
+	 * wakeup-loss race described in pgaio_worker_wake().
+	 */
+	pgaio_workerset_wake(notify_set);
 }
 
 /*
@@ -312,33 +539,38 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_workerset;
+	PgAioWorkerSet old_workerset;
+
 	MyIoWorkerId = -1;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	/* Find lowest unused worker ID. */
+	pgaio_workerset_all(&free_workerset);
+	pgaio_workerset_subtract(&free_workerset, &io_worker_control->workerset);
+	if (!pgaio_workerset_is_empty(&free_workerset))
+		MyIoWorkerId = pgaio_workerset_get_lowest(&free_workerset);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
-	}
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
 
-	if (MyIoWorkerId == -1)
-		elog(ERROR, "couldn't find a free worker slot");
+	old_workerset = io_worker_control->workerset;
+	Assert(!pgaio_workerset_contains(&old_workerset, MyIoWorkerId));
+	pgaio_workerset_insert(&io_worker_control->workerset, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(io_worker_control->nworkers <= MAX_IO_WORKERS);
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	/*
+	 * Notify other workers on pool change.  If we were the highest worker,
+	 * this allows the new highest worker to know that it can time out.
+	 */
+	pgaio_workerset_wake(old_workerset);
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -364,14 +596,48 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	PgAioWorkerSet workerset;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	workerset = io_worker_control->workerset;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_workerset_get_highest(&workerset))
+		return false;
+
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			io_count = 0;
+	int			wakeup_count = 0;
 
 	AuxiliaryProcessMainCommon();
 
@@ -439,10 +705,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		maybe_grow = false;
 
 		/*
 		 * Try to get a job to do.
@@ -453,38 +718,106 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_workerset_insert(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_workerset_remove(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do that
+			 * if this worker is not receiving spurious wakeups itself.  The
+			 * intention is create a frontier beyond which idle workers stay
+			 * asleep.
+			 *
+			 * This heuristic tries to discover the useful wakeup propagation
+			 * chain length when IOs are very fast and workers wake up to find
+			 * that all IOs have already been taken.
+			 *
+			 * If we chose not to wake a worker when we ideally should have,
+			 * then ios will soon exceed wakeups.
+			 */
+			if (wakeup_count <= io_count)
 			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
+				queue_depth = pgaio_worker_submission_queue_depth();
+				if (queue_depth > 0)
+				{
+					/* Choose a worker higher than me to wake. */
+					worker = pgaio_worker_choose_idle(MyIoWorkerId);
+					if (worker == -1)
+						maybe_grow = true;
+				}
 			}
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+		{
+			pgaio_worker_wake(worker);
+		}
+		else if (maybe_grow)
+		{
+			/*
+			 * We know there was at least one more item in the queue, and we
+			 * failed to find a higher-numbered idle worker to wake.  Now we
+			 * decide if we should try to start one more worker.
+			 *
+			 * We do this with a simple heuristic: is the queue depth greater
+			 * than the current number of workers?
+			 *
+			 * Consider the following situations:
+			 *
+			 * 1. The queue depth is constantly increasing, because IOs are
+			 * arriving faster than they can possibly be serviced.  It doesn't
+			 * matter much which threshold we choose, as we will surely hit
+			 * it.  Crossing the current worker count is a useful signal
+			 * because it's clearly too deep to avoid queuing latency already,
+			 * but still leaves a small window of opportunity to improve the
+			 * situation before the queue overflows.
+			 *
+			 * 2. The worker pool is keeping up, no latency is being
+			 * introduced and an extra worker would be a waste of resources.
+			 * Queue depth distributions tend to be heavily skewed, with long
+			 * tails of low probability spikes (due to submission clustering,
+			 * scheduling, jitter, stalls, noisy neighbors, etc).  We want a
+			 * number that is very unlikely to be triggered by an outlier, and
+			 * we bet that an exponential or similar distribution whose
+			 * outliers never reach this threshold must be almost entirely
+			 * concentrated at the low end.  If we do see a spike as big as
+			 * the worker count, we take it as a signal that the distribution
+			 * is surely too wide.
+			 *
+			 * On its own, this is an extremely crude signal.  When combined
+			 * with the wakeup propagation test that precedes it and the
+			 * io_worker_launch_interval, we can try each pool size until we find
+			 * one that doesn't trigger further growth.
+			 *
+			 * XXX Ideas from queueing theory or control theory could surely
+			 * do a much better job of this.
+			 */
+
+			/* Read nworkers without lock for this heuristic purpose. */
+			if (queue_depth > io_worker_control->nworkers)
+				pgaio_worker_request_grow();
+		}
 
 		if (io_index != -1)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++io_count == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+			{
+				wakeup_count /= 2;
+				io_count /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -537,6 +870,19 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			}
 #endif
 
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			{
+				char	   *description = pgaio_io_get_target_description(ioh);
+
+				sprintf(cmd, "%d: [%s] %s",
+						MyIoWorkerId,
+						pgaio_io_get_op_name(ioh),
+						description);
+				pfree(description);
+				set_ps_display(cmd);
+			}
+#endif
+
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -550,8 +896,76 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker request if pending. */
+			pgaio_worker_cancel_grow();
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* Only the highest-numbered worker can time out. */
+				if (pgaio_worker_can_timeout())
+				{
+					if (idle_timeout_abs == 0)
+					{
+						/*
+						 * I have just been promoted to the timeout worker, or
+						 * the GUC changed.  Compute new absolute time from
+						 * now.
+						 */
+						idle_timeout_abs =
+							TimestampTzPlusMilliseconds(now,
+														io_worker_idle_timeout);
+						timeout_guc_used = io_worker_idle_timeout;
+					}
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now, idle_timeout_abs);
+				}
+				else
+				{
+					/* No timeout for me. */
+					idle_timeout_abs = 0;
+					timeout_ms = -1;
+				}
+			}
+
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: idle, wakeups:ios = %d:%d",
+					MyIoWorkerId, wakeup_count, io_count);
+			set_ps_display(cmd);
+#endif
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++wakeup_count == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+				{
+					wakeup_count /= 2;
+					io_count /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -561,6 +975,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7bda5298558..560659f9568 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 LogicalDecodingControl	"Waiting to read or update logical decoding status information."
 DataChecksumsWorker	"Waiting for data checksums worker."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index fcb6ab80583..584ff79d0ba 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1390,6 +1390,14 @@
   check_hook => 'check_io_max_concurrency',
 },
 
+{ name => 'io_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_max_workers',
+  boot_val => '8',
+  min => '1',
+  max => 'MAX_IO_WORKERS',
+},
+
 { name => 'io_method', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_IO',
   short_desc => 'Selects the method for executing asynchronous I/O.',
   variable => 'io_method',
@@ -1398,14 +1406,32 @@
   assign_hook => 'assign_io_method',
 },
 
-{ name => 'io_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
-  short_desc => 'Number of IO worker processes, for io_method=worker.',
-  variable => 'io_workers',
-  boot_val => '3',
+{ name => 'io_min_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_min_workers',
+  boot_val => '2',
   min => '1',
   max => 'MAX_IO_WORKERS',
 },
 
+{ name => 'io_worker_idle_timeout', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum time before idle I/O worker processes time out, for io_method=worker.',
+  variable => 'io_worker_idle_timeout',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '60000',
+  min => '0',
+  max => 'INT_MAX',
+},
+
+{ name => 'io_worker_launch_interval', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum time before launching a new I/O worker process, for io_method=worker.',
+  variable => 'io_worker_launch_interval',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX',
+},
+
 # Not for general use --- used by SET SESSION AUTHORIZATION and SET
 # ROLE
 { name => 'is_superuser', type => 'bool', context => 'PGC_INTERNAL', group => 'UNGROUPED',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e3e462f3efb..e28599f478e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -218,7 +218,11 @@
                                         # can execute simultaneously
                                         # -1 sets based on shared_buffers
                                         # (change requires restart)
-#io_workers = 3                         # 1-32;
+
+#io_min_workers = 2                     # 1-32 (change requires pg_reload_conf())
+#io_max_workers = 8                     # 1-32
+#io_worker_idle_timeout = 60s
+#io_worker_launch_interval = 100ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index f7d5998a138..cffffd62fdd 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,14 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+/* Public GUCs. */
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_pm_test_grow(void);
+extern void pgaio_worker_pm_clear_grow(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index af8553bcb6c..d7eb648bd27 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -88,6 +88,7 @@ PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
 PG_LWLOCK(55, LogicalDecodingControl)
 PG_LWLOCK(56, DataChecksumsWorker)
+PG_LWLOCK(57, AioWorkerControl)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 001e6eea61c..bcce4011790 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_GROW,	/* I/O worker pool wants to grow */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index 34bc132ea08..b9775811d4d 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,24 +65,24 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		like(
 			$stderr,
-			qr/$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			qr/$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating number of io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6a39f5608..e411fe55254 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2270,6 +2270,7 @@ PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgAioWorkerControl
+PgAioWorkerSet
 PgAioWorkerSlot
 PgAioWorkerSubmissionQueue
 PgArchData
-- 
2.53.0



^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 10:39                   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-07 19:01                     ` Andres Freund <[email protected]>
  2026-04-07 23:18                       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Andres Freund @ 2026-04-07 19:01 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,

On 2026-04-07 22:39:37 +1200, Thomas Munro wrote:

> > > @@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
> > >
> > >                       return seconds * 1000;
> > >               }
> > > -             else
> > > -                     return 60 * 1000;
> > >       }
> > >
> > > -     if (StartWorkerNeeded)
> > > +     /* Time of next maybe_start_io_workers() call, or 0 for none. */
> > > +     next_wakeup = maybe_start_io_workers_scheduled_at();
> > > +
> > > +     /* Ignore bgworkers during shutdown. */
> > > +     if (StartWorkerNeeded && Shutdown == NoShutdown)
> > >               return 0;
> >
> > Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
> > here?
>
> Seems OK?  I mean sure I would like to make this whole function more
> uniform in structure, see my second patch, but...

It's ok, there just doesn't seem to be a point in doing it before that if,
rather than just after...

> > > +static int
> > > +pgaio_worker_set_get_highest(PgAioWorkerSet *set)
> > > +{
> > > +     Assert(!pgaio_worker_set_is_empty(set));
> > > +     return pg_leftmost_one_pos64(*set);
> > > +}
> >
> > "worker_set_get*" reads quite awkwardly.  Maybe just going for
> > pgaio_workerset_* would help?
> >
> > Or maybe just name it PgAioWset/pgaio_wset_ or such?
>
> OK let's try "workerset".

Looks better.
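
For anyone following along without the patch open, here is a minimal
standalone sketch of what a bitmap-backed "workerset" could look like.  It
is not the patch's code: the ws_* names and SKETCH_MAX_WORKERS are
hypothetical, and it uses plain loops where the patch uses
pg_leftmost_one_pos64().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_WORKERS 64	/* hypothetical; one bit per worker ID */

typedef uint64_t WorkerSet;		/* hypothetical stand-in for PgAioWorkerSet */

static void
ws_insert(WorkerSet *set, int id)
{
	assert(id >= 0 && id < SKETCH_MAX_WORKERS);
	*set |= UINT64_C(1) << id;
}

static void
ws_remove(WorkerSet *set, int id)
{
	assert(id >= 0 && id < SKETCH_MAX_WORKERS);
	*set &= ~(UINT64_C(1) << id);
}

static bool
ws_contains(const WorkerSet *set, int id)
{
	assert(id >= 0 && id < SKETCH_MAX_WORKERS);
	return (*set & (UINT64_C(1) << id)) != 0;
}

/* Lowest member ID; the set must not be empty. */
static int
ws_get_lowest(const WorkerSet *set)
{
	assert(*set != 0);
	for (int id = 0; id < SKETCH_MAX_WORKERS; id++)
		if (ws_contains(set, id))
			return id;
	return -1;					/* not reached for a non-empty set */
}

/* Highest member ID; the set must not be empty. */
static int
ws_get_highest(const WorkerSet *set)
{
	assert(*set != 0);
	for (int id = SKETCH_MAX_WORKERS - 1; id >= 0; id--)
		if (ws_contains(set, id))
			return id;
	return -1;					/* not reached for a non-empty set */
}
```

Taking the lowest member of the complement of the set gives the lowest
unused worker ID, which is the shape of the lookup that
pgaio_worker_register() does in the patch.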


> > Maybe just name it pgaio_worker_set_grow()?
>
> OK how about:
>
> pgaio_worker_request_grow()
> pgaio_worker_cancel_grow()

WFM.


> > > @@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > >                               break;
> > >                       }
> > >
> > > -                     if (wakeup == NULL)
> > > -                     {
> > > -                             /* Choose an idle worker to wake up if we haven't already. */
> > > -                             worker = pgaio_worker_choose_idle();
> > > -                             if (worker >= 0)
> > > -                                     wakeup = io_worker_control->workers[worker].latch;
> > > -
> > > -                             pgaio_debug_io(DEBUG4, staged_ios[i],
> > > -                                                        "choosing worker %d",
> > > -                                                        worker);
> > > -                     }
> > > +                     /* Choose one worker to wake for this batch. */
> > > +                     if (worker == -1)
> > > +                             worker = pgaio_worker_choose_idle(0);
> > >               }
> >
> > If we only want to do this once per "batch", why not just do it outside the
> > num_staged_ios loop?
>
> Two steps: pgaio_worker_choose_idle() must be done while holding the
> queue lock (will probably finish up revising this in future work on
> removing locks...).  pgaio_worker_wake() is called outside the loop,
> after releasing the lock.

I just meant doing it outside the for loop.

		for (int i = 0; i < num_staged_ios; ++i)
		{
			Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
			if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
			{
				/*
				 * Do the rest synchronously. If the queue is full, give up
				 * and do the rest synchronously. We're holding an exclusive
				 * lock on the queue so nothing can consume entries.
				 */
				synchronous_ios = &staged_ios[i];
				nsync = (num_staged_ios - i);

				break;
			}

			/* Choose one worker to wake for this batch. */
			if (worker == -1)
				worker = pgaio_worker_choose_idle(-1);
		}

The if (worker == -1) check is done for every to-be-submitted IO.  If there are no
idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM it
should just be:

		for (int i = 0; i < num_staged_ios; ++i)
		{
			Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
			if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
			{
				/*
				 * Do the rest synchronously. If the queue is full, give up
				 * and do the rest synchronously. We're holding an exclusive
				 * lock on the queue so nothing can consume entries.
				 */
				synchronous_ios = &staged_ios[i];
				nsync = (num_staged_ios - i);

				break;
			}
		}

		/* Choose one worker to wake for this batch. */
		if (worker == -1)
			worker = pgaio_worker_choose_idle(-1);


> > > @@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > >  static void
> > >  pgaio_worker_die(int code, Datum arg)
> > >  {
> > > [...]
> > > +     /* Notify other workers on pool change. */
> >
> > Why are we notifying them on pool changes?
>
> Comments added to explain.  It closes a wakeup-loss race (imagine if
> you consumed a wakeup while you were exiting due to timeout; no one
> else would wake up, which I fixed with this big hammer).

Thanks, looks a lot clearer now.

> > > +/*
> > > + * Check if this backend is allowed to time out, and thus should use a
> > > + * non-infinite sleep time.  Only the highest-numbered worker is allowed to
> > > + * time out, and only if the pool is above io_min_workers.  Serializing
> > > + * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
> > > + * io_min_workers.
> >
> > But it's ok if a lower numbered worker errors out, right?  There will be a
> > temporary gap, but we will start a new worker for it?
>
> Yes it is OK for there to be gaps.
>
> If any worker errors out, it will be replaced when reaped if we fell
> below io_min_workers, and otherwise replaced via the usual means, ie
> once the backlog detection and the launch delay allow it.  I did have
> a version that always replaced *every* worker with exit code 1
> immediately, but I started wondering if we really want persistent
> errors to turn into high speed fork() loops.  I'm still not sure TBH.
> We don't expect workers to error out, so it means something is already
> pretty screwed up and you might appreciate the rate limiting?

Yea, I think it's saner not to do that.




> > I think both 'wakeups" and "ios" are a bit too generically named. Based on the
> > names I have no idea what this heuristic might be.
>
> I have struggled to name them.  Does wakeup_count and io_count help?

hist_wakeups, hist_ios?
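
To spell out what those two counters are doing, here is a rough standalone
sketch of the decayed wakeup:IO history as I read it from the patch.  The
struct and function names are hypothetical, and SKETCH_RATIO_SATURATE is a
made-up value standing in for PGAIO_WORKER_WAKEUP_RATIO_SATURATE.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_RATIO_SATURATE 8	/* hypothetical value, for illustration */

typedef struct WakeupHistory
{
	int			hist_wakeups;	/* latch wakeups seen recently */
	int			hist_ios;		/* IOs actually executed recently */
} WakeupHistory;

/* Halve both counters so that old history decays but the ratio survives. */
static void
history_decay(WakeupHistory *h)
{
	h->hist_wakeups /= 2;
	h->hist_ios /= 2;
}

static void
history_note_wakeup(WakeupHistory *h)
{
	assert(h->hist_wakeups >= 0);
	if (++h->hist_wakeups == SKETCH_RATIO_SATURATE)
		history_decay(h);
}

static void
history_note_io(WakeupHistory *h)
{
	assert(h->hist_ios >= 0);
	if (++h->hist_ios == SKETCH_RATIO_SATURATE)
		history_decay(h);
}

/*
 * A worker that is itself being woken more often than it finds work should
 * not pass wakeups on to higher-numbered peers.
 */
static bool
history_may_propagate(const WakeupHistory *h)
{
	return h->hist_wakeups <= h->hist_ios;
}
```

The halving is what keeps the test responsive: a burst of spurious wakeups
stops propagation, and a later run of real IOs restores it without the
counters growing without bound.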



> > > +                                     /*
> > > +                                      * If there were no idle higher numbered peers and there
> > > +                                      * are more than enough IOs queued for me and all lower
> > > +                                      * numbered peers, then try to start a new worker.
> > > +                                      */
> > > +                                     if (worker == -1 && queue_depth > MyIoWorkerId)
> > > +                                             grow = true;
> > > +                             }
> >
> > We probably shouldn't request growth when already at the cap? That could
> > generate a *lot* of pmsignal traffic, I think?
>
> No, we only set it if it isn't already set (like a latch), and only
> send a pmsignal when we set it (like a latch), and the postmaster only
> clears it if it can start a worker (unlike a latch).  That applies in
> general, not just when we hit the cap of io_max_workers: while the
> postmaster is waiting for launch interval to expire, it will leave the
> flag set, suppressed for 100ms or whatever, and in the special
> case of io_max_workers, for as long as the count remains that high.

I'm quite certain that's not how it actually ended up working with the prior
version and the benchmark I showed, there indeed were a lot of requests to
postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the old
name already) very frequently clears the flag, just for it to be immediately
set again.


Yep, still happens, does require the max to be smaller than 32 though.

While a lot of IO is happening, no new connections being started, and with
1781562 being postmaster's pid:

perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1

 Performance counter stats for process id '1781562':

             2,790      raw_syscalls:sys_enter

       1.001872667 seconds time elapsed

             2,814      raw_syscalls:sys_enter

       1.001983049 seconds time elapsed

             3,036      raw_syscalls:sys_enter

       1.001705850 seconds time elapsed

             2,982      raw_syscalls:sys_enter

       1.001881364 seconds time elapsed


I think it may need a timestamp in the shared state to not allow another
postmaster wake until some time has elapsed, or something.
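
Something along these lines, as an untested sketch using C11 atomics
rather than our own atomics API.  GrowRequestState and
grow_request_allowed() are hypothetical names, and the time source is left
to the caller:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared-memory state; not a struct from the patch. */
typedef struct GrowRequestState
{
	_Atomic int64_t next_signal_allowed_us; /* earliest time of next signal */
} GrowRequestState;

/*
 * Return true if the caller should signal the postmaster now.  At most one
 * caller wins per min_interval_us window; everyone else skips the signal.
 */
static bool
grow_request_allowed(GrowRequestState *state, int64_t now_us,
					 int64_t min_interval_us)
{
	int64_t		allowed = atomic_load(&state->next_signal_allowed_us);

	assert(min_interval_us >= 0);

	while (now_us >= allowed)
	{
		/* Try to claim this window by pushing the deadline forward. */
		if (atomic_compare_exchange_weak(&state->next_signal_allowed_us,
										 &allowed,
										 now_us + min_interval_us))
			return true;
		/* CAS failure reloaded "allowed"; the loop condition re-checks it. */
	}

	return false;
}
```

With that in front of the pmsignal, clearing and re-setting the grow flag
at a high rate would cost only an atomic read per attempt, and the
postmaster would see at most one wakeup per interval.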


>
> I should have made it clearer that that's a secondary condition.  The
> primary condition is: a worker wanted to wake another worker, but
> found that none were idle.  Unfortunately the whole system is a bit
> too asynchronous for that to be a reliable cue on its own.  So, I also
> check if the queue appears to be (1) obviously growing: that's clearly
> too long and must be introducing latency, or (2) varying "too much".
> Which I detect in exactly the same way.
>
> Imagine a histogram that looks like this:
>
> LOG:  depth 00: 7898
> LOG:  depth 01: 1630
> LOG:  depth 02: 308
> LOG:  depth 03: 93
> LOG:  depth 04: 40
> LOG:  depth 05: 19
> LOG:  depth 06: 6
> LOG:  depth 07: 4
> LOG:  depth 08: 0
> LOG:  depth 09: 1
> LOG:  depth 10: 1
> LOG:  depth 11: 0
> LOG:  depth 12: 0
> LOG:  depth 13: 0
>
> If you're failing to find idle workers to wake up AND our magic
> threshold is hit by something in that long tail, then it'll call for
> backup.  Of course I'm totally sidestepping a lot of queueing theory
> maths and just saying "I'd better be able to find an idle worker when
> I want to" and if not, "there had better not be any outliers that
> reach this far".
>
> I've written a longer explanation in a long comment.  Including a
> little challenge for someone to do better with real science and maths.
> I hope it's a bit clearer at least.

Definitely good to have that comment.  Have to ponder it for a bit.
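
To help with the pondering, here is a small standalone sketch of the two
conditions as described above, plus a helper to see how much of a depth
histogram lies beyond a given pool size.  The function names are
hypothetical; only the "queue_depth > nworkers" test is taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Primary condition: no idle higher-numbered peer could be found.
 * Secondary condition: the queue is deeper than the current pool size.
 */
static bool
should_request_grow(bool found_idle_peer, int queue_depth, int nworkers)
{
	assert(nworkers > 0);

	if (found_idle_peer)
		return false;
	return queue_depth > nworkers;
}

/* Count histogram samples whose depth is strictly above the threshold. */
static long
samples_above(const long *hist, size_t nbuckets, int threshold)
{
	long		total = 0;

	for (size_t depth = 0; depth < nbuckets; depth++)
	{
		if ((int) depth > threshold)
			total += hist[depth];
	}
	return total;
}
```

Fed the histogram quoted above (10000 samples in total), a pool of 4
workers has 31 samples beyond the threshold, a pool of 8 has 2, and a pool
of 10 has none, so under this rule growth would stop somewhere around
there provided the idle-peer test also fails at the same moments.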



> > ninja install-test-files
> > io_max_workers=32
> > debug_io_direct=data
> > effective_io_concurrency=16
> > shared_buffers=5GB
> >
> > pgbench -i -q -s 100 --fillfactor=30
> >
> > CREATE EXTENSION IF NOT EXISTS test_aio;
> > CREATE EXTENSION IF NOT EXISTS pg_buffercache;
> > DROP TABLE IF EXISTS pattern_random_pgbench;
> > CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;
> >
> > My test is:
> >
> > SET effective_io_concurrency = 20;
> > SELECT pg_buffercache_evict_relation('pgbench_accounts');
> > SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;
> >
> >
> > We end up with ~24-28 workers, even though we never have more than 20 IOs in
> > flight. Not entirely sure why. I guess it's just that after doing an IO the
> > worker needs to mark itself idle etc?
>
> Yep.  It would be nice to make it a bit more accurate in later cycles.
> It tends to overprovision rather than under, since it thinks all other
> workers are busy.

I think that's the right direction to err into.


> That information is a bit racy.

Yea, I think that's fine.

> > Hm. This way you get very rapid worker pool reductions.  Configured
> > io_worker_idle_timeout=1s, started a bunch of work and observed the worker
> > count after the work finishes:
> > ...
> > Of course this is a ridiculously low setting, but it does seem like starting
> > the timeout even when not the highest numbered worker will lead to a lot of
> > quick yoyoing.
>
> I have changed it so that after one worker times out, the next one
> begins its timeout count from 0.  (This is one of the reasons for that
> "notify the whole pool when I exit" thing.)

That looks much better in a quick test.
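
The shrink rule discussed above can be sketched as follows.  This is an
illustrative model under stated assumptions, not the patch's code: it assumes
that only the highest numbered worker may time out, and that each worker
learns the time of the most recent pool exit so its own idle clock restarts
from there.  worker_should_exit() and all of its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether worker my_id should exit due to idleness.  Times are in
 * milliseconds on an arbitrary monotonic clock.
 */
static bool
worker_should_exit(int my_id, int nworkers, int min_workers,
				   int64_t idle_since_ms, int64_t last_exit_ms,
				   int64_t now_ms, int64_t idle_timeout_ms)
{
	int64_t		start;

	assert(my_id >= 0 && my_id < nworkers);

	/* Never shrink below the configured minimum. */
	if (nworkers <= min_workers)
		return false;

	/* Only the highest numbered worker is a candidate. */
	if (my_id != nworkers - 1)
		return false;

	/* Count from whichever is later: going idle, or the last pool exit. */
	start = idle_since_ms > last_exit_ms ? idle_since_ms : last_exit_ms;

	return now_ms - start >= idle_timeout_ms;
}
```

Under this model a pool of N idle workers sheds at most one worker per
timeout period, instead of collapsing all at once.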


I've not looked through the details again, but based on a relatively short
experiment, the one problematic thing I see is the frequent postmaster
requests.

Greetings,

Andres Freund






* Re: Automatically sizing the IO worker pool

From: Thomas Munro @ 2026-04-07 23:18 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <[email protected]> wrote:
> The if (worker == -1) is done for every to be submitted IO.  If there are no
> idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM it
> should just be:
>
>                 for (int i = 0; i < num_staged_ios; ++i)
>                 {
>                         Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
>                         if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
>                         {
>                                 /*
>                                  * Do the rest synchronously. If the queue is full, give up
>                                  * and do the rest synchronously. We're holding an exclusive
>                                  * lock on the queue so nothing can consume entries.
>                                  */
>                                 synchronous_ios = &staged_ios[i];
>                                 nsync = (num_staged_ios - i);
>
>                                 break;
>                         }
>                 }
>
>                 /* Choose one worker to wake for this batch. */
>                 if (worker == -1)
>                         worker = pgaio_worker_choose_idle(-1);

Well I didn't want to wake a worker if we'd failed to enqueue
anything.  Ahh, I could put it there and test nsync.  Or I guess I
could just do it anyway.  Considering that.

> > > I think both 'wakeups" and "ios" are a bit too generically named. Based on the
> > > names I have no idea what this heuristic might be.
> >
> > I have struggled to name them.  Does wakeup_count and io_count help?
>
> hist_wakeups, hist_ios?

Thanks, that's a good name.

> > No, we only set it if it isn't already set (like a latch), and only
> > send a pmsignal when we set it (like a latch), and the postmaster only
> > clears it if it can start a worker (unlike a latch).  That applies in
> > general, not just when we hit the cap of io_max_workers: while the
> > postmaster is waiting for launch interval to expire, it will leave the
> > flag set, suppressed for 100ms or whatever, and in the special
> > case of io_max_workers, for as long as the count remains that high.
>
> I'm quite certain that's not how it actually ended up working with the prior
> version and the benchmark I showed, there indeed were a lot of requests to
> postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the old
> name already) very frequently clears the flag, just for it to be immediately
> set again.
>
>
> Yep, still happens, does require the max to be smaller than 32 though.
>
> While a lot of IO is happening, no new connections being started, and with
> 1781562 being postmaster's pid:
>
> perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
>
>  Performance counter stats for process id '1781562':
>
>              2,790      raw_syscalls:sys_enter
>
>        1.001872667 seconds time elapsed
>
>              2,814      raw_syscalls:sys_enter
>
>        1.001983049 seconds time elapsed
>
>              3,036      raw_syscalls:sys_enter
>
>        1.001705850 seconds time elapsed
>
>              2,982      raw_syscalls:sys_enter
>
>        1.001881364 seconds time elapsed
>
>
> I think it may need a timestamp in the shared state to not allow another
> postmaster wake until some time has elapsed, or something.

Hnng.  Studying...

> > I should have made it clearer that that's a secondary condition.  The
> > primary condition is: a worker wanted to wake another worker, but
> > found that none were idle.  Unfortunately the whole system is a bit
> > too asynchronous for that to be a reliable cue on its own.  So, I also
> > check if the queue appears to be (1) obviously growing: that's clearly
> > too long and must be introducing latency, or (2) varying "too much".
> > Which I detect in exactly the same way.
> >
> > Imagine a histogram that looks like this:
> >
> > LOG:  depth 00: 7898
> > LOG:  depth 01: 1630
> > LOG:  depth 02: 308
> > LOG:  depth 03: 93
> > LOG:  depth 04: 40
> > LOG:  depth 05: 19
> > LOG:  depth 06: 6
> > LOG:  depth 07: 4
> > LOG:  depth 08: 0
> > LOG:  depth 09: 1
> > LOG:  depth 10: 1
> > LOG:  depth 11: 0
> > LOG:  depth 12: 0
> > LOG:  depth 13: 0
> >
> > If you're failing to find idle workers to wake up AND our magic
> > threshold is hit by something in that long tail, then it'll call for
> > backup.  Of course I'm totally sidestepping a lot of queueing theory
> > maths and just saying "I'd better be able to find an idle worker when
> > I want to" and if not, "there had better not be any outliers that
> > reach this far".
> >
> > I've written a longer explanation in a long comment.  Including a
> > little challenge for someone to do better with real science and maths.
> > I hope it's a bit clearer at least.
>
> Definitely good to have that comment.  Have to ponder it for a bit.

Let me try again.

Our goal is simple: process every IO immediately.  We have immediate
feedback that is simple: there's an IO in the queue and there is no
idle worker.  The only action we can take is simple: add one more
worker.  So we don't need to suffer through the maths required to
figure out the ideal k for our M/G/k queue system (I think that's what
we have?) or any of the inputs that would require*.  The problem is
that on its own, the test triggered far too easily because a worker
that is not marked idle might in fact be just about to pick up that IO
on the one hand, and because there might be rare
spikes/clustering on the other, so I cooled it off a bit by
additionally testing if the queue appears to be growing or spiking
beyond some threshold.  I think it's OK to let the queue grow a bit
before we are triggered anyway, so the precise value used doesn't seem
too critical.  Someone might be able to come up with a more defensible
value, but in the end I just wanted a value that isn't triggered by
the outliers I see in real systems that are keeping up.  We could tune
it lower and overshoot more, but this setting seems to work pretty
well.  It doesn't seem likely that a real system could achieve a
steady state that is introducing latency but isn't increasing over
time, and pool size adjustments are bound to lag anyway.

* It's probably quite hard for call centres to figure out the number
of agents required to make you wait for a certain length of time, but
it's easy to know if you had to wait and you wish they had more!
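
The two-part test described above can be written down as a minimal sketch.
It is an assumption-laden illustration, not the patch's logic: the function
name, its parameters and the DEPTH_THRESHOLD value are all hypothetical, and
the thread does not state the threshold the patch actually uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cut-off; the real value is not given in this thread. */
#define DEPTH_THRESHOLD 8

/*
 * Ask for one more worker only if both cues agree: nobody was idle when we
 * wanted to wake someone, and the queue has reached into the long tail.
 */
static bool
should_request_grow(int idle_workers, int queue_depth)
{
	assert(idle_workers >= 0 && queue_depth >= 0);

	/* Primary condition: we wanted to wake a worker but none was idle. */
	if (idle_workers > 0)
		return false;

	/* Secondary condition: filter out ordinary short-lived spikes. */
	return queue_depth >= DEPTH_THRESHOLD;
}
```

Against the example histogram, where depths beyond 7 are almost never seen in
a system that is keeping up, a threshold in that region would rarely fire in
steady state but would fire quickly once the queue starts to grow.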

> I've not looked through the details again, but based on a relatively short
> experiment, the one problematic thing I see is the frequent postmaster
> requests.

Looking into that...






* Re: Automatically sizing the IO worker pool

From: Andres Freund @ 2026-04-08 00:30 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,

On 2026-04-08 11:18:51 +1200, Thomas Munro wrote:
> On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <[email protected]> wrote:
> > The if (worker == -1) is done for every to be submitted IO.  If there are no
> > idle workers, we'd redo the pgaio_worker_choose_idle() every time.  ISTM it
> > should just be:
> >
> >                 for (int i = 0; i < num_staged_ios; ++i)
> >                 {
> >                         Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
> >                         if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
> >                         {
> >                                 /*
> >                                  * Do the rest synchronously. If the queue is full, give up
> >                                  * and do the rest synchronously. We're holding an exclusive
> >                                  * lock on the queue so nothing can consume entries.
> >                                  */
> >                                 synchronous_ios = &staged_ios[i];
> >                                 nsync = (num_staged_ios - i);
> >
> >                                 break;
> >                         }
> >                 }
> >
> >                 /* Choose one worker to wake for this batch. */
> >                 if (worker == -1)
> >                         worker = pgaio_worker_choose_idle(-1);
> 
> Well I didn't want to wake a worker if we'd failed to enqueue
> anything.

I think it's worth waking up workers if there are idle ones and the queue is
full?



> > > No, we only set it if it isn't already set (like a latch), and only
> > > send a pmsignal when we set it (like a latch), and the postmaster only
> > > clears it if it can start a worker (unlike a latch).  That applies in
> > > general, not just when we hit the cap of io_max_workers: while the
> > > postmaster is waiting for launch interval to expire, it will leave the
> > > flag set, suppressed for 100ms or whatever, and in the special
> > > case of io_max_workers, for as long as the count remains that high.
> >
> > I'm quite certain that's not how it actually ended up working with the prior
> > version and the benchmark I showed, there indeed were a lot of requests to
> > postmaster.  I think it's because pgaio_worker_cancel_grow() (forgot the old
> > name already) very frequently clears the flag, just for it to be immediately
> > set again.
> >
> >
> > Yep, still happens, does require the max to be smaller than 32 though.
> >
> > While a lot of IO is happening, no new connections being started, and with
> > 1781562 being postmaster's pid:
> >
> > perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
> >
> >
> >              2,982      raw_syscalls:sys_enter
> >
> >        1.001881364 seconds time elapsed
> >
> >
> > I think it may need a timestamp in the shared state to not allow another
> > postmaster wake until some time has elapsed, or something.
> 
> Hnng.  Studying...

I suspect the primary reason is that pgaio_worker_request_grow() is triggered
even when io_worker_control->nworkers is >= io_max_workers.


I suspect there's also pingpong between submission not finding any workers
idle, requesting growth, and workers being idle for a short period, then the
same thing starting again.

Seems like there should be two fields. One saying "notify postmaster again"
and one "postmaster start a worker".  The former would only be cleared by
postmaster after the timeout.


> Our goal is simple: process every IO immediately.  We have immediate
> feedback that is simple: there's an IO in the queue and there is no
> idle worker.  The only action we can take is simple: add one more
> worker.  So we don't need to suffer through the maths required to
> figure out the ideal k for our M/G/k queue system (I think that's what
> we have?) or any of the inputs that would require*.  The problem is
> that on its own, the test triggered far too easily because a worker
> that is not marked idle might in fact be just about to pick up that IO

Is that case really concerning? As long as you have some rate limiting about
the start rate, starting another worker when there are no idle workers seems
harmless?  Afaict it's fairly self limiting.


> on the one hand, and because there might be rare
> spikes/clustering on the other, so I cooled it off a bit by
> additionally testing if the queue appears to be growing or spiking
> beyond some threshold.  I think it's OK to let the queue grow a bit
> before we are triggered anyway, so the precise value used doesn't seem
> too critical.  Someone might be able to come up with a more defensible
> value, but in the end I just wanted a value that isn't triggered by
> the outliers I see in real systems that are keeping up.  We could tune
> it lower and overshoot more, but this setting seems to work pretty
> well.  It doesn't seem likely that a real system could achieve a
> steady state that is introducing latency but isn't increasing over
> time, and pool size adjustments are bound to lag anyway.

Yea, I don't think the precise logic matters that much as long as we ramp up
reasonably fast without being crazy and ramp up a bit faster.

Greetings,

Andres Freund






* Re: Automatically sizing the IO worker pool

From: Thomas Munro @ 2026-04-08 02:09 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 12:30 PM Andres Freund <[email protected]> wrote:
> On 2026-04-08 11:18:51 +1200, Thomas Munro wrote:
> > >                 /* Choose one worker to wake for this batch. */
> > >                 if (worker == -1)
> > >                         worker = pgaio_worker_choose_idle(-1);
> >
> > Well I didn't want to wake a worker if we'd failed to enqueue
> > anything.
>
> I think it's worth waking up workers if there are idle ones and the queue is
> full?

True, but I prefer to test nsync because there is another reason to break:

commit 29a0fb215779d10fae0cbeb8ce57805f244bad9b
Author: Tomas Vondra <[email protected]>
Date:   Wed Mar 11 12:11:04 2026 +0100

    Conditional locking in pgaio_worker_submit_internal

I haven't finished digesting that commit, and will follow up shortly
on that topic once this patch is in.

> I suspect the primary reason is that pgaio_worker_request_grow() is triggered
> even when io_worker_control->nworkers is >= io_max_workers.

Yeah.  V6 already addressed that directly.

> I suspect there's also pingpong between submission not finding any workers
> idle, requesting growth, and workers being idle for a short period, then the
> same thing starting again.
>
> Seems like there should be two fields. One saying "notify postmaster again"
> and one "postmaster start a worker".  The former would only be cleared by
> postmaster after the timeout.

Good idea.  V7 has two tweaks:

* separate grow and grow_signal_sent flags, as you suggested
* it also applies the io_worker_launch_interval to cancelled grow requests

This seems to work pretty well for avoiding useless postmaster
wakeups.  You get a few due to cancelled grow requests, but not more
frequently than io_worker_launch_interval allows, while the pool is
vacillating during workload changes.  It soon makes its mind up and
stabilises on a good size.  To be clear, there is no change in overall
effect, only a reduction in useless wakeups.
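
For readers following along, the two-flag arrangement can be modelled roughly
like this.  It is a single-threaded sketch under assumptions, not the V7
code: the struct layout and the three function names are hypothetical, and
real shared state would need a lock or atomics around every access.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the shared flags. */
typedef struct GrowState
{
	bool		grow;				/* a worker currently wants growth */
	bool		grow_signal_sent;	/* postmaster has already been signalled */
} GrowState;

/* Worker side.  Returns true if the caller should signal the postmaster. */
static bool
request_grow(GrowState *s)
{
	s->grow = true;
	if (s->grow_signal_sent)
		return false;			/* a signal is already pending */
	s->grow_signal_sent = true;
	return true;
}

/* Worker side.  Withdraws the request but does not re-arm signalling. */
static void
cancel_grow(GrowState *s)
{
	s->grow = false;
}

/*
 * Postmaster side, called once the launch interval has expired.  Returns
 * true if a worker should be started, and re-arms signalling either way.
 */
static bool
postmaster_consume(GrowState *s)
{
	bool		start = s->grow;

	assert(s->grow_signal_sent || !s->grow);
	s->grow = false;
	s->grow_signal_sent = false;
	return start;
}
```

The point of the split is that a cancel followed by a fresh request inside
the same interval costs no extra postmaster wakeup, because only the
postmaster clears grow_signal_sent.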

I retested the value of request cancellation.  If you comment that
call out, we do tend to overshoot, so I think it's worth having.  But
you were quite right to complain about the postmaster wakeup rate it
produced.

> > Our goal is simple: process every IO immediately.  We have immediate
> > feedback that is simple: there's an IO in the queue and there is no
> > idle worker.  The only action we can take is simple: add one more
> > worker.  So we don't need to suffer through the maths required to
> > figure out the ideal k for our M/G/k queue system (I think that's what
> > we have?) or any of the inputs that would require*.  The problem is
> > that on its own, the test triggered far too easily because a worker
> > that is not marked idle might in fact be just about to pick up that IO
>
> Is that case really concerning? As long as you have some rate limiting about
> the start rate, starting another worker when there are no idle workers seems
> harmless?  Afaict it's fairly self limiting.

I retested without the depth test and I continue to think we need it.
Without it, the pool overshoots by quite a lot.  You should be able to
set io_max_workers=32 without fear of creating a ton of useless worker
processes no matter what your workload.

> > on the one hand, and because there might be rare
> > spikes/clustering on the other, so I cooled it off a bit by
> > additionally testing if the queue appears to be growing or spiking
> > beyond some threshold.  I think it's OK to let the queue grow a bit
> > before we are triggered anyway, so the precise value used doesn't seem
> > too critical.  Someone might be able to come up with a more defensible
> > value, but in the end I just wanted a value that isn't triggered by
> > the outliers I see in real systems that are keeping up.  We could tune
> > it lower and overshoot more, but this setting seems to work pretty
> > well.  It doesn't seem likely that a real system could achieve a
> > steady state that is introducing latency but isn't increasing over
> > time, and pool size adjustments are bound to lag anyway.
>
> Yea, I don't think the precise logic matters that much as long as we ramp up
> reasonably fast without being crazy and ramp up a bit faster.

Cool.


Attachments:

  [text/x-patch] v7-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch (47.0K, 2-v7-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch)
From f4ba40548f72a0a6ffb006914d820a5374fcc7fc Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v7] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3.  It
was hard to know how to tune it effectively.  It is now replaced with:

  io_min_workers=2
  io_max_workers=8 (up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent variation in demand.  It grows when existing workers detect a
backlog, and shrinks when the highest numbered worker is idle for too
long.  Work was already concentrated into low-numbered workers in
anticipation of this logic.

The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.

Reviewed-by: Dmitry Dolgov <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  69 +-
 src/backend/postmaster/postmaster.c           | 177 +++--
 src/backend/storage/aio/method_worker.c       | 631 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_parameters.dat     |  34 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +-
 src/include/storage/io_worker.h               |  11 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 801 insertions(+), 146 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2c4106ee9ab..1c8b8e7f3e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2942,16 +2942,75 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes. The default is
+         2. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which entirely idle I/O worker processes exit, reducing the
+         size of the pool to match demand.  The default is 1 minute.  This
+         parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time before another I/O worker can be launched.  This avoids
+         creating too many for an unsustained burst of activity.  The default is 100ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ae829747004..7851bf1600b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,6 +409,7 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_next_time = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -447,7 +448,8 @@ static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
 static bool maybe_reap_io_worker(int pid);
-static void maybe_adjust_io_workers(void);
+static void maybe_start_io_workers(void);
+static TimestampTz maybe_start_io_workers_scheduled_at(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
@@ -1391,7 +1393,7 @@ PostmasterMain(int argc, char *argv[])
 	UpdatePMState(PM_STARTUP);
 
 	/* Make sure we can perform I/O while starting up. */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/* Start bgwriter and checkpointer so they can help with recovery */
 	if (CheckpointerPMChild == NULL)
@@ -1555,14 +1557,15 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	TimestampTz next_wakeup;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * If in ImmediateShutdown with a SIGKILL timeout, ignore everything else
+	 * and wait for that.
+	 *
+	 * XXX Shouldn't this also test FatalError?
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown >= ImmediateShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1582,14 +1585,16 @@ DetermineSleepTime(void)
 
 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
+	/* Time of next maybe_start_io_workers() call, or 0 for none. */
+	next_wakeup = maybe_start_io_workers_scheduled_at();
+
+	/* Ignore bgworkers during shutdown. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
 
-	if (HaveCrashedWorker)
+	if (HaveCrashedWorker && Shutdown == NoShutdown)
 	{
 		dlist_mutable_iter iter;
 
@@ -2545,7 +2550,17 @@ process_pm_child_exit(void)
 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 				HandleChildCrash(pid, exitstatus, _("io worker"));
 
-			maybe_adjust_io_workers();
+			/*
+			 * A worker that exited with an error might have brought the pool
+			 * size below io_min_workers, or allowed the queue to grow to the
+			 * point where another worker called for growth.
+			 *
+			 * In the common case that a worker timed out due to idleness, no
+			 * replacement needs to be started.  maybe_start_io_workers() will
+			 * figure that out.
+			 */
+			maybe_start_io_workers();
+
 			continue;
 		}
 
@@ -3265,7 +3280,7 @@ PostmasterStateMachine(void)
 		UpdatePMState(PM_STARTUP);
 
 		/* Make sure we can perform I/O while starting up. */
-		maybe_adjust_io_workers();
+		maybe_start_io_workers();
 
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
@@ -3339,7 +3354,7 @@ LaunchMissingBackgroundProcesses(void)
 	 * A config file change will always lead to this function being called, so
 	 * we always will process the config change in a timely manner.
 	 */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/*
 	 * The checkpointer and the background writer are active from the start,
@@ -3800,6 +3815,16 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_start_io_workers() will run.
+		 */
+	}
+
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4402,44 +4427,115 @@ maybe_reap_io_worker(int pid)
 }
 
 /*
- * Start or stop IO workers, to close the gap between the number of running
- * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
- * "replacement" workers).
+ * Returns the next time at which maybe_start_io_workers() would start one or
+ * more I/O workers.  Any time in the past means ASAP, and 0 means no worker
+ * is currently scheduled.
+ *
+ * This is called by DetermineSleepTime() and also maybe_start_io_workers()
+ * itself, to make sure that they agree.
  */
-static void
-maybe_adjust_io_workers(void)
+static TimestampTz
+maybe_start_io_workers_scheduled_at(void)
 {
 	if (!pgaio_workers_enabled())
-		return;
+		return 0;
 
 	/*
 	 * If we're in final shutting down state, then we're just waiting for all
 	 * processes to exit.
 	 */
 	if (pmState >= PM_WAIT_IO_WORKERS)
-		return;
+		return 0;
 
 	/* Don't start new workers during an immediate shutdown either. */
 	if (Shutdown >= ImmediateShutdown)
-		return;
+		return 0;
 
 	/*
 	 * Don't start new workers if we're in the shutdown phase of a crash
 	 * restart. But we *do* need to start if we're already starting up again.
 	 */
 	if (FatalError && pmState >= PM_STOP_BACKENDS)
-		return;
+		return 0;
+
+	/*
+	 * Don't start a worker if we're at or above the maximum.  (Excess workers
+	 * exit when the GUC is lowered, but the count can be temporarily too high
+	 * until they are reaped.)
+	 */
+	if (io_worker_count >= io_max_workers)
+		return 0;
+
+	/* If we're under the minimum, start a worker as soon as possible. */
+	if (io_worker_count < io_min_workers)
+		return TIMESTAMP_MINUS_INFINITY;	/* start worker ASAP */
+
+	/* Only proceed if a "grow" signal has been received from a worker. */
+	if (!pgaio_worker_pm_test_grow_signal_sent())
+		return 0;
 
-	Assert(pmState < PM_WAIT_IO_WORKERS);
+	/*
+	 * maybe_start_io_workers() should start a new I/O worker after this time,
+	 * or as soon as possible if it is already in the past.
+	 */
+	return io_worker_launch_next_time;
+}
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+/*
+ * Start I/O workers if required.  Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+	TimestampTz scheduled_at;
+
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
 	{
+		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
 		int			i;
 
+		Assert(pmState < PM_WAIT_IO_WORKERS);
+
+		/* Still waiting for the scheduled time? */
+		if (scheduled_at > now)
+			break;
+
+		/*
+		 * Compute next launch time relative to the previous value, so that
+		 * time spent on the postmaster's other duties doesn't result in an
+		 * inaccurate launch interval.
+		 */
+		io_worker_launch_next_time =
+			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+										io_worker_launch_interval);
+
+		/*
+		 * Check if a grow signal has been sent, but the grow request has been
+		 * canceled since then because the workers ran out of work.  We've
+		 * still advanced the next launch time, so we won't consider any more
+		 * grow signals until then.  That prevents workers from signaling more
+		 * than once in that time period, because we won't clear
+		 * grow_signal_sent until then.
+		 */
+		if (io_worker_count >= io_min_workers && !pgaio_worker_pm_test_grow())
+		{
+			pgaio_worker_pm_clear_grow_signal_sent();
+			break;
+		}
+
+		/*
+		 * If that's already in the past, the interval is either impossibly
+		 * short or we received no requests for new workers for a period.
+		 * Compute a new future time relative to the last launch time instead.
+		 */
+		if (io_worker_launch_next_time <= now)
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(now, io_worker_launch_interval);
+
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4457,22 +4553,21 @@ maybe_adjust_io_workers(void)
 			++io_worker_count;
 		}
 		else
-			break;				/* try again next time */
-	}
-
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
+			/*
+			 * Fork failure: we'll try again after the launch interval
+			 * expires, or be called again without delay if we don't yet have
+			 * io_min_workers.  Don't loop here though, the postmaster has
+			 * other duties.
+			 */
+			break;
 		}
 	}
+
+	/*
+	 * Workers decide when to shut down by themselves, according to the
+	 * io_max_workers and io_worker_idle_timeout GUCs.
+	 */
 }
 
 
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index eb686cede1a..10ae8a2fb50 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,8 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * The pool of workers tries to stabilize at a size that can handle recently
+ * seen variation in demand, within the configured limits.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -29,6 +28,8 @@
 
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
 #include "tcop/tcopprot.h"
@@ -48,10 +51,22 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
+/*
+ * Saturation for counters used to estimate wakeup:IO ratio.
+ *
+ * We maintain hist_wakeups for wakeups received and hist_ios for IOs
+ * processed by each worker.  When either counter reaches this saturation
+ * value, we divide both by two.  The result is an exponentially decaying
+ * ratio of wakeups to IOs, with a very short memory.
+ *
+ * If a worker is itself experiencing useless wakeups, it assumes that
+ * higher-numbered workers would experience even more, so it should end the
+ * chain.
+ */
+#define PGAIO_WORKER_WAKEUP_RATIO_SATURATE 4
 
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,13 +78,35 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction.  If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+
+#define PGAIO_WORKERSET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKERSET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	bool		grow;
+	bool		grow_signal_sent;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_workerset;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet workerset;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
@@ -91,15 +128,108 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 2;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 100;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
 
+static void
+pgaio_workerset_initialize(PgAioWorkerSet *set)
+{
+	*set = 0;
+}
+
+static bool
+pgaio_workerset_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_workerset_singleton(int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_workerset_all(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKERSET_BITS - MAX_IO_WORKERS);
+}
+
+static void
+pgaio_workerset_subtract(PgAioWorkerSet *set1, const PgAioWorkerSet *set2)
+{
+	*set1 &= ~*set2;
+}
+
+static void
+pgaio_workerset_insert(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set |= pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= ~pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove_lte(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= (~(PgAioWorkerSet) 0) << (worker + 1);
+}
+
+static int
+pgaio_workerset_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_get_lowest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_rightmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_pop_lowest(PgAioWorkerSet *set)
+{
+	int			worker = pgaio_workerset_get_lowest(set);
+
+	pgaio_workerset_remove(set, worker);
+	return worker;
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_workerset_contains(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return (*set & pgaio_workerset_singleton(worker)) != 0;
+}
+
+static int
+pgaio_workerset_count(PgAioWorkerSet *set)
+{
+	return pg_popcount64(*set);
+}
+#endif
+
 static void
 pgaio_worker_shmem_request(void *arg)
 {
@@ -133,37 +263,159 @@ pgaio_worker_shmem_init(void *arg)
 	io_worker_submission_queue->size = queue_size;
 	io_worker_submission_queue->head = 0;
 	io_worker_submission_queue->tail = 0;
+	io_worker_control->grow = false;
+	pgaio_workerset_initialize(&io_worker_control->workerset);
+	pgaio_workerset_initialize(&io_worker_control->idle_workerset);
 
-	io_worker_control->idle_worker_mask = 0;
 	for (int i = 0; i < MAX_IO_WORKERS; ++i)
+		io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+}
+
+/*
+ * Tell postmaster that we think a new worker is needed.
+ */
+static void
+pgaio_worker_request_grow(void)
+{
+	/*
+	 * Suppress useless signaling if we already know that we're at the
+	 * maximum.  This uses an unlocked read of nworkers, but that's OK for
+	 * this heuristic purpose.
+	 */
+	if (io_worker_control->nworkers < io_max_workers)
 	{
-		io_worker_control->workers[i].latch = NULL;
-		io_worker_control->workers[i].in_use = false;
+		if (!io_worker_control->grow)
+		{
+			io_worker_control->grow = true;
+			pg_memory_barrier();
+
+			/*
+			 * If the postmaster has already been signaled, don't do it again
+			 * until the postmaster clears this flag.  There is no point in
+			 * repeated signals if grow is being set and cleared repeatedly
+			 * while the postmaster is waiting for io_worker_launch_interval
+			 * (which it applies even to canceled requests).
+			 */
+			if (!io_worker_control->grow_signal_sent)
+			{
+				io_worker_control->grow_signal_sent = true;
+				pg_memory_barrier();
+				SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
+			}
+		}
 	}
 }
 
+/*
+ * Cancel any request for a new worker, after observing an empty queue.
+ */
+static void
+pgaio_worker_cancel_grow(void)
+{
+	if (io_worker_control->grow)
+	{
+		io_worker_control->grow = false;
+		pg_memory_barrier();
+	}
+}
+
+/*
+ * Called by the postmaster to check if a new worker has been requested (but
+ * possibly canceled since).
+ */
+bool
+pgaio_worker_pm_test_grow_signal_sent(void)
+{
+	pg_memory_barrier();
+	return io_worker_control && io_worker_control->grow_signal_sent;
+}
+
+/*
+ * Called by the postmaster to check if a new worker has been requested and
+ * not canceled since.
+ */
+bool
+pgaio_worker_pm_test_grow(void)
+{
+	pg_memory_barrier();
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the request for a new worker.
+ */
+void
+pgaio_worker_pm_clear_grow_signal_sent(void)
+{
+	if (io_worker_control)
+	{
+		io_worker_control->grow = false;
+		io_worker_control->grow_signal_sent = false;
+	}
+	pg_memory_barrier();
+}
+
 static int
-pgaio_worker_choose_idle(void)
+pgaio_worker_choose_idle(int only_workers_above)
 {
+	PgAioWorkerSet workerset;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	workerset = io_worker_control->idle_workerset;
+	if (only_workers_above >= 0)
+		pgaio_workerset_remove_lte(&workerset, only_workers_above);
+	if (pgaio_workerset_is_empty(&workerset))
 		return -1;
 
-	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	/* Find the lowest numbered idle worker and mark it not idle. */
+	worker = pgaio_workerset_get_lowest(&workerset);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_workerset.  That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
+/*
+ * Try to wake a set of workers.  Used on pool change, to close races
+ * described in the callers.
+ */
+static void
+pgaio_workerset_wake(PgAioWorkerSet workerset)
+{
+	while (!pgaio_workerset_is_empty(&workerset))
+		pgaio_worker_wake(pgaio_workerset_pop_lowest(&workerset));
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -185,6 +437,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	int			result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return -1;				/* empty */
@@ -201,6 +455,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -226,8 +482,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -251,20 +506,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 				break;
 			}
-
-			if (wakeup == NULL)
-			{
-				/* Choose an idle worker to wake up if we haven't already. */
-				worker = pgaio_worker_choose_idle();
-				if (worker >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
-
-				pgaio_debug_io(DEBUG4, staged_ios[i],
-							   "choosing worker %d",
-							   worker);
-			}
 		}
+		/* Choose one worker to wake for this batch. */
+		if (nsync < num_staged_ios)
+			worker = pgaio_worker_choose_idle(-1);
 		LWLockRelease(AioWorkerSubmissionQueueLock);
+
+		/* Wake up chosen worker.  It will wake peers if necessary. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
 	}
 	else
 	{
@@ -273,9 +523,6 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		nsync = num_staged_ios;
 	}
 
-	if (wakeup)
-		SetLatch(wakeup);
-
 	/* Run whatever is left synchronously. */
 	if (nsync > 0)
 	{
@@ -295,14 +542,30 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_workerset_contains(&io_worker_control->workerset, MyIoWorkerId));
+	pgaio_workerset_remove(&io_worker_control->workerset, MyIoWorkerId);
+	notify_set = io_worker_control->workerset;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/*
+	 * Notify other workers on pool change.  This allows the new highest
+	 * worker to know that it is now the one that can time out, and closes a
+	 * wakeup-loss race described in pgaio_worker_wake().
+	 */
+	pgaio_workerset_wake(notify_set);
 }
 
 /*
@@ -312,33 +575,38 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_workerset;
+	PgAioWorkerSet old_workerset;
+
 	MyIoWorkerId = -1;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	/* Find lowest unused worker ID. */
+	pgaio_workerset_all(&free_workerset);
+	pgaio_workerset_subtract(&free_workerset, &io_worker_control->workerset);
+	if (!pgaio_workerset_is_empty(&free_workerset))
+		MyIoWorkerId = pgaio_workerset_get_lowest(&free_workerset);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
-	}
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
 
-	if (MyIoWorkerId == -1)
-		elog(ERROR, "couldn't find a free worker slot");
+	old_workerset = io_worker_control->workerset;
+	Assert(!pgaio_workerset_contains(&old_workerset, MyIoWorkerId));
+	pgaio_workerset_insert(&io_worker_control->workerset, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(io_worker_control->nworkers <= MAX_IO_WORKERS);
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	/*
+	 * Notify other workers on pool change.  If we were the highest worker,
+	 * this allows the new highest worker to know that it can time out.
+	 */
+	pgaio_workerset_wake(old_workerset);
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -364,14 +632,48 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	PgAioWorkerSet workerset;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	workerset = io_worker_control->workerset;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_workerset_get_highest(&workerset))
+		return false;
+
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			hist_ios = 0;
+	int			hist_wakeups = 0;
 
 	AuxiliaryProcessMainCommon();
 
@@ -439,10 +741,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		maybe_grow = false;
 
 		/*
 		 * Try to get a job to do.
@@ -453,38 +754,107 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_workerset_insert(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_workerset_remove(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do that
+			 * if this worker is not receiving spurious wakeups itself.  The
+			 * intention is to create a frontier beyond which idle workers stay
+			 * asleep.
+			 *
+			 * This heuristic tries to discover the useful wakeup propagation
+			 * chain length when IOs are very fast and workers wake up to find
+			 * that all IOs have already been taken.
+			 *
+			 * If we chose not to wake a worker when we ideally should have,
+			 * then ios will soon exceed wakeups.
+			 */
+			if (hist_wakeups <= hist_ios)
 			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
+				queue_depth = pgaio_worker_submission_queue_depth();
+				if (queue_depth > 0)
+				{
+					/* Choose a worker higher than me to wake. */
+					worker = pgaio_worker_choose_idle(MyIoWorkerId);
+					if (worker == -1)
+						maybe_grow = true;
+				}
 			}
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+		{
+			pgaio_worker_wake(worker);
+		}
+		else if (maybe_grow)
+		{
+			/*
+			 * We know there was at least one more item in the queue, and we
+			 * failed to find a higher-numbered idle worker to wake.  Now we
+			 * decide if we should try to start one more worker.
+			 *
+			 * We do this with a simple heuristic: is the queue depth greater
+			 * than the current number of workers?
+			 *
+			 * Consider the following situations:
+			 *
+			 * 1. The queue depth is constantly increasing, because IOs are
+			 * arriving faster than they can possibly be serviced.  It doesn't
+			 * matter much which threshold we choose, as we will surely hit
+			 * it.  Crossing the current worker count is a useful signal
+			 * because it's clearly too deep to avoid queuing latency already,
+			 * but still leaves a small window of opportunity to improve the
+			 * situation before the queue overflows.
+			 *
+			 * 2. The worker pool is keeping up, no latency is being
+			 * introduced and an extra worker would be a waste of resources.
+			 * Queue depth distributions tend to be heavily skewed, with long
+			 * tails of low probability spikes (due to submission clustering,
+			 * scheduling, jitter, stalls, noisy neighbors, etc).  We want a
+			 * number that is very unlikely to be triggered by an outlier, and
+			 * we bet that an exponential or similar distribution whose
+			 * outliers never reach this threshold must be almost entirely
+			 * concentrated at the low end.  If we do see a spike as big as
+			 * the worker count, we take it as a signal that the distribution
+			 * is surely too wide.
+			 *
+			 * On its own, this is an extremely crude signal.  When combined
+			 * with the wakeup propagation test that precedes it (but on its
+			 * own tends to overshoot) and the io_worker_launch_interval, we
+			 * gradually try each pool size until we find one that doesn't
+			 * trigger further growth.
+			 *
+			 * XXX Ideas from queueing theory or control theory could surely
+			 * do a much better job of this.
+			 */
+
+			/* Read nworkers without lock for this heuristic purpose. */
+			if (queue_depth > io_worker_control->nworkers)
+				pgaio_worker_request_grow();
+		}
 
 		if (io_index != -1)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++hist_ios == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+			{
+				hist_wakeups /= 2;
+				hist_ios /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -537,6 +907,19 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			}
 #endif
 
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			{
+				char	   *description = pgaio_io_get_target_description(ioh);
+
+				sprintf(cmd, "%d: [%s] %s",
+						MyIoWorkerId,
+						pgaio_io_get_op_name(ioh),
+						description);
+				pfree(description);
+				set_ps_display(cmd);
+			}
+#endif
+
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -550,8 +933,76 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker request if pending. */
+			pgaio_worker_cancel_grow();
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* Only the highest-numbered worker can time out. */
+				if (pgaio_worker_can_timeout())
+				{
+					if (idle_timeout_abs == 0)
+					{
+						/*
+						 * I have just been promoted to the timeout worker, or
+						 * the GUC changed.  Compute new absolute time from
+						 * now.
+						 */
+						idle_timeout_abs =
+							TimestampTzPlusMilliseconds(now,
+														io_worker_idle_timeout);
+						timeout_guc_used = io_worker_idle_timeout;
+					}
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now, idle_timeout_abs);
+				}
+				else
+				{
+					/* No timeout for me. */
+					idle_timeout_abs = 0;
+					timeout_ms = -1;
+				}
+			}
+
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: idle, wakeups:ios = %d:%d",
+					MyIoWorkerId, hist_wakeups, hist_ios);
+			set_ps_display(cmd);
+#endif
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++hist_wakeups == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+				{
+					hist_wakeups /= 2;
+					hist_ios /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -561,6 +1012,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7bda5298558..560659f9568 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 LogicalDecodingControl	"Waiting to read or update logical decoding status information."
 DataChecksumsWorker	"Waiting for data checksums worker."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 86c1eba5dab..83af594d4af 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1390,6 +1390,14 @@
   check_hook => 'check_io_max_concurrency',
 },
 
+{ name => 'io_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_max_workers',
+  boot_val => '8',
+  min => '1',
+  max => 'MAX_IO_WORKERS',
+},
+
 { name => 'io_method', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_IO',
   short_desc => 'Selects the method for executing asynchronous I/O.',
   variable => 'io_method',
@@ -1398,14 +1406,32 @@
   assign_hook => 'assign_io_method',
 },
 
-{ name => 'io_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
-  short_desc => 'Number of IO worker processes, for io_method=worker.',
-  variable => 'io_workers',
-  boot_val => '3',
+{ name => 'io_min_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_min_workers',
+  boot_val => '2',
   min => '1',
   max => 'MAX_IO_WORKERS',
 },
 
+{ name => 'io_worker_idle_timeout', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum time before idle I/O worker processes time out, for io_method=worker.',
+  variable => 'io_worker_idle_timeout',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '60000',
+  min => '0',
+  max => 'INT_MAX',
+},
+
+{ name => 'io_worker_launch_interval', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum time before launching a new I/O worker process, for io_method=worker.',
+  variable => 'io_worker_launch_interval',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX',
+},
+
 # Not for general use --- used by SET SESSION AUTHORIZATION and SET
 # ROLE
 { name => 'is_superuser', type => 'bool', context => 'PGC_INTERNAL', group => 'UNGROUPED',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f2bbf05295..5e1e49f0ae8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -222,7 +222,11 @@
                                         # can execute simultaneously
                                         # -1 sets based on shared_buffers
                                         # (change requires restart)
-#io_workers = 3                         # 1-32;
+
+#io_min_workers = 2                     # 1-32 (change requires pg_reload_conf())
+#io_max_workers = 8                     # 1-32
+#io_worker_idle_timeout = 60s
+#io_worker_launch_interval = 100ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index f7d5998a138..c852c9f3741 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,15 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+/* Public GUCs. */
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_pm_test_grow_signal_sent(void);
+extern void pgaio_worker_pm_clear_grow_signal_sent(void);
+extern bool pgaio_worker_pm_test_grow(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index af8553bcb6c..d7eb648bd27 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -88,6 +88,7 @@ PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
 PG_LWLOCK(55, LogicalDecodingControl)
 PG_LWLOCK(56, DataChecksumsWorker)
+PG_LWLOCK(57, AioWorkerControl)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 001e6eea61c..bcce4011790 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_GROW,	/* I/O worker pool wants to grow */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index 34bc132ea08..b9775811d4d 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,24 +65,24 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		like(
 			$stderr,
-			qr/$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			qr/$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2dfe1b38826..3dea516912c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2271,6 +2271,7 @@ PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgAioWorkerControl
+PgAioWorkerSet
 PgAioWorkerSlot
 PgAioWorkerSubmissionQueue
 PgArchData
-- 
2.53.0




* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 10:39                   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-07 19:01                     ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 23:18                       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-08 00:30                         ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-08 02:09                           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-08 02:20                             ` Andres Freund <[email protected]>
  2026-04-08 02:47                               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Andres Freund @ 2026-04-08 02:20 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,

On 2026-04-08 14:09:16 +1200, Thomas Munro wrote:
> On Wed, Apr 8, 2026 at 12:30 PM Andres Freund <[email protected]> wrote:
> > On 2026-04-08 11:18:51 +1200, Thomas Munro wrote:
> > > >                 /* Choose one worker to wake for this batch. */
> > > >                 if (worker == -1)
> > > >                         worker = pgaio_worker_choose_idle(-1);
> > >
> > > Well I didn't want to wake a worker if we'd failed to enqueue
> > > anything.
> >
> > I think it's worth waking up workers if there are idle ones and the queue is
> > full?
> 
> True, but I prefer to test nsync because there is another reason to break:

I don't follow.  What I was proposing is after the conditional lock
acquisition succeeded.  So is your nsync == 0 check.

> +/*
> + * Tell postmaster that we think a new worker is needed.
> + */
> +static void
> +pgaio_worker_request_grow(void)
> +{
> +	/*
> +	 * Suppress useless signaling if we already know that we're at the
> +	 * maximum.  This uses an unlocked read of nworkers, but that's OK for
> +	 * this heuristic purpose.
> +	 */
> +	if (io_worker_control->nworkers < io_max_workers)
>  	{
> -		io_worker_control->workers[i].latch = NULL;
> -		io_worker_control->workers[i].in_use = false;
> +		if (!io_worker_control->grow)
> +		{
> +			io_worker_control->grow = true;
> +			pg_memory_barrier();
> +
> +			/*
> +			 * If the postmaster has already been signaled, don't do it again
> +			 * until the postmaster clears this flag.  There is no point in
> +			 * repeated signals if grow is being set and cleared repeatedly
> +			 * while the postmaster is waiting for io_worker_launch_interval
> +			 * (which it applies even to canceled requests).
> +			 */
> +			if (!io_worker_control->grow_signal_sent)
> +			{
> +				io_worker_control->grow_signal_sent = true;
> +				pg_memory_barrier();
> +				SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
> +			}
> +		}
>  	}
>  }


I'd probably use early returns to make it a bit more readable.



> +static bool
> +pgaio_worker_can_timeout(void)
> +{
> +	PgAioWorkerSet workerset;
> +
> +	/* Serialize against pool size changes. */
> +	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
> +	workerset = io_worker_control->workerset;
> +	LWLockRelease(AioWorkerControlLock);
> +
> +	if (MyIoWorkerId != pgaio_workerset_get_highest(&workerset))
> +		return false;
> +
> +	if (MyIoWorkerId < io_min_workers)
> +		return false;
> +
> +	return true;
> +}

I guess I'd move the < io_min_workers check earlier so that you don't acquire
the lock if it'll return false anyway.


Greetings,

Andres Freund






* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 10:39                   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-07 19:01                     ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 23:18                       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-08 00:30                         ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-08 02:09                           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-08 02:20                             ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
@ 2026-04-08 02:47                               ` Thomas Munro <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: Thomas Munro @ 2026-04-08 02:47 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 2:20 PM Andres Freund <[email protected]> wrote:
> I don't follow.  What I was proposing is after the conditional lock
> acquisition succeeded.  So is your nsync == 0 check.

Oops.  Sorry, brainfade.  Condition removed.

> > +/*
> > + * Tell postmaster that we think a new worker is needed.
> > + */
> > +static void
> > +pgaio_worker_request_grow(void)
> > +{
> > +     /*
> > +      * Suppress useless signaling if we already know that we're at the
> > +      * maximum.  This uses an unlocked read of nworkers, but that's OK for
> > +      * this heuristic purpose.
> > +      */
> > +     if (io_worker_control->nworkers < io_max_workers)
> >       {
> > -             io_worker_control->workers[i].latch = NULL;
> > -             io_worker_control->workers[i].in_use = false;
> > +             if (!io_worker_control->grow)
> > +             {
> > +                     io_worker_control->grow = true;
> > +                     pg_memory_barrier();
> > +
> > +                     /*
> > +                      * If the postmaster has already been signaled, don't do it again
> > +                      * until the postmaster clears this flag.  There is no point in
> > +                      * repeated signals if grow is being set and cleared repeatedly
> > +                      * while the postmaster is waiting for io_worker_launch_interval
> > +                      * (which it applies even to canceled requests).
> > +                      */
> > +                     if (!io_worker_control->grow_signal_sent)
> > +                     {
> > +                             io_worker_control->grow_signal_sent = true;
> > +                             pg_memory_barrier();
> > +                             SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
> > +                     }
> > +             }
> >       }
> >  }
>
>
> I'd probably use early returns to make it a bit more readable.

Done for this and similar functions.
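
As a rough illustration of the early-return shape being discussed (a sketch
only, not the code in the attached patch): the shared-memory struct, the
barrier, and the postmaster signal are replaced by hypothetical sketch_*
stand-ins so that the control flow can be read and compiled on its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared worker control state. */
typedef struct SketchWorkerControl
{
	int			nworkers;
	bool		grow;
	bool		grow_signal_sent;
} SketchWorkerControl;

static SketchWorkerControl sketch_control;
static int	sketch_max_workers = 8; /* stands in for io_max_workers */
static int	sketch_signals_sent = 0;

/* Stand-in for pg_memory_barrier(). */
static void
sketch_memory_barrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Stand-in for SendPostmasterSignal(); just counts calls. */
static void
sketch_send_postmaster_signal(void)
{
	++sketch_signals_sent;
}

static void
sketch_request_grow(void)
{
	assert(sketch_max_workers >= 1);

	/* Already at the maximum (unlocked, heuristic read): nothing to do. */
	if (sketch_control.nworkers >= sketch_max_workers)
		return;

	/* A grow request is already pending. */
	if (sketch_control.grow)
		return;

	sketch_control.grow = true;
	sketch_memory_barrier();

	/* Already signaled, and the flag has not been cleared yet. */
	if (sketch_control.grow_signal_sent)
		return;

	sketch_control.grow_signal_sent = true;
	sketch_memory_barrier();
	sketch_send_postmaster_signal();
}
```

Each guard mirrors one level of nesting in the quoted version, so the
behavior is meant to be identical; only the indentation depth changes.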

> > +static bool
> > +pgaio_worker_can_timeout(void)
> > +{
> > +     PgAioWorkerSet workerset;
> > +
> > +     /* Serialize against pool size changes. */
> > +     LWLockAcquire(AioWorkerControlLock, LW_SHARED);
> > +     workerset = io_worker_control->workerset;
> > +     LWLockRelease(AioWorkerControlLock);
> > +
> > +     if (MyIoWorkerId != pgaio_workerset_get_highest(&workerset))
> > +             return false;
> > +
> > +     if (MyIoWorkerId < io_min_workers)
> > +             return false;
> > +
> > +     return true;
> > +}
>
> I guess I'd move the < io_min_workers check earlier so that you don't acquire
> the lock if it'll return false anyway.

Done.
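
A minimal sketch of that reordering, for illustration only (not the code in
the attached patch): the LWLock and the worker set are replaced by
hypothetical sketch_* stand-ins, and a counter records how often the "lock"
would have been taken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the lock-protected worker set. */
static int	sketch_lock_acquisitions = 0;
static int	sketch_highest_worker = 0;

/* Stands in for taking the lock in shared mode and copying the set. */
static int
sketch_read_highest_worker_locked(void)
{
	++sketch_lock_acquisitions;
	return sketch_highest_worker;
}

static bool
sketch_can_timeout(int my_id, int min_workers)
{
	assert(my_id >= 0);

	/* Cheap local test first: workers below the minimum never time out. */
	if (my_id < min_workers)
		return false;

	/* Only now pay for the lock, to see whether we are the highest. */
	return my_id == sketch_read_highest_worker_locked();
}
```

With this ordering the workers numbered below the minimum, which are the
ones that stay around permanently, never touch the lock on their idle path.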


Attachments:

  [text/x-patch] v8-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch (47.0K, 2-v8-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch)
From a8f35b1de3af96d3194339f2a0027c002109408c Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v8] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3.  It
was hard to know how to tune it effectively.  It is now replaced with:

  io_min_workers=2
  io_max_workers=8 (up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent variation in demand.  It grows when existing workers detect a
backlog, and shrinks when the highest numbered worker is idle for too
long.  Work was already concentrated into low-numbered workers in
anticipation of this logic.

The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.

Reviewed-by: Dmitry Dolgov <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  69 +-
 src/backend/postmaster/postmaster.c           | 177 +++--
 src/backend/storage/aio/method_worker.c       | 635 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_parameters.dat     |  34 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +-
 src/include/storage/io_worker.h               |  11 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 803 insertions(+), 148 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2c4106ee9ab..1c8b8e7f3e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2942,16 +2942,75 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes. The default is
+         2. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which entirely idle I/O worker processes exit, reducing the
+         size of the pool to match demand.  The default is 1 minute.  This
+         parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time before another I/O worker can be launched.  This avoids
+         creating too many for an unsustained burst of activity.  The default is 100ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ae829747004..62406b27d70 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,6 +409,7 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_next_time = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -447,7 +448,8 @@ static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
 static bool maybe_reap_io_worker(int pid);
-static void maybe_adjust_io_workers(void);
+static void maybe_start_io_workers(void);
+static TimestampTz maybe_start_io_workers_scheduled_at(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
@@ -1391,7 +1393,7 @@ PostmasterMain(int argc, char *argv[])
 	UpdatePMState(PM_STARTUP);
 
 	/* Make sure we can perform I/O while starting up. */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/* Start bgwriter and checkpointer so they can help with recovery */
 	if (CheckpointerPMChild == NULL)
@@ -1555,14 +1557,15 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	TimestampTz next_wakeup;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * If in ImmediateShutdown with a SIGKILL timeout, ignore everything else
+	 * and wait for that.
+	 *
+	 * XXX Shouldn't this also test FatalError?
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown >= ImmediateShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1582,14 +1585,16 @@ DetermineSleepTime(void)
 
 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
+	/* Time of next maybe_start_io_workers() call, or 0 for none. */
+	next_wakeup = maybe_start_io_workers_scheduled_at();
+
+	/* Ignore bgworkers during shutdown. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
 
-	if (HaveCrashedWorker)
+	if (HaveCrashedWorker && Shutdown == NoShutdown)
 	{
 		dlist_mutable_iter iter;
 
@@ -2545,7 +2550,17 @@ process_pm_child_exit(void)
 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 				HandleChildCrash(pid, exitstatus, _("io worker"));
 
-			maybe_adjust_io_workers();
+			/*
+			 * A worker that exited with an error might have brought the pool
+			 * size below io_min_workers, or allowed the queue to grow to the
+			 * point where another worker called for growth.
+			 *
+			 * In the common case that a worker timed out due to idleness, no
+			 * replacement needs to be started.  maybe_start_io_workers() will
+			 * figure that out.
+			 */
+			maybe_start_io_workers();
+
 			continue;
 		}
 
@@ -3265,7 +3280,7 @@ PostmasterStateMachine(void)
 		UpdatePMState(PM_STARTUP);
 
 		/* Make sure we can perform I/O while starting up. */
-		maybe_adjust_io_workers();
+		maybe_start_io_workers();
 
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
@@ -3339,7 +3354,7 @@ LaunchMissingBackgroundProcesses(void)
 	 * A config file change will always lead to this function being called, so
 	 * we always will process the config change in a timely manner.
 	 */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/*
 	 * The checkpointer and the background writer are active from the start,
@@ -3800,6 +3815,16 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_start_io_workers() will run.
+		 */
+	}
+
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4402,44 +4427,115 @@ maybe_reap_io_worker(int pid)
 }
 
 /*
- * Start or stop IO workers, to close the gap between the number of running
- * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
- * "replacement" workers).
+ * Returns the next time at which maybe_start_io_workers() would start one or
+ * more I/O workers.  Any time in the past means ASAP, and 0 means no worker
+ * is currently scheduled.
+ *
+ * This is called by DetermineSleepTime() and also maybe_start_io_workers()
+ * itself, to make sure that they agree.
  */
-static void
-maybe_adjust_io_workers(void)
+static TimestampTz
+maybe_start_io_workers_scheduled_at(void)
 {
 	if (!pgaio_workers_enabled())
-		return;
+		return 0;
 
 	/*
 	 * If we're in final shutting down state, then we're just waiting for all
 	 * processes to exit.
 	 */
 	if (pmState >= PM_WAIT_IO_WORKERS)
-		return;
+		return 0;
 
 	/* Don't start new workers during an immediate shutdown either. */
 	if (Shutdown >= ImmediateShutdown)
-		return;
+		return 0;
 
 	/*
 	 * Don't start new workers if we're in the shutdown phase of a crash
 	 * restart. But we *do* need to start if we're already starting up again.
 	 */
 	if (FatalError && pmState >= PM_STOP_BACKENDS)
-		return;
+		return 0;
+
+	/*
+	 * Don't start a worker if we're at or above the maximum.  (Excess workers
+	 * exit when the GUC is lowered, but the count can be temporarily too high
+	 * until they are reaped.)
+	 */
+	if (io_worker_count >= io_max_workers)
+		return 0;
+
+	/* If we're under the minimum, start a worker as soon as possible. */
+	if (io_worker_count < io_min_workers)
+		return TIMESTAMP_MINUS_INFINITY;	/* start worker ASAP */
+
+	/* Only proceed if a "grow" signal has been received from a worker. */
+	if (!pgaio_worker_pm_test_grow_signal_sent())
+		return 0;
 
-	Assert(pmState < PM_WAIT_IO_WORKERS);
+	/*
+	 * maybe_start_io_workers() should start a new I/O worker after this time,
+	 * or as soon as possible if it is already in the past.
+	 */
+	return io_worker_launch_next_time;
+}
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+/*
+ * Start I/O workers if required.  Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+	TimestampTz scheduled_at;
+
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
 	{
+		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
 		int			i;
 
+		Assert(pmState < PM_WAIT_IO_WORKERS);
+
+		/* Still waiting for the scheduled time? */
+		if (scheduled_at > now)
+			break;
+
+		/*
+		 * Compute next launch time relative to the previous value, so that
+		 * time spent on the postmaster's other duties doesn't result in an
+		 * inaccurate launch interval.
+		 */
+		io_worker_launch_next_time =
+			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+										io_worker_launch_interval);
+
+		/*
+		 * If that's already in the past, the interval is either impossibly
+		 * short or we received no requests for new workers for a period.
+		 * Compute a new future time relative to the current time instead.
+		 */
+		if (io_worker_launch_next_time <= now)
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(now, io_worker_launch_interval);
+
+		/*
+		 * Check if a grow signal has been sent, but the grow request has been
+		 * canceled since then because the workers ran out of work.  We've
+		 * still advanced the next launch time, so we won't consider any more
+		 * grow signals until then.  That prevents workers from signaling more
+		 * than once in that time period, because we won't clear
+		 * grow_signal_sent until then.
+		 */
+		if (io_worker_count >= io_min_workers && !pgaio_worker_pm_test_grow())
+		{
+			pgaio_worker_pm_clear_grow_signal_sent();
+			break;
+		}
+
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4457,22 +4553,21 @@ maybe_adjust_io_workers(void)
 			++io_worker_count;
 		}
 		else
-			break;				/* try again next time */
-	}
-
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
+			/*
+			 * Fork failure: we'll try again after the launch interval
+			 * expires, or be called again without delay if we don't yet have
+			 * io_min_workers.  Don't loop here though, the postmaster has
+			 * other duties.
+			 */
+			break;
 		}
 	}
+
+	/*
+	 * Workers decide when to shut down by themselves, according to the
+	 * io_max_workers and io_worker_idle_timeout GUCs.
+	 */
 }
 
 
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index eb686cede1a..e7f60623348 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,8 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * The pool of workers tries to stabilize at a size that can handle recently
+ * seen variation in demand, within the configured limits.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -29,6 +28,8 @@
 
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
 #include "tcop/tcopprot.h"
@@ -48,10 +51,22 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
+/*
+ * Saturation for counters used to estimate wakeup:IO ratio.
+ *
+ * We maintain hist_wakeups for wakeups received and hist_ios for IOs
+ * processed by each worker.  When either counter reaches this saturation
+ * value, we divide both by two.  The result is an exponentially decaying
+ * ratio of wakeups to IOs, with a very short memory.
+ *
+ * If a worker is itself experiencing useless wakeups, it assumes that
+ * higher-numbered workers would experience even more, so it should end the
+ * chain.
+ */
+#define PGAIO_WORKER_WAKEUP_RATIO_SATURATE 4
 
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,13 +78,35 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction.  If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+
+#define PGAIO_WORKERSET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKERSET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	bool		grow;
+	bool		grow_signal_sent;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_workerset;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet workerset;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
@@ -91,15 +128,108 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 2;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 100;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
 
+static void
+pgaio_workerset_initialize(PgAioWorkerSet *set)
+{
+	*set = 0;
+}
+
+static bool
+pgaio_workerset_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_workerset_singleton(int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_workerset_all(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKERSET_BITS - MAX_IO_WORKERS);
+}
+
+static void
+pgaio_workerset_subtract(PgAioWorkerSet *set1, const PgAioWorkerSet *set2)
+{
+	*set1 &= ~*set2;
+}
+
+static void
+pgaio_workerset_insert(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set |= pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= ~pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove_lte(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= (~(PgAioWorkerSet) 0) << (worker + 1);
+}
+
+static int
+pgaio_workerset_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_get_lowest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_rightmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_pop_lowest(PgAioWorkerSet *set)
+{
+	int			worker = pgaio_workerset_get_lowest(set);
+
+	pgaio_workerset_remove(set, worker);
+	return worker;
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_workerset_contains(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return (*set & pgaio_workerset_singleton(worker)) != 0;
+}
+
+static int
+pgaio_workerset_count(PgAioWorkerSet *set)
+{
+	return pg_popcount64(*set);
+}
+#endif
+
 static void
 pgaio_worker_shmem_request(void *arg)
 {
@@ -133,37 +263,160 @@ pgaio_worker_shmem_init(void *arg)
 	io_worker_submission_queue->size = queue_size;
 	io_worker_submission_queue->head = 0;
 	io_worker_submission_queue->tail = 0;
+	io_worker_control->grow = false;
+	pgaio_workerset_initialize(&io_worker_control->workerset);
+	pgaio_workerset_initialize(&io_worker_control->idle_workerset);
 
-	io_worker_control->idle_worker_mask = 0;
 	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		io_worker_control->workers[i].latch = NULL;
-		io_worker_control->workers[i].in_use = false;
-	}
+		io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+}
+
+/*
+ * Tell postmaster that we think a new worker is needed.
+ */
+static void
+pgaio_worker_request_grow(void)
+{
+	/*
+	 * Suppress useless signaling if we already know that we're at the
+	 * maximum.  This uses an unlocked read of nworkers, but that's OK for
+	 * this heuristic purpose.
+	 */
+	if (io_worker_control->nworkers >= io_max_workers)
+		return;
+
+	/* Already requested? */
+	if (io_worker_control->grow)
+		return;
+
+	io_worker_control->grow = true;
+	pg_memory_barrier();
+
+	/*
+	 * If the postmaster has already been signaled, don't do it again until
+	 * the postmaster clears this flag.  There is no point in repeated signals
+	 * if grow is being set and cleared repeatedly while the postmaster is
+	 * waiting for io_worker_launch_interval, which it applies even to
+	 * canceled requests.
+	 */
+	if (io_worker_control->grow_signal_sent)
+		return;
+
+	io_worker_control->grow_signal_sent = true;
+	pg_memory_barrier();
+	SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
+}
+
+/*
+ * Cancel any request for a new worker, after observing an empty queue.
+ */
+static void
+pgaio_worker_cancel_grow(void)
+{
+	if (!io_worker_control->grow)
+		return;
+
+	io_worker_control->grow = false;
+	pg_memory_barrier();
+}
+
+/*
+ * Called by the postmaster to check if a new worker has been requested (but
+ * possibly canceled since).
+ */
+bool
+pgaio_worker_pm_test_grow_signal_sent(void)
+{
+	pg_memory_barrier();
+	return io_worker_control && io_worker_control->grow_signal_sent;
+}
+
+/*
+ * Called by the postmaster to check if a new worker has been requested and
+ * not canceled since.
+ */
+bool
+pgaio_worker_pm_test_grow(void)
+{
+	pg_memory_barrier();
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the request for a new worker.
+ */
+void
+pgaio_worker_pm_clear_grow_signal_sent(void)
+{
+	if (!io_worker_control)
+		return;
+
+	io_worker_control->grow = false;
+	io_worker_control->grow_signal_sent = false;
+	pg_memory_barrier();
 }
 
 static int
-pgaio_worker_choose_idle(void)
+pgaio_worker_choose_idle(int only_workers_above)
 {
+	PgAioWorkerSet workerset;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	workerset = io_worker_control->idle_workerset;
+	if (only_workers_above >= 0)
+		pgaio_workerset_remove_lte(&workerset, only_workers_above);
+	if (pgaio_workerset_is_empty(&workerset))
 		return -1;
 
-	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	/* Find the lowest numbered idle worker and mark it not idle. */
+	worker = pgaio_workerset_get_lowest(&workerset);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_workerset.  That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
+/*
+ * Try to wake a set of workers.  Used on pool change, to close races
+ * described in the callers.
+ */
+static void
+pgaio_workerset_wake(PgAioWorkerSet workerset)
+{
+	while (!pgaio_workerset_is_empty(&workerset))
+		pgaio_worker_wake(pgaio_workerset_pop_lowest(&workerset));
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -185,6 +438,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	int			result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return -1;				/* empty */
@@ -201,6 +456,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -226,8 +483,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -251,20 +507,14 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 				break;
 			}
-
-			if (wakeup == NULL)
-			{
-				/* Choose an idle worker to wake up if we haven't already. */
-				worker = pgaio_worker_choose_idle();
-				if (worker >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
-
-				pgaio_debug_io(DEBUG4, staged_ios[i],
-							   "choosing worker %d",
-							   worker);
-			}
 		}
+		/* Choose one worker to wake for this batch. */
+		worker = pgaio_worker_choose_idle(-1);
 		LWLockRelease(AioWorkerSubmissionQueueLock);
+
+		/* Wake up chosen worker.  It will wake peers if necessary. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
 	}
 	else
 	{
@@ -273,9 +523,6 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		nsync = num_staged_ios;
 	}
 
-	if (wakeup)
-		SetLatch(wakeup);
-
 	/* Run whatever is left synchronously. */
 	if (nsync > 0)
 	{
@@ -295,14 +542,30 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_workerset_contains(&io_worker_control->workerset, MyIoWorkerId));
+	pgaio_workerset_remove(&io_worker_control->workerset, MyIoWorkerId);
+	notify_set = io_worker_control->workerset;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/*
+	 * Notify other workers on pool change.  This allows the new highest
+	 * worker to know that it is now the one that can time out, and closes a
+	 * wakeup-loss race described in pgaio_worker_wake().
+	 */
+	pgaio_workerset_wake(notify_set);
 }
 
 /*
@@ -312,33 +575,38 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_workerset;
+	PgAioWorkerSet old_workerset;
+
 	MyIoWorkerId = -1;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	/* Find lowest unused worker ID. */
+	pgaio_workerset_all(&free_workerset);
+	pgaio_workerset_subtract(&free_workerset, &io_worker_control->workerset);
+	if (!pgaio_workerset_is_empty(&free_workerset))
+		MyIoWorkerId = pgaio_workerset_get_lowest(&free_workerset);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
-	}
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
 
-	if (MyIoWorkerId == -1)
-		elog(ERROR, "couldn't find a free worker slot");
+	old_workerset = io_worker_control->workerset;
+	Assert(!pgaio_workerset_contains(&old_workerset, MyIoWorkerId));
+	pgaio_workerset_insert(&io_worker_control->workerset, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(io_worker_control->nworkers <= MAX_IO_WORKERS);
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	/*
+	 * Notify other workers on pool change.  If we are now the highest worker,
+	 * the previous highest worker learns that it can no longer time out.
+	 */
+	pgaio_workerset_wake(old_workerset);
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -364,14 +632,48 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes, making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	PgAioWorkerSet workerset;
+
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	workerset = io_worker_control->workerset;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_workerset_get_highest(&workerset))
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			hist_ios = 0;
+	int			hist_wakeups = 0;
 
 	AuxiliaryProcessMainCommon();
 
@@ -439,10 +741,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		maybe_grow = false;
 
 		/*
 		 * Try to get a job to do.
@@ -453,38 +754,107 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_workerset_insert(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_workerset_remove(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do that
+			 * if this worker is not receiving spurious wakeups itself.  The
+			 * intention is to create a frontier beyond which idle workers
+			 * stay asleep.
+			 *
+			 * This heuristic tries to discover the useful wakeup propagation
+			 * chain length when IOs are very fast and workers wake up to find
+			 * that all IOs have already been taken.
+			 *
+			 * If we chose not to wake a worker when we ideally should have,
+			 * then ios will soon exceed wakeups.
+			 */
+			if (hist_wakeups <= hist_ios)
 			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
+				queue_depth = pgaio_worker_submission_queue_depth();
+				if (queue_depth > 0)
+				{
+					/* Choose a worker higher than me to wake. */
+					worker = pgaio_worker_choose_idle(MyIoWorkerId);
+					if (worker == -1)
+						maybe_grow = true;
+				}
 			}
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+		{
+			pgaio_worker_wake(worker);
+		}
+		else if (maybe_grow)
+		{
+			/*
+			 * We know there was at least one more item in the queue, and we
+			 * failed to find a higher-numbered idle worker to wake.  Now we
+			 * decide if we should try to start one more worker.
+			 *
+			 * We do this with a simple heuristic: is the queue depth greater
+			 * than the current number of workers?
+			 *
+			 * Consider the following situations:
+			 *
+			 * 1. The queue depth is constantly increasing, because IOs are
+			 * arriving faster than they can possibly be serviced.  It doesn't
+			 * matter much which threshold we choose, as we will surely hit
+			 * it.  Crossing the current worker count is a useful signal
+			 * because it's clearly too deep to avoid queuing latency already,
+			 * but still leaves a small window of opportunity to improve the
+			 * situation before the queue overflows.
+			 *
+			 * 2. The worker pool is keeping up, no latency is being
+			 * introduced and an extra worker would be a waste of resources.
+			 * Queue depth distributions tend to be heavily skewed, with long
+			 * tails of low probability spikes (due to submission clustering,
+			 * scheduling, jitter, stalls, noisy neighbors, etc).  We want a
+			 * number that is very unlikely to be triggered by an outlier, and
+			 * we bet that an exponential or similar distribution whose
+			 * outliers never reach this threshold must be almost entirely
+			 * concentrated at the low end.  If we do see a spike as big as
+			 * the worker count, we take it as a signal that the distribution
+			 * is surely too wide.
+			 *
+			 * On its own, this is an extremely crude signal.  When combined
+			 * with the wakeup propagation test that precedes it (but on its
+			 * own tends to overshoot) and io_worker_launch_interval, we
+			 * gradually try each pool size until we find one that doesn't
+			 * trigger further growth.
+			 *
+			 * XXX Perhaps ideas from queueing theory or control theory could
+			 * do a much better job of this.
+			 */
+
+			/* Read nworkers without lock for this heuristic purpose. */
+			if (queue_depth > io_worker_control->nworkers)
+				pgaio_worker_request_grow();
+		}
 
 		if (io_index != -1)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++hist_ios == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+			{
+				hist_wakeups /= 2;
+				hist_ios /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -537,6 +907,19 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			}
 #endif
 
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			{
+				char	   *description = pgaio_io_get_target_description(ioh);
+
+				sprintf(cmd, "%d: [%s] %s",
+						MyIoWorkerId,
+						pgaio_io_get_op_name(ioh),
+						description);
+				pfree(description);
+				set_ps_display(cmd);
+			}
+#endif
+
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -550,8 +933,76 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker request if pending. */
+			pgaio_worker_cancel_grow();
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* Only the highest-numbered worker can time out. */
+				if (pgaio_worker_can_timeout())
+				{
+					if (idle_timeout_abs == 0)
+					{
+						/*
+						 * I have just been promoted to the timeout worker, or
+						 * the GUC changed.  Compute new absolute time from
+						 * now.
+						 */
+						idle_timeout_abs =
+							TimestampTzPlusMilliseconds(now,
+														io_worker_idle_timeout);
+						timeout_guc_used = io_worker_idle_timeout;
+					}
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now, idle_timeout_abs);
+				}
+				else
+				{
+					/* No timeout for me. */
+					idle_timeout_abs = 0;
+					timeout_ms = -1;
+				}
+			}
+
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: idle, wakeups:ios = %d:%d",
+					MyIoWorkerId, hist_wakeups, hist_ios);
+			set_ps_display(cmd);
+#endif
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++hist_wakeups == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+				{
+					hist_wakeups /= 2;
+					hist_ios /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -561,6 +1012,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7bda5298558..560659f9568 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 LogicalDecodingControl	"Waiting to read or update logical decoding status information."
 DataChecksumsWorker	"Waiting for data checksums worker."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 86c1eba5dab..83af594d4af 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1390,6 +1390,14 @@
   check_hook => 'check_io_max_concurrency',
 },
 
+{ name => 'io_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_max_workers',
+  boot_val => '8',
+  min => '1',
+  max => 'MAX_IO_WORKERS',
+},
+
 { name => 'io_method', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_IO',
   short_desc => 'Selects the method for executing asynchronous I/O.',
   variable => 'io_method',
@@ -1398,14 +1406,32 @@
   assign_hook => 'assign_io_method',
 },
 
-{ name => 'io_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
-  short_desc => 'Number of IO worker processes, for io_method=worker.',
-  variable => 'io_workers',
-  boot_val => '3',
+{ name => 'io_min_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_min_workers',
+  boot_val => '2',
   min => '1',
   max => 'MAX_IO_WORKERS',
 },
 
+{ name => 'io_worker_idle_timeout', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum time before idle I/O worker processes time out, for io_method=worker.',
+  variable => 'io_worker_idle_timeout',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '60000',
+  min => '0',
+  max => 'INT_MAX',
+},
+
+{ name => 'io_worker_launch_interval', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum time before launching a new I/O worker process, for io_method=worker.',
+  variable => 'io_worker_launch_interval',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX',
+},
+
 # Not for general use --- used by SET SESSION AUTHORIZATION and SET
 # ROLE
 { name => 'is_superuser', type => 'bool', context => 'PGC_INTERNAL', group => 'UNGROUPED',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f2bbf05295..5e1e49f0ae8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -222,7 +222,11 @@
                                         # can execute simultaneously
                                         # -1 sets based on shared_buffers
                                         # (change requires restart)
-#io_workers = 3                         # 1-32;
+
+#io_min_workers = 2                     # 1-32 (change requires pg_reload_conf())
+#io_max_workers = 8                     # 1-32
+#io_worker_idle_timeout = 60s
+#io_worker_launch_interval = 100ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index f7d5998a138..c852c9f3741 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,15 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+/* Public GUCs. */
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_pm_test_grow_signal_sent(void);
+extern void pgaio_worker_pm_clear_grow_signal_sent(void);
+extern bool pgaio_worker_pm_test_grow(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index af8553bcb6c..d7eb648bd27 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -88,6 +88,7 @@ PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
 PG_LWLOCK(55, LogicalDecodingControl)
 PG_LWLOCK(56, DataChecksumsWorker)
+PG_LWLOCK(57, AioWorkerControl)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 001e6eea61c..bcce4011790 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_GROW,	/* I/O worker pool wants to grow */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index 34bc132ea08..b9775811d4d 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,24 +65,24 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		like(
 			$stderr,
-			qr/$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			qr/$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating number of io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2dfe1b38826..3dea516912c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2271,6 +2271,7 @@ PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgAioWorkerControl
+PgAioWorkerSet
 PgAioWorkerSlot
 PgAioWorkerSubmissionQueue
 PgArchData
-- 
2.53.0




* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 10:39                   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-07 19:01                     ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 23:18                       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-08 00:30                         ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-08 02:09                           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-08 02:24                             ` Thomas Munro <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: Thomas Munro @ 2026-04-08 02:24 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 2:09 PM Thomas Munro <[email protected]> wrote:
> > Seems like there should be two fields. One saying "notify postmaster again"
> > and one "postmaster start a worker".  The former would only be cleared by
> > postmaster after the timeout.
>
> Good idea.  V7 has two tweaks:
>
> * separate grow and grow_signal_sent flags, as you suggested
> * it also applies the io_worker_launch_interval to cancelled grow requests

Oh, but that logic should of course be moved below the "time in the
past" check.  Will do...
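
For readers following along, the two-flag idea quoted above can be sketched
roughly as follows.  This is only an illustration under stated assumptions,
not the code in the patch: the struct layout, the function names and the
signals_sent counter (a stand-in for SendPostmasterSignal()) are all
hypothetical, and the memory barriers a real shared-memory version would
need are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared state; only the two flag names come from the email. */
typedef struct GrowState
{
	bool		grow;				/* "postmaster, start a worker" */
	bool		grow_signal_sent;	/* "postmaster was already notified" */
	int			signals_sent;		/* stand-in for SendPostmasterSignal() */
} GrowState;

/* Worker side: request growth, signalling only if no signal is pending. */
static void
request_grow(GrowState *s)
{
	s->grow = true;
	if (!s->grow_signal_sent)
	{
		s->grow_signal_sent = true;
		s->signals_sent++;
	}
}

/* Worker side: the queue drained, so withdraw the request quietly. */
static void
cancel_grow(GrowState *s)
{
	s->grow = false;
}

/*
 * Postmaster side, assumed to run once the launch interval has elapsed.
 * Returns true if a worker should be started, and re-arms signalling.
 */
static bool
postmaster_launch_due(GrowState *s)
{
	bool		start = s->grow;

	s->grow = false;
	s->grow_signal_sent = false;
	return start;
}
```

In this sketch a request that is cancelled and then repeated inside the same
interval does not signal the postmaster a second time, because only the
postmaster clears grow_signal_sent.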






* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-08-04 05:30           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-03-28 09:31             ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 15:02               ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-06 18:14                 ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 10:39                   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2026-04-07 19:01                     ` Re: Automatically sizing the IO worker pool Andres Freund <[email protected]>
  2026-04-07 23:18                       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-08 00:30                         ` Thomas Munro <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: Thomas Munro @ 2026-04-08 00:30 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Dmitry Dolgov <[email protected]>; PostgreSQL Hackers <[email protected]>

I changed pgaio_worker_request_grow() not to bother the postmaster
unless nworkers < io_max_workers.

I moved the code you wanted outside the loop, like so:

        /* Choose one worker to wake for this batch. */
        if (nsync < num_staged_ios)
            worker = pgaio_worker_choose_idle(-1);
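
As a rough standalone illustration of what choosing an idle worker involves,
here is a sketch that picks the lowest-numbered idle worker from a bitmap,
optionally restricted to workers numbered above a given one, and marks it
busy.  The function name, MAX_WORKERS and the plain uint64_t set are
assumptions made for this example; the patch uses its own PgAioWorkerSet
helpers and holds a lock around the lookup.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WORKERS 32			/* assumed; matches the documented 1-32 range */

/*
 * Return the lowest-numbered idle worker above only_workers_above (or any
 * idle worker if that is -1) and clear its idle bit.  Returns -1 if no
 * suitable worker is idle, leaving the set unchanged.
 */
static int
choose_idle(uint64_t *idle_set, int only_workers_above)
{
	uint64_t	candidates = *idle_set;

	assert(only_workers_above >= -1 && only_workers_above < MAX_WORKERS);

	/* Mask off worker numbers <= only_workers_above. */
	if (only_workers_above >= 0)
		candidates &= ~UINT64_C(0) << (only_workers_above + 1);
	if (candidates == 0)
		return -1;

	/* Linear scan for the lowest set bit; fine for at most 32 workers. */
	for (int w = 0; w < MAX_WORKERS; w++)
	{
		if (candidates & (UINT64_C(1) << w))
		{
			*idle_set &= ~(UINT64_C(1) << w);	/* mark it not idle */
			return w;
		}
	}
	return -1;
}
```

Always preferring the lowest number concentrates work in low-numbered
workers, which is what lets the highest-numbered ones go idle and time out.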

I took your suggestion for the names hist_wakeups and hist_ios.

For the location of the following line, I preferred not to separate
the pre-existing tests of StartWorkerNeeded and HaveCrashedWorker,
since they belong together as bgworker concerns.

    next_wakeup = maybe_start_io_workers_scheduled_at();

I think I've run out of reasons not to commit this, unless your
pondering of the grow-trigger heuristics revealed a problem?
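
To make the launch-interval arithmetic easier to follow, here is a minimal
sketch of how the next permitted launch time could be computed: advance from
the previously scheduled time, and fall back to "now plus interval" if that
would already be in the past.  The TimestampUs type and the function name are
stand-ins invented for this example (the patch works with TimestampTz and
TimestampTzPlusMilliseconds()).

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t TimestampUs;	/* microseconds; stand-in for TimestampTz */

/*
 * Compute the next time a worker may be launched.  Stepping from the
 * previous scheduled time keeps the interval steady even if the caller was
 * busy; if that lands in the past, restart the schedule from "now".
 */
static TimestampUs
next_launch_time(TimestampUs prev_next, TimestampUs now, int interval_ms)
{
	TimestampUs interval_us = (TimestampUs) interval_ms * 1000;
	TimestampUs next;

	assert(interval_ms >= 0);

	next = prev_next + interval_us;
	if (next <= now)
		next = now + interval_us;
	return next;
}
```

With a 100ms interval, a launch handled 50ms late still schedules the next
one 100ms after the previous slot, while a long quiet period resets the
schedule relative to the current time.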


Attachments:

  [text/x-patch] v6-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch (45.6K, 2-v6-0001-aio-Adjust-I-O-worker-pool-size-automatically.patch)
From 09676d6115c82bb11bfc35a41d370889870cd809 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v6] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3.  It
was hard to know how to tune it effectively.  It is now replaced with:

  io_min_workers=2
  io_max_workers=8 (up to 32)
  io_worker_idle_timeout=60s
  io_worker_launch_interval=100ms

The pool is automatically sized within the configured range according to
recent variation in demand.  It grows when existing workers detect a
backlog, and shrinks when the highest numbered worker is idle for too
long.  Work was already concentrated into low-numbered workers in
anticipation of this logic.

The logic for waking extra workers now also tries to measure and reduce
the number of spurious wakeups, though they are not entirely eliminated.

Reviewed-by: Dmitry Dolgov <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  69 +-
 src/backend/postmaster/postmaster.c           | 166 +++--
 src/backend/storage/aio/method_worker.c       | 601 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/misc/guc_parameters.dat     |  34 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +-
 src/include/storage/io_worker.h               |  10 +-
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pmsignal.h                |   1 +
 src/test/modules/test_aio/t/002_io_workers.pl |  15 +-
 src/tools/pgindent/typedefs.list              |   1 +
 11 files changed, 759 insertions(+), 146 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2c4106ee9ab..1c8b8e7f3e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2942,16 +2942,75 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
-      <varlistentry id="guc-io-workers" xreflabel="io_workers">
-       <term><varname>io_workers</varname> (<type>integer</type>)
+      <varlistentry id="guc-io-min-workers" xreflabel="io_min_workers">
+       <term><varname>io_min_workers</varname> (<type>integer</type>)
        <indexterm>
-        <primary><varname>io_workers</varname> configuration parameter</primary>
+        <primary><varname>io_min_workers</varname> configuration parameter</primary>
        </indexterm>
        </term>
        <listitem>
         <para>
-         Selects the number of I/O worker processes to use. The default is
-         3. This parameter can only be set in the
+         Sets the minimum number of I/O worker processes. The default is
+         2. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-max-workers" xreflabel="io_max_workers">
+       <term><varname>io_max_workers</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_max_workers</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of I/O worker processes. The default is
+         8. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-idle-timeout" xreflabel="io_worker_idle_timeout">
+       <term><varname>io_worker_idle_timeout</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_idle_timeout</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the time after which entirely idle I/O worker processes exit, reducing the
+         size of the pool to match demand.  The default is 1 minute.  This
+         parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.
+        </para>
+        <para>
+         Only has an effect if <xref linkend="guc-io-method"/> is set to
+         <literal>worker</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+      <varlistentry id="guc-io-worker-launch-interval" xreflabel="io_worker_launch_interval">
+       <term><varname>io_worker_launch_interval</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_worker_launch_interval</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the minimum time before another I/O worker can be launched.  This avoids
+         creating too many for an unsustained burst of activity.  The default is 100ms.
+         This parameter can only be set in the
          <filename>postgresql.conf</filename> file or on the server command
          line.
         </para>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ae829747004..f7d53b02ea3 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,6 +409,7 @@ static DNSServiceRef bonjour_sdref = NULL;
 #endif
 
 /* State for IO worker management. */
+static TimestampTz io_worker_launch_next_time = 0;
 static int	io_worker_count = 0;
 static PMChild *io_worker_children[MAX_IO_WORKERS];
 
@@ -447,7 +448,8 @@ static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
 static bool maybe_reap_io_worker(int pid);
-static void maybe_adjust_io_workers(void);
+static void maybe_start_io_workers(void);
+static TimestampTz maybe_start_io_workers_scheduled_at(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
@@ -1391,7 +1393,7 @@ PostmasterMain(int argc, char *argv[])
 	UpdatePMState(PM_STARTUP);
 
 	/* Make sure we can perform I/O while starting up. */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/* Start bgwriter and checkpointer so they can help with recovery */
 	if (CheckpointerPMChild == NULL)
@@ -1555,14 +1557,15 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	TimestampTz next_wakeup;
 
 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * If in ImmediateShutdown with a SIGKILL timeout, ignore everything else
+	 * and wait for that.
+	 *
+	 * XXX Shouldn't this also test FatalError?
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown >= ImmediateShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1582,14 +1585,16 @@ DetermineSleepTime(void)
 
 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
+	/* Time of next maybe_start_io_workers() call, or 0 for none. */
+	next_wakeup = maybe_start_io_workers_scheduled_at();
+
+	/* Ignore bgworkers during shutdown. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
 
-	if (HaveCrashedWorker)
+	if (HaveCrashedWorker && Shutdown == NoShutdown)
 	{
 		dlist_mutable_iter iter;
 
@@ -2545,7 +2550,17 @@ process_pm_child_exit(void)
 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 				HandleChildCrash(pid, exitstatus, _("io worker"));
 
-			maybe_adjust_io_workers();
+			/*
+			 * A worker that exited with an error might have brought the pool
+			 * size below io_min_workers, or allowed the queue to grow to the
+			 * point where another worker called for growth.
+			 *
+			 * In the common case that a worker timed out due to idleness, no
+			 * replacement needs to be started.  maybe_start_io_workers() will
+			 * figure that out.
+			 */
+			maybe_start_io_workers();
+
 			continue;
 		}
 
@@ -3265,7 +3280,7 @@ PostmasterStateMachine(void)
 		UpdatePMState(PM_STARTUP);
 
 		/* Make sure we can perform I/O while starting up. */
-		maybe_adjust_io_workers();
+		maybe_start_io_workers();
 
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
@@ -3339,7 +3354,7 @@ LaunchMissingBackgroundProcesses(void)
 	 * A config file change will always lead to this function being called, so
 	 * we always will process the config change in a timely manner.
 	 */
-	maybe_adjust_io_workers();
+	maybe_start_io_workers();
 
 	/*
 	 * The checkpointer and the background writer are active from the start,
@@ -3800,6 +3815,16 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions.  This signal is received on potentially actionable level
+		 * changes, so that maybe_start_io_workers() will run.
+		 */
+	}
+
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -4402,44 +4427,104 @@ maybe_reap_io_worker(int pid)
 }
 
 /*
- * Start or stop IO workers, to close the gap between the number of running
- * workers and the number of configured workers.  Used to respond to change of
- * the io_workers GUC (by increasing and decreasing the number of workers), as
- * well as workers terminating in response to errors (by starting
- * "replacement" workers).
+ * Returns the next time at which maybe_start_io_workers() would start one or
+ * more I/O workers.  Any time in the past means ASAP, and 0 means no worker
+ * is currently scheduled.
+ *
+ * This is called by DetermineSleepTime() and also maybe_start_io_workers()
+ * itself, to make sure that they agree.
  */
-static void
-maybe_adjust_io_workers(void)
+static TimestampTz
+maybe_start_io_workers_scheduled_at(void)
 {
 	if (!pgaio_workers_enabled())
-		return;
+		return 0;
 
 	/*
 	 * If we're in final shutting down state, then we're just waiting for all
 	 * processes to exit.
 	 */
 	if (pmState >= PM_WAIT_IO_WORKERS)
-		return;
+		return 0;
 
 	/* Don't start new workers during an immediate shutdown either. */
 	if (Shutdown >= ImmediateShutdown)
-		return;
+		return 0;
 
 	/*
 	 * Don't start new workers if we're in the shutdown phase of a crash
 	 * restart. But we *do* need to start if we're already starting up again.
 	 */
 	if (FatalError && pmState >= PM_STOP_BACKENDS)
-		return;
+		return 0;
+
+	/*
+	 * Don't start a worker if we're at or above the maximum.  (Excess workers
+	 * exit when the GUC is lowered, but the count can be temporarily too high
+	 * until they are reaped.)
+	 */
+	if (io_worker_count >= io_max_workers)
+		return 0;
+
+	/* If we're under the minimum, start a worker as soon as possible. */
+	if (io_worker_count < io_min_workers)
+		return TIMESTAMP_MINUS_INFINITY;	/* start worker ASAP */
+
+	/* Only proceed if a "grow" request is pending from existing workers. */
+	if (!pgaio_worker_pm_test_grow())
+		return 0;
 
-	Assert(pmState < PM_WAIT_IO_WORKERS);
+	/*
+	 * maybe_start_io_workers() should start a new I/O worker after this time,
+	 * or as soon as possible if it is already in the past.
+	 */
+	return io_worker_launch_next_time;
+}
+
+/*
+ * Start I/O workers if required.  Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+	TimestampTz scheduled_at;
 
-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
 	{
+		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
 		int			i;
 
+		Assert(pmState < PM_WAIT_IO_WORKERS);
+
+		/* Still waiting for the scheduled time? */
+		if (scheduled_at > now)
+			break;
+
+		/* Clear the grow request flag if it is set. */
+		pgaio_worker_pm_clear_grow();
+
+		/*
+		 * Compute next launch time relative to the previous value, so that
+		 * time spent on the postmaster's other duties doesn't result in an
+		 * inaccurate launch interval.
+		 */
+		io_worker_launch_next_time =
+			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+										io_worker_launch_interval);
+
+		/*
+		 * If that's already in the past, the interval is either impossibly
+		 * short or we received no requests for new workers for a period.
+		 * Compute a new future time relative to the last launch time instead.
+		 */
+		if (io_worker_launch_next_time <= now)
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(now, io_worker_launch_interval);
+
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4457,22 +4542,21 @@ maybe_adjust_io_workers(void)
 			++io_worker_count;
 		}
 		else
-			break;				/* try again next time */
-	}
-
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
+			/*
+			 * Fork failure: we'll try again after the launch interval
+			 * expires, or be called again without delay if we don't yet have
+			 * io_min_workers.  Don't loop here though, the postmaster has
+			 * other duties.
+			 */
+			break;
 		}
 	}
+
+	/*
+	 * Workers decide when to shut down by themselves, according to the
+	 * io_max_workers and io_worker_idle_timeout GUCs.
+	 */
 }
 
 
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index eb686cede1a..2ba823aba94 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -11,9 +11,8 @@
  * infrastructure for reopening the file, and must processed synchronously by
  * the client code when submitted.
  *
- * So that the submitter can make just one system call when submitting a batch
- * of IOs, wakeups "fan out"; each woken IO worker can wake two more. XXX This
- * could be improved by using futexes instead of latches to wake N waiters.
+ * The pool of workers tries to stabilize at a size that can handle recently
+ * seen variation in demand, within the configured limits.
  *
  * This method of AIO is available in all builds on all operating systems, and
  * is the default.
@@ -29,6 +28,8 @@
 
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
@@ -40,6 +41,8 @@
 #include "storage/io_worker.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
 #include "tcop/tcopprot.h"
@@ -48,10 +51,22 @@
 #include "utils/ps_status.h"
 #include "utils/wait_event.h"
 
+/*
+ * Saturation for counters used to estimate wakeup:IO ratio.
+ *
+ * We maintain hist_wakeups for wakeups received and hist_ios for IOs
+ * processed by each worker.  When either counter reaches this saturation
+ * value, we divide both by two.  The result is an exponentially decaying
+ * ratio of wakeups to IOs, with a very short memory.
+ *
+ * If a worker is itself experiencing useless wakeups, it assumes that
+ * higher-numbered workers would experience even more, so it should end the
+ * chain.
+ */
+#define PGAIO_WORKER_WAKEUP_RATIO_SATURATE 4
 
-/* How many workers should each worker wake up if needed? */
-#define IO_WORKER_WAKEUP_FANOUT 2
-
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */
 
 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,13 +78,34 @@ typedef struct PgAioWorkerSubmissionQueue
 
 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;
 
+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction.  If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+
+#define PGAIO_WORKERSET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKERSET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	bool		grow;
+
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_workerset;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet workerset;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;
 
@@ -91,15 +127,108 @@ const IoMethodOps pgaio_worker_ops = {
 
 
 /* GUCs */
-int			io_workers = 3;
+int			io_min_workers = 2;
+int			io_max_workers = 8;
+int			io_worker_idle_timeout = 60000;
+int			io_worker_launch_interval = 100;
 
 
 static int	io_worker_queue_size = 64;
-static int	MyIoWorkerId;
+static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
 
+static void
+pgaio_workerset_initialize(PgAioWorkerSet *set)
+{
+	*set = 0;
+}
+
+static bool
+pgaio_workerset_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_workerset_singleton(int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return UINT64_C(1) << worker;
+}
+
+static void
+pgaio_workerset_all(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKERSET_BITS - MAX_IO_WORKERS);
+}
+
+static void
+pgaio_workerset_subtract(PgAioWorkerSet *set1, const PgAioWorkerSet *set2)
+{
+	*set1 &= ~*set2;
+}
+
+static void
+pgaio_workerset_insert(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set |= pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= ~pgaio_workerset_singleton(worker);
+}
+
+static void
+pgaio_workerset_remove_lte(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	*set &= (~(PgAioWorkerSet) 0) << (worker + 1);
+}
+
+static int
+pgaio_workerset_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_get_lowest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_workerset_is_empty(set));
+	return pg_rightmost_one_pos64(*set);
+}
+
+static int
+pgaio_workerset_pop_lowest(PgAioWorkerSet *set)
+{
+	int			worker = pgaio_workerset_get_lowest(set);
+
+	pgaio_workerset_remove(set, worker);
+	return worker;
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+pgaio_workerset_contains(PgAioWorkerSet *set, int worker)
+{
+	Assert(worker >= 0 && worker < MAX_IO_WORKERS);
+	return (*set & pgaio_workerset_singleton(worker)) != 0;
+}
+
+static int
+pgaio_workerset_count(PgAioWorkerSet *set)
+{
+	return pg_popcount64(*set);
+}
+#endif
+
 static void
 pgaio_worker_shmem_request(void *arg)
 {
@@ -133,37 +262,131 @@ pgaio_worker_shmem_init(void *arg)
 	io_worker_submission_queue->size = queue_size;
 	io_worker_submission_queue->head = 0;
 	io_worker_submission_queue->tail = 0;
+	io_worker_control->grow = false;
+	pgaio_workerset_initialize(&io_worker_control->workerset);
+	pgaio_workerset_initialize(&io_worker_control->idle_workerset);
 
-	io_worker_control->idle_worker_mask = 0;
 	for (int i = 0; i < MAX_IO_WORKERS; ++i)
+		io_worker_control->workers[i].proc_number = INVALID_PROC_NUMBER;
+}
+
+/*
+ * Tell postmaster that we think a new worker is needed.
+ */
+static void
+pgaio_worker_request_grow(void)
+{
+	if (!io_worker_control->grow)
 	{
-		io_worker_control->workers[i].latch = NULL;
-		io_worker_control->workers[i].in_use = false;
+		/*
+		 * Suppress useless signaling if we already know that we're at the
+		 * maximum.  This uses an unlocked read of nworkers, but that's OK for
+		 * this heuristic purpose.
+		 */
+		if (io_worker_control->nworkers < io_max_workers)
+		{
+			io_worker_control->grow = true;
+			pg_memory_barrier();
+			SendPostmasterSignal(PMSIGNAL_IO_WORKER_GROW);
+		}
+	}
+}
+
+/*
+ * Cancel any request for a new worker, after observing an empty queue.
+ */
+static void
+pgaio_worker_cancel_grow(void)
+{
+	if (io_worker_control->grow)
+	{
+		io_worker_control->grow = false;
+		pg_memory_barrier();
 	}
 }
 
+/*
+ * Called by the postmaster to check if a new worker is requested.
+ */
+bool
+pgaio_worker_pm_test_grow(void)
+{
+	pg_memory_barrier();
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the request for a new worker.
+ */
+void
+pgaio_worker_pm_clear_grow(void)
+{
+	if (io_worker_control)
+		io_worker_control->grow = false;
+	pg_memory_barrier();
+}
+
 static int
-pgaio_worker_choose_idle(void)
+pgaio_worker_choose_idle(int only_workers_above)
 {
+	PgAioWorkerSet workerset;
 	int			worker;
 
-	if (io_worker_control->idle_worker_mask == 0)
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
+	workerset = io_worker_control->idle_workerset;
+	if (only_workers_above >= 0)
+		pgaio_workerset_remove_lte(&workerset, only_workers_above);
+	if (pgaio_workerset_is_empty(&workerset))
 		return -1;
 
-	/* Find the lowest bit position, and clear it. */
-	worker = pg_rightmost_one_pos64(io_worker_control->idle_worker_mask);
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << worker);
-	Assert(io_worker_control->workers[worker].in_use);
+	/* Find the lowest numbered idle worker and mark it not idle. */
+	worker = pgaio_workerset_get_lowest(&workerset);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, worker);
 
 	return worker;
 }
 
+/*
+ * Try to wake a worker by setting its latch, to tell it there are IOs to
+ * process in the submission queue.
+ */
+static void
+pgaio_worker_wake(int worker)
+{
+	ProcNumber	proc_number;
+
+	/*
+	 * If the selected worker is concurrently exiting, then pgaio_worker_die()
+	 * had not yet removed it as of when we saw it in idle_workerset.  That's
+	 * OK, because it will wake all remaining workers to close wakeup-vs-exit
+	 * races: *someone* will see the queued IO.  If there are no workers
+	 * running, the postmaster will start a new one.
+	 */
+	proc_number = io_worker_control->workers[worker].proc_number;
+	if (proc_number != INVALID_PROC_NUMBER)
+		SetLatch(&GetPGProcByNumber(proc_number)->procLatch);
+}
+
+/*
+ * Try to wake a set of workers.  Used on pool change, to close races
+ * described in the callers.
+ */
+static void
+pgaio_workerset_wake(PgAioWorkerSet workerset)
+{
+	while (!pgaio_workerset_is_empty(&workerset))
+		pgaio_worker_wake(pgaio_workerset_pop_lowest(&workerset));
+}
+
 static bool
 pgaio_worker_submission_queue_insert(PgAioHandle *ioh)
 {
 	PgAioWorkerSubmissionQueue *queue;
 	uint32		new_head;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	new_head = (queue->head + 1) & (queue->size - 1);
 	if (new_head == queue->tail)
@@ -185,6 +408,8 @@ pgaio_worker_submission_queue_consume(void)
 	PgAioWorkerSubmissionQueue *queue;
 	int			result;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	queue = io_worker_submission_queue;
 	if (queue->tail == queue->head)
 		return -1;				/* empty */
@@ -201,6 +426,8 @@ pgaio_worker_submission_queue_depth(void)
 	uint32		head;
 	uint32		tail;
 
+	Assert(LWLockHeldByMeInMode(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE));
+
 	head = io_worker_submission_queue->head;
 	tail = io_worker_submission_queue->tail;
 
@@ -226,8 +453,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;
 
 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
 
@@ -251,20 +477,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 
 				break;
 			}
-
-			if (wakeup == NULL)
-			{
-				/* Choose an idle worker to wake up if we haven't already. */
-				worker = pgaio_worker_choose_idle();
-				if (worker >= 0)
-					wakeup = io_worker_control->workers[worker].latch;
-
-				pgaio_debug_io(DEBUG4, staged_ios[i],
-							   "choosing worker %d",
-							   worker);
-			}
 		}
+		/* Choose one worker to wake for this batch. */
+		if (nsync < num_staged_ios)
+			worker = pgaio_worker_choose_idle(-1);
 		LWLockRelease(AioWorkerSubmissionQueueLock);
+
+		/* Wake up chosen worker.  It will wake peers if necessary. */
+		if (worker != -1)
+			pgaio_worker_wake(worker);
 	}
 	else
 	{
@@ -273,9 +494,6 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		nsync = num_staged_ios;
 	}
 
-	if (wakeup)
-		SetLatch(wakeup);
-
 	/* Run whatever is left synchronously. */
 	if (nsync > 0)
 	{
@@ -295,14 +513,30 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;
 
-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_workerset_remove(&io_worker_control->idle_workerset, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_workerset_contains(&io_worker_control->workerset, MyIoWorkerId));
+	pgaio_workerset_remove(&io_worker_control->workerset, MyIoWorkerId);
+	notify_set = io_worker_control->workerset;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/*
+	 * Notify other workers on pool change.  This allows the new highest
+	 * worker to know that it is now the one that can time out, and closes a
+	 * wakeup-loss race described in pgaio_worker_wake().
+	 */
+	pgaio_workerset_wake(notify_set);
 }
 
 /*
@@ -312,33 +546,38 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_workerset;
+	PgAioWorkerSet old_workerset;
+
 	MyIoWorkerId = -1;
 
-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	/* Find lowest unused worker ID. */
+	pgaio_workerset_all(&free_workerset);
+	pgaio_workerset_subtract(&free_workerset, &io_worker_control->workerset);
+	if (!pgaio_workerset_is_empty(&free_workerset))
+		MyIoWorkerId = pgaio_workerset_get_lowest(&free_workerset);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
 
-	for (int i = 0; i < MAX_IO_WORKERS; ++i)
-	{
-		if (!io_worker_control->workers[i].in_use)
-		{
-			Assert(io_worker_control->workers[i].latch == NULL);
-			io_worker_control->workers[i].in_use = true;
-			MyIoWorkerId = i;
-			break;
-		}
-		else
-			Assert(io_worker_control->workers[i].latch != NULL);
-	}
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number ==
+		   INVALID_PROC_NUMBER);
+	io_worker_control->workers[MyIoWorkerId].proc_number = MyProcNumber;
 
-	if (MyIoWorkerId == -1)
-		elog(ERROR, "couldn't find a free worker slot");
+	old_workerset = io_worker_control->workerset;
+	Assert(!pgaio_workerset_contains(&old_workerset, MyIoWorkerId));
+	pgaio_workerset_insert(&io_worker_control->workerset, MyIoWorkerId);
+	io_worker_control->nworkers++;
+	Assert(io_worker_control->nworkers <= MAX_IO_WORKERS);
+	Assert(pgaio_workerset_count(&io_worker_control->workerset) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
 
-	io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].latch = MyLatch;
-	LWLockRelease(AioWorkerSubmissionQueueLock);
+	/*
+	 * Notify other workers on pool change.  If we were the highest worker,
+	 * this allows the new highest worker to know that it can time out.
+	 */
+	pgaio_workerset_wake(old_workerset);
 
 	on_shmem_exit(pgaio_worker_die, 0);
 }
@@ -364,14 +603,48 @@ pgaio_worker_error_callback(void *arg)
 	errcontext("I/O worker executing I/O on behalf of process %d", owner_pid);
 }
 
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
+ *
+ * The result is only instantaneously true and may be temporarily inconsistent
+ * in different workers around transitions, but all workers are woken up on
+ * pool size or GUC changes making the result eventually consistent.
+ */
+static bool
+pgaio_worker_can_timeout(void)
+{
+	PgAioWorkerSet workerset;
+
+	/* Serialize against pool size changes. */
+	LWLockAcquire(AioWorkerControlLock, LW_SHARED);
+	workerset = io_worker_control->workerset;
+	LWLockRelease(AioWorkerControlLock);
+
+	if (MyIoWorkerId != pgaio_workerset_get_highest(&workerset))
+		return false;
+
+	if (MyIoWorkerId < io_min_workers)
+		return false;
+
+	return true;
+}
+
 void
 IoWorkerMain(const void *startup_data, size_t startup_data_len)
 {
 	sigjmp_buf	local_sigjmp_buf;
+	TimestampTz idle_timeout_abs = 0;
+	int			timeout_guc_used = 0;
 	PgAioHandle *volatile error_ioh = NULL;
 	ErrorContextCallback errcallback = {0};
 	volatile int error_errno = 0;
 	char		cmd[128];
+	int			hist_ios = 0;
+	int			hist_wakeups = 0;
 
 	AuxiliaryProcessMainCommon();
 
@@ -439,10 +712,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		maybe_grow = false;
 
 		/*
 		 * Try to get a job to do.
@@ -453,38 +725,106 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do.  Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do.  Mark self idle. */
+			pgaio_workerset_insert(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one.  Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_workerset_remove(&io_worker_control->idle_workerset,
+								   MyIoWorkerId);
 
-			/* See if we can wake up some peers. */
-			nwakeups = Min(pgaio_worker_submission_queue_depth(),
-						   IO_WORKER_WAKEUP_FANOUT);
-			for (int i = 0; i < nwakeups; ++i)
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do that
+			 * if this worker is not receiving spurious wakeups itself.  The
+			 * intention is create a frontier beyond which idle workers stay
+			 * asleep.
+			 *
+			 * This heuristic tries to discover the useful wakeup propagation
+			 * chain length when IOs are very fast and workers wake up to find
+			 * that all IOs have already been taken.
+			 *
+			 * If we chose not to wake a worker when we ideally should have,
+			 * then ios will soon exceed wakeups.
+			 */
+			if (hist_wakeups <= hist_ios)
 			{
-				if ((worker = pgaio_worker_choose_idle()) < 0)
-					break;
-				latches[nlatches++] = io_worker_control->workers[worker].latch;
+				queue_depth = pgaio_worker_submission_queue_depth();
+				if (queue_depth > 0)
+				{
+					/* Choose a worker higher than me to wake. */
+					worker = pgaio_worker_choose_idle(MyIoWorkerId);
+					if (worker == -1)
+						maybe_grow = true;
+				}
 			}
 		}
 		LWLockRelease(AioWorkerSubmissionQueueLock);
 
-		for (int i = 0; i < nlatches; ++i)
-			SetLatch(latches[i]);
+		/* Propagate wakeups. */
+		if (worker != -1)
+		{
+			pgaio_worker_wake(worker);
+		}
+		else if (maybe_grow)
+		{
+			/*
+			 * We know there was at least one more item in the queue, and we
+			 * failed to find a higher-numbered idle worker to wake.  Now we
+			 * decide if we should try to start one more worker.
+			 *
+			 * We do this with a simple heuristic: is the queue depth greater
+			 * than the current number of workers?
+			 *
+			 * Consider the following situations:
+			 *
+			 * 1. The queue depth is constantly increasing, because IOs are
+			 * arriving faster than they can possibly be serviced.  It doesn't
+			 * matter much which threshold we choose, as we will surely hit
+			 * it.  Crossing the current worker count is a useful signal
+			 * because it's clearly too deep to avoid queuing latency already,
+			 * but still leaves a small window of opportunity to improve the
+			 * situation before the queue overflows.
+			 *
+			 * 2. The worker pool is keeping up, no latency is being
+			 * introduced and an extra worker would be a waste of resources.
+			 * Queue depth distributions tend to be heavily skewed, with long
+			 * tails of low probability spikes (due to submission clustering,
+			 * scheduling, jitter, stalls, noisy neighbors, etc).  We want a
+			 * number that is very unlikely to be triggered by an outlier, and
+			 * we bet that an exponential or similar distribution whose
+			 * outliers never reach this threshold must be almost entirely
+			 * concentrated at the low end.  If we do see a spike as big as
+			 * the worker count, we take it as a signal that the distribution
+			 * is surely too wide.
+			 *
+			 * On its own, this is an extremely crude signal.  When combined
+			 * with the wakeup propagation test that precedes it and the
+			 * io_worker_launch_interval, we can try each pool size until we find
+			 * one that doesn't trigger further growth.
+			 *
+			 * XXX Ideas from queueing theory or control theory could surely
+			 * do a much better job of this.
+			 */
+
+			/* Read nworkers without lock for this heuristic purpose. */
+			if (queue_depth > io_worker_control->nworkers)
+				pgaio_worker_request_grow();
+		}
 
 		if (io_index != -1)
 		{
 			PgAioHandle *ioh = NULL;
 
+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++hist_ios == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+			{
+				hist_wakeups /= 2;
+				hist_ios /= 2;
+			}
+
 			ioh = &pgaio_ctl->io_handles[io_index];
 			error_ioh = ioh;
 			errcallback.arg = ioh;
@@ -537,6 +877,19 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			}
 #endif
 
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			{
+				char	   *description = pgaio_io_get_target_description(ioh);
+
+				sprintf(cmd, "%d: [%s] %s",
+						MyIoWorkerId,
+						pgaio_io_get_op_name(ioh),
+						description);
+				pfree(description);
+				set_ps_display(cmd);
+			}
+#endif
+
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -550,8 +903,76 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker request if pending. */
+			pgaio_worker_cancel_grow();
+
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* Only the highest-numbered worker can time out. */
+				if (pgaio_worker_can_timeout())
+				{
+					if (idle_timeout_abs == 0)
+					{
+						/*
+						 * I have just been promoted to the timeout worker, or
+						 * the GUC changed.  Compute new absolute time from
+						 * now.
+						 */
+						idle_timeout_abs =
+							TimestampTzPlusMilliseconds(now,
+														io_worker_idle_timeout);
+						timeout_guc_used = io_worker_idle_timeout;
+					}
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now, idle_timeout_abs);
+				}
+				else
+				{
+					/* No timeout for me. */
+					idle_timeout_abs = 0;
+					timeout_ms = -1;
+				}
+			}
+
+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: idle, wakeups:ios = %d:%d",
+					MyIoWorkerId, hist_wakeups, hist_ios);
+			set_ps_display(cmd);
+#endif
+
+			if (WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
+						  timeout_ms,
+						  WAIT_EVENT_IO_WORKER_MAIN) == WL_TIMEOUT)
+			{
+				/* WL_TIMEOUT */
+				if (pgaio_worker_can_timeout())
+					if (GetCurrentTimestamp() >= idle_timeout_abs)
+						break;
+			}
+			else
+			{
+				/* WL_LATCH_SET */
+				if (++hist_wakeups == PGAIO_WORKER_WAKEUP_RATIO_SATURATE)
+				{
+					hist_wakeups /= 2;
+					hist_ios /= 2;
+				}
+			}
 			ResetLatch(MyLatch);
 		}
 
@@ -561,6 +982,10 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		{
 			ConfigReloadPending = false;
 			ProcessConfigFile(PGC_SIGHUP);
+
+			/* If io_max_workers has been decreased, exit highest first. */
+			if (MyIoWorkerId >= io_max_workers)
+				break;
 		}
 	}
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7bda5298558..560659f9568 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -369,6 +369,7 @@ AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
 WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 LogicalDecodingControl	"Waiting to read or update logical decoding status information."
 DataChecksumsWorker	"Waiting for data checksums worker."
+AioWorkerControl	"Waiting to update AIO worker information."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 86c1eba5dab..83af594d4af 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1390,6 +1390,14 @@
   check_hook => 'check_io_max_concurrency',
 },
 
+{ name => 'io_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_max_workers',
+  boot_val => '8',
+  min => '1',
+  max => 'MAX_IO_WORKERS',
+},
+
 { name => 'io_method', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_IO',
   short_desc => 'Selects the method for executing asynchronous I/O.',
   variable => 'io_method',
@@ -1398,14 +1406,32 @@
   assign_hook => 'assign_io_method',
 },
 
-{ name => 'io_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
-  short_desc => 'Number of IO worker processes, for io_method=worker.',
-  variable => 'io_workers',
-  boot_val => '3',
+{ name => 'io_min_workers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum number of I/O worker processes, for io_method=worker.',
+  variable => 'io_min_workers',
+  boot_val => '2',
   min => '1',
   max => 'MAX_IO_WORKERS',
 },
 
+{ name => 'io_worker_idle_timeout', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Maximum time before idle I/O worker processes time out, for io_method=worker.',
+  variable => 'io_worker_idle_timeout',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '60000',
+  min => '0',
+  max => 'INT_MAX',
+},
+
+{ name => 'io_worker_launch_interval', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_IO',
+  short_desc => 'Minimum time before launching a new I/O worker process, for io_method=worker.',
+  variable => 'io_worker_launch_interval',
+  flags => 'GUC_UNIT_MS',
+  boot_val => '100',
+  min => '0',
+  max => 'INT_MAX',
+},
+
 # Not for general use --- used by SET SESSION AUTHORIZATION and SET
 # ROLE
 { name => 'is_superuser', type => 'bool', context => 'PGC_INTERNAL', group => 'UNGROUPED',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f2bbf05295..5e1e49f0ae8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -222,7 +222,11 @@
                                         # can execute simultaneously
                                         # -1 sets based on shared_buffers
                                         # (change requires restart)
-#io_workers = 3                         # 1-32;
+
+#io_min_workers = 2                     # 1-32 (change requires pg_reload_conf())
+#io_max_workers = 8                     # 1-32
+#io_worker_idle_timeout = 60s
+#io_worker_launch_interval = 100ms
 
 # - Worker Processes -
 
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index f7d5998a138..cffffd62fdd 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -17,6 +17,14 @@
 
 pg_noreturn extern void IoWorkerMain(const void *startup_data, size_t startup_data_len);
 
-extern PGDLLIMPORT int io_workers;
+/* Public GUCs. */
+extern PGDLLIMPORT int io_min_workers;
+extern PGDLLIMPORT int io_max_workers;
+extern PGDLLIMPORT int io_worker_idle_timeout;
+extern PGDLLIMPORT int io_worker_launch_interval;
+
+/* Interfaces visible to the postmaster. */
+extern bool pgaio_worker_pm_test_grow(void);
+extern void pgaio_worker_pm_clear_grow(void);
 
 #endif							/* IO_WORKER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index af8553bcb6c..d7eb648bd27 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -88,6 +88,7 @@ PG_LWLOCK(53, AioWorkerSubmissionQueue)
 PG_LWLOCK(54, WaitLSN)
 PG_LWLOCK(55, LogicalDecodingControl)
 PG_LWLOCK(56, DataChecksumsWorker)
+PG_LWLOCK(57, AioWorkerControl)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 001e6eea61c..bcce4011790 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -38,6 +38,7 @@ typedef enum
 	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
 	PMSIGNAL_START_AUTOVAC_LAUNCHER,	/* start an autovacuum launcher */
 	PMSIGNAL_START_AUTOVAC_WORKER,	/* start an autovacuum worker */
+	PMSIGNAL_IO_WORKER_GROW,	/* I/O worker pool wants to grow */
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
diff --git a/src/test/modules/test_aio/t/002_io_workers.pl b/src/test/modules/test_aio/t/002_io_workers.pl
index 34bc132ea08..b9775811d4d 100644
--- a/src/test/modules/test_aio/t/002_io_workers.pl
+++ b/src/test/modules/test_aio/t/002_io_workers.pl
@@ -14,6 +14,9 @@ $node->init();
 $node->append_conf(
 	'postgresql.conf', qq(
 io_method=worker
+io_worker_idle_timeout=0ms
+io_worker_launch_interval=0ms
+io_max_workers=32
 ));
 
 $node->start();
@@ -31,7 +34,7 @@ sub test_number_of_io_workers_dynamic
 {
 	my $node = shift;
 
-	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_workers');
+	my $prev_worker_count = $node->safe_psql('postgres', 'SHOW io_min_workers');
 
 	# Verify that worker count can't be set to 0
 	change_number_of_io_workers($node, 0, $prev_worker_count, 1);
@@ -62,24 +65,24 @@ sub change_number_of_io_workers
 	my ($result, $stdout, $stderr);
 
 	($result, $stdout, $stderr) =
-	  $node->psql('postgres', "ALTER SYSTEM SET io_workers = $worker_count");
+	  $node->psql('postgres', "ALTER SYSTEM SET io_min_workers = $worker_count");
 	$node->safe_psql('postgres', 'SELECT pg_reload_conf()');
 
 	if ($expect_failure)
 	{
 		like(
 			$stderr,
-			qr/$worker_count is outside the valid range for parameter "io_workers"/,
-			"updating number of io_workers to $worker_count failed, as expected"
+			qr/$worker_count is outside the valid range for parameter "io_min_workers"/,
+			"updating io_min_workers to $worker_count failed, as expected"
 		);
 
 		return $prev_worker_count;
 	}
 	else
 	{
-		is( $node->safe_psql('postgres', 'SHOW io_workers'),
+		is( $node->safe_psql('postgres', 'SHOW io_min_workers'),
 			$worker_count,
-			"updating number of io_workers from $prev_worker_count to $worker_count"
+			"updating number of io_min_workers from $prev_worker_count to $worker_count"
 		);
 
 		check_io_worker_count($node, $worker_count);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2dfe1b38826..3dea516912c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2271,6 +2271,7 @@ PgAioUringCaps
 PgAioUringContext
 PgAioWaitRef
 PgAioWorkerControl
+PgAioWorkerSet
 PgAioWorkerSlot
 PgAioWorkerSubmissionQueue
 PgArchData
-- 
2.53.0
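
For readers skimming the patch, here is a rough standalone sketch of the
three per-worker decisions described in its comments: the decaying
wakeups:ios history, the "queue depth greater than pool size" growth test,
and the "only the highest-numbered worker above io_min_workers may time
out" rule.  It is only an illustration.  The struct, the function names
and the saturation value of 8 are assumptions made for this sketch (the
patch defines PGAIO_WORKER_WAKEUP_RATIO_SATURATE elsewhere), and the
locking, latches and postmaster signalling are left out entirely.

```c
#include <stdbool.h>

/* Assumed value for illustration; the patch defines its own constant. */
#define WAKEUP_RATIO_SATURATE 8

/* Hypothetical per-worker history, mirroring hist_ios/hist_wakeups. */
typedef struct WorkerHistory
{
	int			ios;		/* loops that found an IO to run */
	int			wakeups;	/* latch wakeups received */
} WorkerHistory;

/* Halve both counters so that old observations fade out. */
static inline void
history_decay(WorkerHistory *h)
{
	h->ios /= 2;
	h->wakeups /= 2;
}

static inline void
history_count_io(WorkerHistory *h)
{
	if (++h->ios == WAKEUP_RATIO_SATURATE)
		history_decay(h);
}

static inline void
history_count_wakeup(WorkerHistory *h)
{
	if (++h->wakeups == WAKEUP_RATIO_SATURATE)
		history_decay(h);
}

/* Wake a higher-numbered peer only if we aren't woken spuriously ourselves. */
static inline bool
should_propagate_wakeup(const WorkerHistory *h)
{
	return h->wakeups <= h->ios;
}

/* Request one more worker only when the backlog exceeds the pool size. */
static inline bool
should_request_grow(int queue_depth, int nworkers)
{
	return queue_depth > nworkers;
}

/* Only the highest-numbered worker may time out, and only above the minimum. */
static inline bool
can_timeout(int my_id, int highest_id, int min_workers)
{
	return my_id == highest_id && my_id >= min_workers;
}
```

Since worker IDs are kept dense (0..N-1), a worker with ID equal to
min_workers is the (min_workers + 1)th worker, which is why the last test
uses >= rather than >.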


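The patch replaces the open-coded idle_worker_mask bit operations with
PgAioWorkerSet helpers (insert, remove, get_lowest, get_highest,
pop_lowest and so on).  The actual type is defined in a part of the patch
not shown in this excerpt, so the following is only a guess at the general
shape, assuming a single 64-bit mask with one bit per worker ID.  All names
here are hypothetical, and a portable loop is used instead of a bit-scan
primitive such as pg_rightmost_one_pos64().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PgAioWorkerSet: one bit per worker ID (0..63). */
typedef struct WorkerSet
{
	uint64_t	mask;
} WorkerSet;

static inline void
workerset_insert(WorkerSet *set, int worker)
{
	set->mask |= UINT64_C(1) << worker;
}

static inline void
workerset_remove(WorkerSet *set, int worker)
{
	set->mask &= ~(UINT64_C(1) << worker);
}

static inline bool
workerset_is_empty(const WorkerSet *set)
{
	return set->mask == 0;
}

/* Lowest member, or -1 if the set is empty. */
static inline int
workerset_get_lowest(const WorkerSet *set)
{
	for (int i = 0; i < 64; ++i)
		if (set->mask & (UINT64_C(1) << i))
			return i;
	return -1;
}

/* Highest member, or -1 if the set is empty. */
static inline int
workerset_get_highest(const WorkerSet *set)
{
	for (int i = 63; i >= 0; --i)
		if (set->mask & (UINT64_C(1) << i))
			return i;
	return -1;
}

/* Remove and return the lowest member, or -1 if the set is empty. */
static inline int
workerset_pop_lowest(WorkerSet *set)
{
	int			worker = workerset_get_lowest(set);

	if (worker >= 0)
		workerset_remove(set, worker);
	return worker;
}
```

Because the set is a plain value, a caller can copy it while holding the
lock and then iterate over the copy with pop_lowest after releasing it,
which is the pattern pgaio_workerset_wake() appears to rely on.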


* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
@ 2026-04-11 06:35           ` Thomas Munro <[email protected]>
  2026-04-14 10:26             ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Thomas Munro @ 2026-04-11 06:35 UTC (permalink / raw)
  To: Dmitry Dolgov <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, Jul 30, 2025 at 10:15 PM Dmitry Dolgov <[email protected]> wrote:
> As a side note, I was trying to experiment with this patch using
> dm-mapper's delay feature to introduce an arbitrary large io latency and
> see how the io queue is growing.

FWIW, here's what I came up with while experimenting with that sort of thing:

      shared_preload_libraries=io_limit
      io_limit.ios_per_second=6000

That differs from, e.g., dm-mapper delays by making everything seem like
slow direct I/O, which seemed more interesting for this project.  For
example, if you run some continuous workload while you SET
io_limit.ios_per_second to various numbers, with
io_worker_idle_timeout set fairly low, you can monitor the pool
adjustments.
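
The pacing idea behind the module can be sketched roughly as follows: a
rate limit is converted into a per-operation delay, and a shared "next
allowed time" is pushed forward by that delay for every operation.  This
is a simplified single-threaded illustration, not the module's code.  The
Pacer type and function names are made up for this sketch, the current
time is passed in explicitly, and the function returns the time to sleep
instead of sleeping.  The attached patch does the equivalent with atomic
operations on shared memory and pg_usleep().

```c
#include <stdint.h>

#define NS_PER_SECOND UINT64_C(1000000000)

/* Hypothetical pacer state. */
typedef struct Pacer
{
	uint64_t	next_ns;	/* earliest start time of the next operation */
	uint64_t	delay_ns;	/* spacing between operations; 0 = no limit */
} Pacer;

/* Convert an operations-per-second limit into a spacing in nanoseconds. */
static inline void
pacer_set_rate(Pacer *p, uint64_t per_second)
{
	p->delay_ns = per_second == 0 ? 0 : NS_PER_SECOND / per_second;
}

/*
 * Reserve a slot for one operation arriving at now_ns.  Returns how long the
 * caller should sleep before performing it.
 */
static inline uint64_t
pacer_reserve(Pacer *p, uint64_t now_ns)
{
	uint64_t	wait_ns;

	if (p->delay_ns == 0)
		return 0;

	if (p->next_ns > now_ns)
	{
		/* Slot is in the future: wait for it, and delay the next op further. */
		wait_ns = p->next_ns - now_ns;
		p->next_ns += p->delay_ns;
	}
	else
	{
		/* No need to wait.  The next slot is relative to now. */
		wait_ns = 0;
		p->next_ns = now_ns + p->delay_ns;
	}

	return wait_ns;
}
```

With a limit of 1000 per second the spacing is 1ms, so two operations
arriving at the same instant get waits of 0 and 1ms respectively, and the
average rate holds even if individual sleeps are imprecise.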


Attachments:

  [text/x-patch] 0001-contrib-io_limit-Simulation-of-slow-storage.patch (12.4K, 2-0001-contrib-io_limit-Simulation-of-slow-storage.patch)
From 6ecfe2226c9068a82b7c54094db55354960a70bb Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 11 Apr 2026 17:31:13 +1200
Subject: [PATCH] contrib/io_limit: Simulation of slow storage.

Only affects IOs submitted to io_method=worker.  Configured as:

  shared_preload_libraries=io_limit

  io_limit.ios_per_second=1000
  io_limit.read_per_second=200MB
  io_limit.write_per_second=100MB

Zero means no limit.

XXX Experimental hack
---
 contrib/Makefile                        |   1 +
 contrib/io_limit/Makefile               |  20 ++
 contrib/io_limit/io_limit.c             | 275 ++++++++++++++++++++++++
 contrib/io_limit/io_limit.control       |   5 +
 contrib/io_limit/meson.build            |  28 +++
 contrib/meson.build                     |   1 +
 src/backend/storage/aio/method_worker.c |  13 ++
 src/include/storage/io_worker.h         |   5 +
 8 files changed, 348 insertions(+)
 create mode 100644 contrib/io_limit/Makefile
 create mode 100644 contrib/io_limit/io_limit.c
 create mode 100644 contrib/io_limit/io_limit.control
 create mode 100644 contrib/io_limit/meson.build

diff --git a/contrib/Makefile b/contrib/Makefile
index 7d91fe77db3..48e82c53333 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
 		hstore		\
 		intagg		\
 		intarray	\
+		io_limit	\
 		isn		\
 		lo		\
 		ltree		\
diff --git a/contrib/io_limit/Makefile b/contrib/io_limit/Makefile
new file mode 100644
index 00000000000..da176698a17
--- /dev/null
+++ b/contrib/io_limit/Makefile
@@ -0,0 +1,20 @@
+# contrib/io_limit/Makefile
+
+MODULE_big = io_limit
+OBJS = \
+	$(WIN32RES) \
+	io_limit.o
+
+EXTENSION = io_limit
+PGFILEDESC = "io_limit - artificially limit asynchronous I/O for testing"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/io_limit
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/io_limit/io_limit.c b/contrib/io_limit/io_limit.c
new file mode 100644
index 00000000000..fa2ec6f1ff2
--- /dev/null
+++ b/contrib/io_limit/io_limit.c
@@ -0,0 +1,275 @@
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/atomics.h"
+#include "portability/instr_time.h"
+#include "storage/aio_internal.h"
+#include "storage/io_worker.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/guc.h"
+
+/* GUCs. */
+static int	io_limit_ios_per_second = 0;
+static int	io_limit_read_per_second = 0;
+static int	io_limit_write_per_second = 0;
+
+typedef struct io_limit_control_data
+{
+	/* Whether any GUC is set to a non-zero value. */
+	bool		enabled;
+
+	/* Absolute time to wait until. */
+	pg_atomic_uint64 op_next_ns;
+	pg_atomic_uint64 read_next_ns;
+	pg_atomic_uint64 write_next_ns;
+
+	/* Limits expressed as delay intervals. */
+	LWLock		lock;
+	int			op_ns;
+	int			read_block_ns;
+	int			write_block_ns;
+}			io_limit_control_data;
+
+static io_limit_control_data * io_limit_control;
+
+static void io_limit_shmem_request(void *arg);
+static void io_limit_shmem_init(void *arg);
+
+static void assign_io_limit_ios_per_second(int newval, void *extra);
+static void assign_io_limit_read_per_second(int newval, void *extra);
+static void assign_io_limit_write_per_second(int newval, void *extra);
+static const char *show_io_limit_ios_per_second(void);
+static const char *show_io_limit_read_per_second(void);
+static const char *show_io_limit_write_per_second(void);
+
+static void io_limit_on_perform(PgAioHandle *ioh);
+
+static const ShmemCallbacks io_limit_shmem_callbacks = {
+	.request_fn = io_limit_shmem_request,
+	.init_fn = io_limit_shmem_init,
+};
+
+PG_MODULE_MAGIC_EXT(
+					.name = "io_limit",
+					.version = PG_VERSION
+);
+
+void
+_PG_init(void)
+{
+	/* Bail out if not configured in shared_preload_libraries. */
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("io_limit.ios_per_second",
+							"Limits IOs per second.",
+							"If set to zero, there is no limit.",
+							&io_limit_ios_per_second,
+							0,
+							0, INT_MAX,
+							PGC_USERSET,
+							0,
+							NULL,
+							assign_io_limit_ios_per_second,
+							show_io_limit_ios_per_second);
+	DefineCustomIntVariable("io_limit.read_per_second",
+							"Limits read bandwidth.",
+							"If set to zero, there is no limit.",
+							&io_limit_read_per_second,
+							0,
+							0, INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_BLOCKS,
+							NULL,
+							assign_io_limit_read_per_second,
+							show_io_limit_read_per_second);
+	DefineCustomIntVariable("io_limit.write_per_second",
+							"Limits write bandwidth.",
+							"If set to zero, there is no limit.",
+							&io_limit_write_per_second,
+							0,
+							0, INT_MAX,
+							PGC_USERSET,
+							GUC_UNIT_BLOCKS,
+							NULL,
+							assign_io_limit_write_per_second,
+							show_io_limit_write_per_second);
+
+	MarkGUCPrefixReserved("io_limit");
+	RegisterShmemCallbacks(&io_limit_shmem_callbacks);
+	pgaio_worker_set_on_perform_hook(io_limit_on_perform);
+}
+
+static void
+io_limit_shmem_request(void *arg)
+{
+	ShmemRequestStruct(.name = "io_limit",
+					   .size = sizeof(io_limit_control_data),
+					   .ptr = (void **) &io_limit_control);
+}
+
+static void
+io_limit_shmem_init(void *arg)
+{
+	memset(io_limit_control, 0, sizeof(*io_limit_control));
+	pg_atomic_init_u64(&io_limit_control->op_next_ns, 0);
+	pg_atomic_init_u64(&io_limit_control->read_next_ns, 0);
+	pg_atomic_init_u64(&io_limit_control->write_next_ns, 0);
+	LWLockInitialize(&io_limit_control->lock, LWLockNewTrancheId("io_limit"));
+
+	/* Assign initial values. */
+	assign_io_limit_ios_per_second(io_limit_ios_per_second, NULL);
+	assign_io_limit_read_per_second(io_limit_read_per_second, NULL);
+	assign_io_limit_write_per_second(io_limit_write_per_second, NULL);
+}
+
+static void
+assign_io_limit(int *wait_ns, int per_second)
+{
+	/* Ignore call from _PG_init() before ready. */
+	if (!io_limit_control)
+		return;
+
+	LWLockAcquire(&io_limit_control->lock, LW_EXCLUSIVE);
+	*wait_ns = per_second == 0 ? 0 : NS_PER_S / per_second;
+	io_limit_control->enabled =
+		io_limit_control->op_ns > 0 ||
+		io_limit_control->read_block_ns > 0 ||
+		io_limit_control->write_block_ns > 0;
+	LWLockRelease(&io_limit_control->lock);
+}
+
+static void
+assign_io_limit_ios_per_second(int newval, void *extra)
+{
+	assign_io_limit(&io_limit_control->op_ns, newval);
+}
+
+static void
+assign_io_limit_read_per_second(int newval, void *extra)
+{
+	assign_io_limit(&io_limit_control->read_block_ns, newval);
+}
+
+static void
+assign_io_limit_write_per_second(int newval, void *extra)
+{
+	assign_io_limit(&io_limit_control->write_block_ns, newval);
+}
+
+static const char *
+show_io_limit(const int *wait_ns)
+{
+	int			per_second;
+
+	LWLockAcquire(&io_limit_control->lock, LW_SHARED);
+	per_second = *wait_ns == 0 ? 0 : NS_PER_S / *wait_ns;
+	LWLockRelease(&io_limit_control->lock);
+
+	return psprintf("%d", per_second);
+}
+
+static const char *
+show_io_limit_ios_per_second(void)
+{
+	return show_io_limit(&io_limit_control->op_ns);
+}
+
+static const char *
+show_io_limit_read_per_second(void)
+{
+	return show_io_limit(&io_limit_control->read_block_ns);
+}
+
+static const char *
+show_io_limit_write_per_second(void)
+{
+	return show_io_limit(&io_limit_control->write_block_ns);
+}
+
+static BlockNumber
+io_limit_get_block_count(PgAioHandle *ioh)
+{
+	if (ioh->op == PGAIO_OP_READV ||
+		ioh->op == PGAIO_OP_WRITEV)
+	{
+		struct iovec *iov;
+		size_t		size;
+		int			iovcnt;
+
+		size = 0;
+		iovcnt = pgaio_io_get_iovec_length(ioh, &iov);
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		return size / BLCKSZ;
+	}
+
+	return 0;
+}
+
+/*
+ * Wait until *next_ns_p and advance *next_ns_p by delay_ns.
+ */
+static void
+io_limit_wait(pg_atomic_uint64 *next_ns_p, int delay_ns)
+{
+	instr_time	now;
+	uint64		now_ns;
+	uint64		next_ns;
+
+	INSTR_TIME_SET_CURRENT(now);
+	now_ns = INSTR_TIME_GET_NANOSEC(now);
+	next_ns = pg_atomic_read_u64(next_ns_p);
+
+	for (;;)
+	{
+		if (next_ns > now_ns)
+		{
+			/* Need to wait.  Delay the next op further. */
+			next_ns = pg_atomic_fetch_add_u64(next_ns_p, delay_ns);
+
+			/* Average rate maintained even with low-res sleep or EINTR. */
+			pg_usleep(((next_ns - now_ns) + 999) / 1000);
+			break;
+		}
+		else
+		{
+			/* Don't need to wait.  New next_ns is relative to now. */
+			if (pg_atomic_compare_exchange_u64(next_ns_p,
+											   &next_ns,
+											   now_ns + delay_ns))
+				break;
+		}
+	}
+}
+
+static void
+io_limit_on_perform(PgAioHandle *ioh)
+{
+	int			op_ns;
+	int			read_block_ns;
+	int			write_block_ns;
+
+	if (!io_limit_control->enabled)
+		return;
+
+	op_ns = io_limit_control->op_ns;
+	if (op_ns)
+		io_limit_wait(&io_limit_control->op_next_ns, op_ns);
+
+	if (ioh->op == PGAIO_OP_READV)
+	{
+		read_block_ns = io_limit_control->read_block_ns;
+		if (read_block_ns)
+			io_limit_wait(&io_limit_control->read_next_ns,
+						  io_limit_get_block_count(ioh) * read_block_ns);
+	}
+	else if (ioh->op == PGAIO_OP_WRITEV)
+	{
+		write_block_ns = io_limit_control->write_block_ns;
+		io_limit_wait(&io_limit_control->write_next_ns,
+					  io_limit_get_block_count(ioh) * write_block_ns);
+	}
+}
diff --git a/contrib/io_limit/io_limit.control b/contrib/io_limit/io_limit.control
new file mode 100644
index 00000000000..2f8f06c9e87
--- /dev/null
+++ b/contrib/io_limit/io_limit.control
@@ -0,0 +1,5 @@
+# io_limit extension
+comment = 'io_limit'
+default_version = '1.0'
+module_pathname = '$libdir/io_limit'
+relocatable = true
diff --git a/contrib/io_limit/meson.build b/contrib/io_limit/meson.build
new file mode 100644
index 00000000000..1d26a08de83
--- /dev/null
+++ b/contrib/io_limit/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2022-2026, PostgreSQL Global Development Group
+
+io_limit_sources = files(
+  'io_limit.c',
+)
+
+if host_system == 'windows'
+  io_limit_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'io_limit',
+    '--FILEDESC', 'io_limit - artificially limit asynchronous I/O for testing',])
+endif
+
+io_limit = shared_module('io_limit',
+  io_limit_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += io_limit
+
+install_data(
+  'io_limit.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'io_limit',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+}
diff --git a/contrib/meson.build b/contrib/meson.build
index ebb7f83d8c5..398b0d704b5 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -34,6 +34,7 @@ subdir('hstore_plperl')
 subdir('hstore_plpython')
 subdir('intagg')
 subdir('intarray')
+subdir('io_limit')
 subdir('isn')
 subdir('jsonb_plperl')
 subdir('jsonb_plpython')
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index a5ccd506d8c..87afcf856e1 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -139,6 +139,7 @@ static int	MyIoWorkerId = -1;
 static PgAioWorkerSubmissionQueue *io_worker_submission_queue;
 static PgAioWorkerControl *io_worker_control;
 
+static io_worker_on_perform_fn io_worker_on_perform_hook;
 
 static void
 pgaio_workerset_initialize(PgAioWorkerSet *set)
@@ -529,6 +530,9 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 		for (int i = 0; i < nsync; ++i)
 		{
 			pgaio_io_perform_synchronously(synchronous_ios[i]);
+
+			if (io_worker_on_perform_hook)
+				io_worker_on_perform_hook(synchronous_ios[i]);
 		}
 	}
 
@@ -929,6 +933,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 			 */
 			pgaio_io_perform_synchronously(ioh);
 
+			if (io_worker_on_perform_hook)
+				io_worker_on_perform_hook(ioh);
+
 			RESUME_INTERRUPTS();
 			errcallback.arg = NULL;
 		}
@@ -1024,6 +1031,12 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	proc_exit(0);
 }
 
+void
+pgaio_worker_set_on_perform_hook(io_worker_on_perform_fn fn)
+{
+	io_worker_on_perform_hook = fn;
+}
+
 bool
 pgaio_workers_enabled(void)
 {
diff --git a/src/include/storage/io_worker.h b/src/include/storage/io_worker.h
index c852c9f3741..c9ef49a585d 100644
--- a/src/include/storage/io_worker.h
+++ b/src/include/storage/io_worker.h
@@ -28,4 +28,9 @@ extern bool pgaio_worker_pm_test_grow_signal_sent(void);
 extern void pgaio_worker_pm_clear_grow_signal_sent(void);
 extern bool pgaio_worker_pm_test_grow(void);
 
+/* Hook to support contrib/io_limit. */
+typedef void (*io_worker_on_perform_fn) (PgAioHandle *handle);
+extern void pgaio_worker_set_on_perform_hook(io_worker_on_perform_fn fn);
+
+
 #endif							/* IO_WORKER_H */
-- 
2.53.0




* Re: Automatically sizing the IO worker pool
  2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-24 19:20 ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-05-26 06:00   ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-05-27 17:55     ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2025-07-12 05:08       ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
  2025-07-30 10:14         ` Re: Automatically sizing the IO worker pool Dmitry Dolgov <[email protected]>
  2026-04-11 06:35           ` Re: Automatically sizing the IO worker pool Thomas Munro <[email protected]>
@ 2026-04-14 10:26             ` Dmitry Dolgov <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: Dmitry Dolgov @ 2026-04-14 10:26 UTC (permalink / raw)
  To: Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

> On Sat, Apr 11, 2026 at 06:35:18PM +1200, Thomas Munro wrote:
> On Wed, Jul 30, 2025 at 10:15 PM Dmitry Dolgov <[email protected]> wrote:
> > As a side note, I was trying to experiment with this patch using
> > dm-mapper's delay feature to introduce an arbitrary large io latency and
> > see how the io queue is growing.
> 
> FWIW, here's what I came up with while experimenting with that sort of thing:
> 
>       shared_preload_libraries=io_limit
>       io_limit.ios_per_second=6000
> 
> That differs from eg dm-mapper delays by making everything seem like
> slow direct I/O, which seemed more interesting for this project.  For
> example if you run some continuous workload while you SET
> io_limit.ios_per_second to various numbers, with
> io_workers_idle_timeout set fairly low, you can monitor the pool
> adjustments.

Yeah, sounds like a good idea. Do you plan to introduce such an
extension long term for testing, or is it just a one-off?

To me it looks worth keeping, maybe even using injection points to
allow for more flexibility. And I know I sound like a broken record, but
if I understand correctly, the delays introduced via ios_per_second and
the others are constant in time. I've experimented a bit and found some
reference implementations in numpy for geometric distribution sampling,
which would allow making the delay a random variable. Since the geometric
distribution is the discrete analog of the exponential one, and the
latter describes the delays between events in a Poisson process, such a
random variable would approximate a more realistic load.




Thread overview: 24+ messages
2025-04-12 16:59 Automatically sizing the IO worker pool Thomas Munro <[email protected]>
2025-04-13 17:45 ` Jose Luis Tallon <[email protected]>
2025-05-24 19:20 ` Dmitry Dolgov <[email protected]>
2025-05-26 02:17   ` wenhui qiu <[email protected]>
2025-05-26 06:00   ` Thomas Munro <[email protected]>
2025-05-26 22:54     ` Thomas Munro <[email protected]>
2025-05-27 17:55     ` Dmitry Dolgov <[email protected]>
2025-07-12 05:08       ` Thomas Munro <[email protected]>
2025-07-30 10:14         ` Dmitry Dolgov <[email protected]>
2025-08-04 05:30           ` Thomas Munro <[email protected]>
2026-03-28 09:31             ` Thomas Munro <[email protected]>
2026-04-06 15:02               ` Thomas Munro <[email protected]>
2026-04-06 18:14                 ` Andres Freund <[email protected]>
2026-04-07 10:39                   ` Thomas Munro <[email protected]>
2026-04-07 19:01                     ` Andres Freund <[email protected]>
2026-04-07 23:18                       ` Thomas Munro <[email protected]>
2026-04-08 00:30                         ` Andres Freund <[email protected]>
2026-04-08 02:09                           ` Thomas Munro <[email protected]>
2026-04-08 02:20                             ` Andres Freund <[email protected]>
2026-04-08 02:47                               ` Thomas Munro <[email protected]>
2026-04-08 02:24                             ` Thomas Munro <[email protected]>
2026-04-08 00:30                         ` Thomas Munro <[email protected]>
2026-04-11 06:35           ` Thomas Munro <[email protected]>
2026-04-14 10:26             ` Dmitry Dolgov <[email protected]>
