Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uwp8D-007irD-9R for pgsql-hackers@arkaria.postgresql.org; Thu, 11 Sep 2025 21:47:05 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uwp89-00AEZ0-6M for pgsql-hackers@arkaria.postgresql.org; Thu, 11 Sep 2025 21:47:01 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uwp87-00AEYm-Qd for pgsql-hackers@lists.postgresql.org; Thu, 11 Sep 2025 21:47:01 +0000 Received: from fout-a4-smtp.messagingengine.com ([103.168.172.147]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1uwp82-000GQG-0U for pgsql-hackers@postgresql.org; Thu, 11 Sep 2025 21:46:59 +0000 Received: from phl-compute-03.internal (phl-compute-03.internal [10.202.2.43]) by mailfout.phl.internal (Postfix) with ESMTP id 6D08DEC0342; Thu, 11 Sep 2025 17:46:53 -0400 (EDT) Received: from phl-mailfrontend-02 ([10.202.2.163]) by phl-compute-03.internal (MEProxy); Thu, 11 Sep 2025 17:46:53 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:content-transfer-encoding:content-type:content-type:date:date :from:from:in-reply-to:message-id:mime-version:reply-to:subject :subject:to:to; s=fm1; t=1757627213; x=1757713613; bh=fVRpVSxZYf OZSTML4iEzU0zN+uWGqk6gpTaljfnbSLg=; b=B5laDEEgIW9Zo1x6pMEPdUxks8 /8+bv+oV200F9f042k7tQIAsMrNxzLaZKIKtW1ZLAvVUFYHzDQC7s+CdZOIydMyJ zMlbaxLvFWnz0uuUOHVF8KpCJfv4SP8nkpX0AFeEzpPrMQtCAJeFPe9JMQPToIGo 6UGUrQuMXA6snpZgL+MNsM4EtxTM7W08BKQABfoaqQMzwZfcXDn+jl4Bm6puFvcw j50cCIjqyhgHeZEF0U5TkFDozQvvaoo8xnDOo6ptuK8ckn4hRcs7LDGAZYpBurYK eQ1tWOnZkLh+WPdTULN+V/RxwUF80saDpSaF4uQHrZY6MruvdKft1xWvF8iQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :content-type:date:date:feedback-id:feedback-id:from:from :in-reply-to:message-id:mime-version:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t= 1757627213; x=1757713613; bh=fVRpVSxZYfOZSTML4iEzU0zN+uWGqk6gpTa ljfnbSLg=; b=FoADU3Zb4l/VhncAdZ22t6YzsnpgEWFXFzr2UnwSBD8eybdMDSY NqalPaotnmc/qrq1i7Pu9N8a+2GX4K3xcl0QCKbgGDI3FbyGNYlAQCOh+kJtb0u6 68g1UuN1XohJzdsV70zzwNogsBoic56RcUEbaZMYCH0BKfurObLUEAemG81G8nPH jDVJ3TuCTW+DMdBCNK9a/jMplV5VBfGbBCaIzziXnQmU5/ZYrSWfnI0ebrt5vm8y 550F1bPHsqb4pB6j98EHNPpsgEsbiRUlpZd1dIMdO9nAjRxMTtDVZooTFBDKOb2h TbSpVYDjpiPQRIrsPRd1coslQfNhYwrYBmQ== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdeggddvjedvkecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpeffhffvuffkgggtugfgsehmkefsredttdejnecuhfhrohhmpeetnhgurhgvshcuhfhr vghunhguuceorghnughrvghssegrnhgrrhgriigvlhdruggvqeenucggtffrrghtthgvrh hnpeetfeffffffffetjefhhfekfefgieduffdvveehfefhleetjedttddtheehtefgueen ucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpegrnhgurh gvshesrghnrghrrgiivghlrdguvgdpnhgspghrtghpthhtohepgedpmhhouggvpehsmhht phhouhhtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepthhvsehfuh iiiiihrdgtiidprhgtphhtthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtgho mhdprhgtphhtthhopehpghhsqhhlqdhhrggtkhgvrhhssehpohhsthhgrhgvshhqlhdroh hrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 11 Sep 2025 17:46:52 -0400 (EDT) Date: Thu, 11 Sep 2025 17:46:51 -0400 From: Andres Freund To: pgsql-hackers@postgresql.org, Peter Geoghegan , Tomas Vondra , Thomas Munro Subject: Don't synchronously wait for already-in-progress IO in read stream Message-ID: MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="aknbaqiolq3mrqjk" Content-Disposition: inline Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --aknbaqiolq3mrqjk Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit Hi, In the index prefetching thread we discovered that read stream performance suffers rather substantially when a read stream is reading blocks multiple times within the readahead distance. The problem leading to that is that we are currently synchronously waiting for IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing IO. If a block is read twice, that means we won't actually have enough IOs in flight to have good performance. What's worse, currently the wait is not attributed to IO wait (since we're waiting in WaitIO, rather than waiting for IO). This does not commonly occur with in-tree users of read streams, as users like seqscans, bitmap heap scans, vacuum, ... will never try to read the same block twice. However with index prefetching that is a more common case. It is possible to encounter a version of this in 18/master: If multiple scans for the same table are in progress, they can end up waiting synchronously for each other. However it's a much less severe issue, as the scan that is "further ahead" will not be blocked. To fix it, the attached patch has AsyncReadBuffers() check if the "target" buffer already has IO in progress. If so, it assing the buffer's IO wait reference to the ReadBuffersOperation. That allows WaitReadBuffers() to wait for the IO. To make that work correctly, the buffer stats etc have to be updated in that case in WaitReadBuffers(). I did not feel like I was sufficiently confident in making this work without tests. However, it's not exactly trivial to test some versions of this, due to the way multiple processes need to be coordinated. It took way way longer to write tests than to make the code actually work :/. The attached tests add a new read_stream_for_blocks() function to test_aio. I found it also rather useful to reproduce the performance issue without the index prefetching patch applied. To test the cross process case the injection point infrastructure in test_aio had to be extended a bit. Attached are three patches: 0001: Introduces a TestAio package and splits out some existing tests out of 001_aio.pl 0002: Add read_stream test infrastructure & tests Note that the tests don't test that we don't unnecessarily wait, as described above, just that IO works correctly. 0003: Improve performance of read stream when re-encountering blocks To reproduce the issue, the read_stream_for_blocks() function added to test_aio can be used, in combination with debug_io_direct=data (it's also possible without DIO, it'd just be more work). prep: CREATE EXTENSION test_aio; CREATE TABLE large AS SELECT i, repeat(random()::text, 5) FROM generate_series(1, 1000000) g(i); test: SELECT pg_buffercache_evict_relation('large'); EXPLAIN ANALYZE SELECT * FROM read_stream_for_blocks('large', ARRAY(SELECT i + generate_series(0, 3) FROM generate_series(1, 10000) g(i))); Without 0003 applied: ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Function Scan on read_stream_for_blocks (cost=975.00..985.00 rows=1000 width=12) (actual time=673.647..675.254 rows=40000.00 loops=1) │ │ Buffers: shared hit=29997 read=10003 │ │ I/O Timings: shared read=16.116 │ │ InitPlan 1 │ │ -> Result (cost=0.00..975.00 rows=40000 width=4) (actual time=0.556..7.575 rows=40000.00 loops=1) │ │ -> ProjectSet (cost=0.00..375.00 rows=40000 width=8) (actual time=0.556..4.804 rows=40000.00 loops=1) │ │ -> Function Scan on generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=0.554..0.988 rows=10000.00 loops=1) │ │ Planning Time: 0.060 ms │ │ Execution Time: 676.436 ms │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (9 rows) Time: 676.753 ms With 0003: ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Function Scan on read_stream_for_blocks (cost=975.00..985.00 rows=1000 width=12) (actual time=87.730..89.453 rows=40000.00 loops=1) │ │ Buffers: shared hit=29997 read=10003 │ │ I/O Timings: shared read=50.467 │ │ InitPlan 1 │ │ -> Result (cost=0.00..975.00 rows=40000 width=4) (actual time=0.541..7.496 rows=40000.00 loops=1) │ │ -> ProjectSet (cost=0.00..375.00 rows=40000 width=8) (actual time=0.540..4.772 rows=40000.00 loops=1) │ │ -> Function Scan on generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=0.539..0.965 rows=10000.00 loops=1) │ │ Planning Time: 0.046 ms │ │ Execution Time: 90.661 ms │ └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (9 rows) Time: 90.887 ms Greetings, Andres Freund --aknbaqiolq3mrqjk Content-Type: text/x-diff; charset=us-ascii Content-Disposition: attachment; filename="v2-0001-aio-Refactor-tests-in-preparation-for-more-tests.patch"