Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uqvFe-00FFmt-NR for pgsql-hackers@arkaria.postgresql.org; Tue, 26 Aug 2025 15:06:24 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uqvFe-007IAB-5B for pgsql-hackers@arkaria.postgresql.org; Tue, 26 Aug 2025 15:06:22 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uqvFd-007IA3-Qw for pgsql-hackers@lists.postgresql.org; Tue, 26 Aug 2025 15:06:22 +0000 Received: from relay4-d.mail.gandi.net ([217.70.183.196]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1uqvFb-001yGz-1d for pgsql-hackers@lists.postgresql.org; Tue, 26 Aug 2025 15:06:21 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id B2C2543B4A; Tue, 26 Aug 2025 15:06:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1756220778; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=BWFP1/62vY+bysowKP/sQESeKFkZ32ZLI8lJRkmFbGU=; b=XYFHms556DaCZyanHKcFv06uwrEHgDHfuESoBsYMlQzyt+9xT6iApfEgt2SXEIxNZxZ/9y ABubuKLx4QuH/QlLNZQ0OERy9IR5e1ZWb0pvAW17PsmeHptKty/4PkzEa6aCVh2HiXiUyE Zyumklc8ELBGEPtvoKh0CKX1REAik4axVuFA9bpL55T1JfZGsfFGFFkUCb6ggx2Xw3wlaZ DBbT8HJz3KD2KEcF4JdKKRRA2ryIzT62vvsCK33o07KJbe1e8kCbfjmNDeJxJGuP15LfG2 Ecs62iDw+Ge8t0fT6NoK1wzglBTuxrZCnrcRb7B92YkT/x0k0Aii+kJTF77LNA== Message-ID: <0dd33755-cab8-49c8-b1ed-698732577fbb@vondra.me> Date: Tue, 26 Aug 2025 17:06:11 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Andres Freund Cc: Peter Geoghegan , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok> <5v2wuxg65l5e3s6uf373zskcqqoukmraxiucnvgn4t7b5cmeqx@5mhqsurdj6xn> <6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgddujeehheekucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthejredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepledugeeikefglefhgfffuedvleetteevgefhvdeikeefudduuddvhfevudefhfevnecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhnsggprhgtphhtthhopedutddprhgtphhtthhopegrnhgurhgvshesrghnrghrrgiivghlrdguvgdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopegshigrvhhuiiekudesghhmrghilhdrtghomhdprhgtphhtthhopehrohgsvghrthhmhhgrrghssehgmhgrihhlrdgtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghilhdrtghomhdprhgtphhtt hhopehpghhsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrghdprhgtphhtthhopehgkhhokhholhgrthhoshesphhrohhtohhnmhgrihhlrdgtohhm X-GND-Sasl: tomas@vondra.me List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 8/26/25 01:48, Andres Freund wrote: > Hi, > > On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: >> Thanks. Based on the testing so far, the patch seems to be a substantial >> improvement. What's needed to make this prototype committable? > > Mainly some testing infrastructure that can trigger this kind of stream. The > logic is too finnicky for me to commit it without that. > So, what would that look like? The "naive" approach to testing is to simply generate a table/index, producing the right sequence of blocks. That shouldn't be too hard, it'd be enough to have an index that - has ~2-3 rows per value, on different heap pages - the values "overlap", e.g. like this (value,page) (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ... Another approach would be to test this at C level, sidestepping the query execution entirely. We'd have a "stream generator" that just generates a sequence of blocks of our own choosing (could be hard-coded, some pattern, read from a file ...), and feed it into a read stream. But how would we measure success for these tests? I don't think we want to look at query duration, that's very volatile. > >> I assume this is PG19+ improvement, right? It probably affects PG18 too, >> but it's harder to hit / the impact is not as bad as on PG19. > > Yea. It does apply to 18 too, but I can't come up with realistic scenarios > where it's a real issue. I can repro a slowdown when using many parallel > seqscans with debug_io_direct=data - but that's even slower in 17... > Makes sense. > >> On a related note, my test that generates random datasets / queries, and >> compares index prefetching with different io_method values found a >> pretty massive difference between worker and io_uring. I wonder if this >> might be some issue in io_method=worker. > >> while with index prefetching (with the aio prototype patch), it looks >> like this: >> >> QUERY PLAN >> ---------------------------------------------------------------------- >> Index Scan using idx on t (actual rows=9048576.00 loops=1) >> Index Cond: ((a >= 16150) AND (a <= 4540437)) >> Index Searches: 1 >> Prefetch Distance: 2.032 >> Prefetch Count: 868165 >> Prefetch Stalls: 2140228 >> Prefetch Skips: 6039906 >> Prefetch Resets: 0 >> Stream Ungets: 0 >> Stream Forwarded: 4 >> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 >> Buffers: shared hit=2577599 read=455610 >> Planning: >> Buffers: shared hit=78 read=26 dirtied=1 >> Planning Time: 1.032 ms >> Execution Time: 3150.578 ms >> (16 rows) >> >> So it's about 2x slower. The prefetch distance collapses, because >> there's a lot of cache hits (about 50% of requests seem to be hits of >> already visited blocks). I think that's a problem with how we adjust the >> distance, but I'll post about that separately. >> >> Let's try to simply set io_method=io_uring: >> >> QUERY PLAN >> ---------------------------------------------------------------------- >> Index Scan using idx on t (actual rows=9048576.00 loops=1) >> Index Cond: ((a >= 16150) AND (a <= 4540437)) >> Index Searches: 1 >> Prefetch Distance: 2.032 >> Prefetch Count: 868165 >> Prefetch Stalls: 2140228 >> Prefetch Skips: 6039906 >> Prefetch Resets: 0 >> Stream Ungets: 0 >> Stream Forwarded: 4 >> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 >> Buffers: shared hit=2577599 read=455610 >> Planning: >> Buffers: shared hit=78 read=26 >> Planning Time: 2.212 ms >> Execution Time: 1837.615 ms >> (16 rows) >> >> That's much closer to master (and the difference could be mostly noise). >> >> I'm not sure what's causing this, but almost all regressions my script >> is finding look like this - always io_method=worker, with distance close >> to 2.0. Is this some inherent io_method=worker overhead? > > I think what you might be observing might be the inherent IPC / latency > overhead of the worker based approach. This is particularly pronounced if the > workers are idle (and the CPU they get scheduled on is clocked down). The > latency impact of that is small, but if you never actually get to do much > readahead it can be visible. > Yeah, that's quite possible. If I understand the mechanics of this, this can behave in a rather unexpected way - lowering the load (i.e. issuing fewer I/O requests) can make the workers "more idle" and therefore more likely to get suspended ... Is there a good way to measure if this is what's happening, and the impact? For example, it'd be interesting to know how long it took for a submitted process to get picked up by a worker. And % of time a worker spent handling I/O. regards -- Tomas Vondra