Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uly8n-007yLX-Ul for pgsql-hackers@arkaria.postgresql.org; Tue, 12 Aug 2025 23:10:50 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uly8m-00ANOB-DE for pgsql-hackers@arkaria.postgresql.org; Tue, 12 Aug 2025 23:10:48 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uly8l-00ANO2-NB for pgsql-hackers@lists.postgresql.org; Tue, 12 Aug 2025 23:10:48 +0000 Received: from relay9-d.mail.gandi.net ([217.70.183.199]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1uly8i-000GJG-2w for pgsql-hackers@lists.postgresql.org; Tue, 12 Aug 2025 23:10:46 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 3F6CE431F5; Tue, 12 Aug 2025 23:10:38 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1755040242; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rR42KGkxnctH/APpIqTzYo1O4RCnhIvUfLZZ2r9iF1c=; b=FHoRncSSO/pPWhYFun/z7zZbDdhuHHrEmxxKfN2dlM0MCtwFU9BM1Oqq+izEtW28jxwZxt pr0wmZN9hCgz9I89s2fSS1zNLB1NgVV/lrBBGTnSDDlIEqXWO7RDIk89ZZQIFHHRFI6/6Y fsj3cZsEa43t6Smc+IQ0fmoFzJyMVwh2FnS8zc3uzXaoKfw+vJpJzstO7fwEwx65zqtlww XyQLhCZrEt1/UzXo2l9IQ2//eBaZgEAT9ErkYXeLbRK9qTq6cH/FNpT3DNfqfAH9APCGRB p3h3hq2+XEZ44dtveNOZgDtUV2dy0KfMfu2HGQY7FWi85uwW/Hr+BBRtwnknrQ== Message-ID: Date: Wed, 13 Aug 2025 01:10:36 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching From: Tomas Vondra To: Peter Geoghegan , Thomas Munro Cc: Andres Freund , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <8ed1d326-5c6e-476e-b3fd-30d3da210546@vondra.me> <306fc8c0-c882-4602-86f5-a106b9ace603@vondra.me> <152ea782-5bd4-4435-b021-0ab2da61e63d@vondra.me> <7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me> <2e63cadd-2a03-46b1-866e-7ea5d3ffd37f@vondra.me> <51b5f71b-5f19-4453-91ff-2b9f2a840c58@vondra.me> Content-Language: en-US In-Reply-To: <51b5f71b-5f19-4453-91ff-2b9f2a840c58@vondra.me> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgddufeeiieduucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuhffvvehfjggtgfesthejredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhephfeggfeljeevgfejteeuvdfhhfduteevlefguefgudelvdffjeelgeejhfetffeknecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhnsggprhgtphhtthhopedutddprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopegrnhgurhgvshesrghnrghrrgiivghlrdguvgdprhgtphhtthhopegshigrvhhuiiekudesghhmrghilhdrtghomhdprhgtphhtthhopehrohgsvghrthhmhhgrrghssehgmhgrihhlrdgtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghilhdrtghomhdprhgtphhtt hhopehpghhsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrghdprhgtphhtthhopehgkhhokhholhgrthhoshesphhrohhtohhnmhgrihhlrdgtohhm X-GND-Sasl: tomas@vondra.me List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 8/12/25 23:52, Tomas Vondra wrote: > > On 8/12/25 23:22, Peter Geoghegan wrote: >> ... >> >> It looks like the patch does significantly better with the forwards scan, >> compared to the backwards scan (though both are improved by a lot). But that's >> not the main thing about these results that I find interesting. >> >> The really odd thing is that we get "shared hit=6619 read=49933" for the >> forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The >> latter matches master (regardless of the scan direction used on master), while >> the former just looks wrong. What explains the "missing buffer hits" seen with >> the forwards scan? >> >> Discrepancies >> ------------- >> >> All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear >> to simply be broken/giving wrong answers. Might it be that the "Buffers" >> instrumentation is broken? >> > > I think a bug in the prefetch patch is more likely. I tried with a patch > that adds various prefetch-related counters to explain, and I see this: > > > test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a > BETWEEN 16336 AND 49103 ORDER BY a; > > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan using idx on public.t (actual time=0.682..527.055 > rows=1048576.00 loops=1) > Output: a, b > Index Cond: ((t.a >= 16336) AND (t.a <= 49103)) > Index Searches: 1 > Prefetch Distance: 271.263 > Prefetch Count: 60888 > Prefetch Stalls: 1 > Prefetch Skips: 991211 > Prefetch Resets: 3 > Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17, [16,32) => > 24, [32,64) => 34, [64,128) => 52, [128,256) => 82, [256,512) => 60669 > Buffers: shared hit=5027 read=50872 > I/O Timings: shared read=33.528 > Planning: > Buffers: shared hit=78 read=23 > I/O Timings: shared read=2.349 > Planning Time: 3.686 ms > Execution Time: 559.659 ms > (17 rows) > > > test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a > BETWEEN 16336 AND 49103 ORDER BY a DESC; > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan Backward using idx on public.t (actual time=1.110..4116.201 > rows=1048576.00 loops=1) > Output: a, b > Index Cond: ((t.a >= 16336) AND (t.a <= 49103)) > Index Searches: 1 > Prefetch Distance: 271.061 > Prefetch Count: 118806 > Prefetch Stalls: 1 > Prefetch Skips: 962515 > Prefetch Resets: 3 > Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12, [16,32) => > 17, [32,64) => 24, [64,128) => 3, [128,256) => 4, [256,512) => 118737 > Buffers: shared hit=30024 read=50872 > I/O Timings: shared read=581.353 > Planning: > Buffers: shared hit=82 read=23 > I/O Timings: shared read=3.168 > Planning Time: 4.289 ms > Execution Time: 4185.407 ms > (17 rows) > > These two parts are interesting: > > Prefetch Count: 60888 > Prefetch Skips: 991211 > > Prefetch Count: 118806 > Prefetch Skips: 962515 > > It looks like the backwards scan skips fewer blocks. This is based on > the lastBlock optimization, i.e. looking for runs of the same block > number. I don't quite see why would it affect just the backwards scan, > though. Seems weird. > Actually, this might be a consequence of how backwards scans work (at least in btree). I logged the block in index_scan_stream_read_next, and this is what I see in the forward scan (at the beginning): index_scan_stream_read_next: block 24891 index_scan_stream_read_next: block 24892 index_scan_stream_read_next: block 24893 index_scan_stream_read_next: block 24892 index_scan_stream_read_next: block 24893 index_scan_stream_read_next: block 24894 index_scan_stream_read_next: block 24895 index_scan_stream_read_next: block 24896 index_scan_stream_read_next: block 24895 index_scan_stream_read_next: block 24896 index_scan_stream_read_next: block 24897 index_scan_stream_read_next: block 24898 index_scan_stream_read_next: block 24899 index_scan_stream_read_next: block 24900 index_scan_stream_read_next: block 24901 index_scan_stream_read_next: block 24902 index_scan_stream_read_next: block 24903 index_scan_stream_read_next: block 24904 index_scan_stream_read_next: block 24905 index_scan_stream_read_next: block 24906 index_scan_stream_read_next: block 24907 index_scan_stream_read_next: block 24908 index_scan_stream_read_next: block 24909 index_scan_stream_read_next: block 24910 while in the backwards scan (at the end) I see this index_scan_stream_read_next: block 24910 index_scan_stream_read_next: block 24911 index_scan_stream_read_next: block 24908 index_scan_stream_read_next: block 24909 index_scan_stream_read_next: block 24906 index_scan_stream_read_next: block 24907 index_scan_stream_read_next: block 24908 index_scan_stream_read_next: block 24905 index_scan_stream_read_next: block 24906 index_scan_stream_read_next: block 24903 index_scan_stream_read_next: block 24904 index_scan_stream_read_next: block 24905 index_scan_stream_read_next: block 24902 index_scan_stream_read_next: block 24903 index_scan_stream_read_next: block 24900 index_scan_stream_read_next: block 24901 index_scan_stream_read_next: block 24902 index_scan_stream_read_next: block 24899 index_scan_stream_read_next: block 24900 index_scan_stream_read_next: block 24897 index_scan_stream_read_next: block 24898 index_scan_stream_read_next: block 24899 index_scan_stream_read_next: block 24895 index_scan_stream_read_next: block 24896 index_scan_stream_read_next: block 24897 index_scan_stream_read_next: block 24894 index_scan_stream_read_next: block 24895 index_scan_stream_read_next: block 24896 index_scan_stream_read_next: block 24892 index_scan_stream_read_next: block 24893 index_scan_stream_read_next: block 24894 index_scan_stream_read_next: block 24891 index_scan_stream_read_next: block 24892 index_scan_stream_read_next: block 24893 These are only the blocks that ended up passes to the read stream, not the skipped ones. And you can immediately see the backward scan requests more blocks for (roughly) the same part of the scan - the min/max block roughly match. The reason is pretty simple - the table is very correlated, and the forward scan requests blocks mostly in the right order. Only rarely it has to jump "back" when progressing to the next value, and so the lastBlock optimization works nicely. But with the backwards scan we apparently scan the values backwards, but then the blocks for each value are accessed in forward direction. So we do a couple blocks "forward" and then jump to the preceding value - but that's a couple blocks *back*. And that breaks the lastBlock check. I believe this applies both to master and the prefetching, except that master doesn't have read stream - so it only does sync I/O. Could that hide the extra buffer accesses, somehow? Anyway, this access pattern in backwards scans seems a bit unfortunate. regards -- Tomas Vondra