Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ueJwD-007oZO-4O for pgsql-hackers@arkaria.postgresql.org; Tue, 22 Jul 2025 20:50:14 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ueJwB-00DXHn-S3 for pgsql-hackers@arkaria.postgresql.org; Tue, 22 Jul 2025 20:50:12 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ueJwB-00DXHf-94 for pgsql-hackers@lists.postgresql.org; Tue, 22 Jul 2025 20:50:11 +0000 Received: from relay16.mail.gandi.net ([2001:4b98:dc4:8::236]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1ueJw7-000I0d-2K for pgsql-hackers@lists.postgresql.org; Tue, 22 Jul 2025 20:50:11 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 7386044A4A; Tue, 22 Jul 2025 20:50:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1753217407; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=iibnA+sCZjGcbDipPcY/qC88QxYVsuadeHQQO5WAn5Y=; b=pyEF5LrMMAq1szv3ifc+5o1a482C2Autk9db7jeMCW0dxZLv9Edpnog1/rczrUwlTLsd0L nXSZDmA8O3QRVtZb2FnOLRycYv2j2BGgTRGGUwu8YPIAWS1XullCwWBkL9FU9hVSky4Wrj 5IqIubzWCUO0KfhW/mkHrM+Tj5jcoXZ4TSm6SnhcwOjVj0N7uRfbpMooD+nYwKNexo5NVW 7kkp0U4PMDM5dxEe6nxyiq+dYan+ZniRR7Poo1aNIkvbPyFNj4iyTRj+DWyLwBBz3BZV6M Unvk3cSxrs/n9YmA9X9oZyMJrJ3ZN9Irvb4nb7pnFMKKjoIoHpOEjS+9rcbGOw== Content-Type: multipart/mixed; boundary="------------NLGbl4p8e2a7300vLIUkmAWy" Message-ID: <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me> Date: Tue, 22 Jul 2025 22:50:00 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Peter Geoghegan Cc: Nazir Bilal Yavuz , Thomas Munro , Andres Freund , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me> <64c8b824-6203-46a3-b045-5e95b796feee@vondra.me> <03dcc1a9-c5d0-4965-889c-684dc0a7580c@vondra.me> <23f490f4-8325-408c-91a0-a6757ab2441c@vondra.me> Content-Language: en-US From: Tomas Vondra In-Reply-To: X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdejheeltdcutefuodetggdotefrodftvfcurfhrohhfihhlvgemucfitefpfffkpdcuggftfghnshhusghstghrihgsvgenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurheptgfkffggfgfuvfevfhfhjgesmhdtreertddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhephfdtieekkeduffehfeevhfdugeegudeuhfdufeegleduhfegkeejudffvdfhkeeunecuffhomhgrihhnpehgihhthhhusgdrtghomhenucfkphepkeeirdegledrvdeftddrvddtieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeekiedrgeelrddvfedtrddvtdeipdhhvghloheplgdutddrudefjedrtddrvdgnpdhmrghilhhfrhhomhepthhomhgrshesvhhonhgurhgrrdhmvgdpnhgspghrtghpthhtohepuddtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepsgihrghvuhiikedusehgmhgrihhlrdgtohhmpdhrtghpthhtohepthhhohhmrghsrdhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtoheprghnughrvghssegrnhgrrhgriigvlhdruggvpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmr ghnsehgmhgrihhlrdgtohhmpdhrtghpthhtohepphhgshhqlhdqhhgrtghkvghrsheslhhishhtshdrphhoshhtghhrvghsqhhlrdhorhhgpdhrtghpthhtohepghhkohhkohhlrghtohhssehprhhothhonhhmrghilhdrtghomh List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------NLGbl4p8e2a7300vLIUkmAWy Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 7/22/25 19:35, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 9:06 AM Tomas Vondra wrote: >> Real workloads are likely to have multiple misses in a row, which indeed >> ramps up the distance quickly. So maybe it's not that bad. Could we >> track a longer history of hits/misses, and consider that when adjusting >> the distance? Not just the most recent hit/miss? > > +1 > >> FWIW I re-ran the index-prefetch-test benchmarks with restoring the >> distance for the "simple" patch. The results are in the same github >> repository, in a separate branch: >> >> https://github.com/tvondra/indexscan-prefetch-tests/tree/with-distance-restore-after-reset > > These results make way more sense. There was absolutely no reason why > the "simple" patch should have done so much worse than the "complex" > one for most of the tests you've been running. > > Obviously, whatever advantage that the "complex" patch has is bound to > be limited to cases where index characteristics are naturally the > limiting factor. For example, with the pgbench_accounts_pkey table > there are only ever 6 distinct heap blocks on each leaf page. I bet > that your "linear" test more or less looks like that, too. > Yes. It's definitely true we could construct examples where the complex patch beats the simple one for this reason. And I believe some of those examples could be quite realistic, even if not very common (like when very few index tuples fit on a leaf page). However, I'm not sure the pgbench example with only 6 heap blocks per leaf is very significant. Sure, the simple patch can't prefetch TIDs from the following leaf, but AFAICS the complex patch won't do that either. Not because it couldn't, but because with that many hits the distance will drop to ~1 (or close to it). (It'll probably prefetch a couple TIDs from the next leaf at the very end of the page, but I don't think that matters overall.) I'm not sure what prefetch distances will be sensible in queries that do other stuff. The queries in the benchmark do just the index scan, but if the query does something with the tuple (in the nodes on top), that shortens the required prefetch distance. Of course, simple queries will benefit from prefetching far ahead. > I attach pgbench_accounts_pkey_nhblks.txt, which shows a query that > (among other things) ouputs "nhblks" for each leaf page from a given > index (while showing the details of each leaf page in index key space > order). It also shows results for pgbench_accounts_pkey with pgbench > scale 1. This is how I determined that every pgbench_accounts_pkey > leaf page points to exactly 6 distinct heap blocks -- "nhblks" is > always 6. Note that this is what I see regardless of the pgbench > scale, indicating that things always perfectly line up (even more than > I would expect for very synthetic data such as this). > Thanks. I wonder how difficult would it be to add something like this to pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and count distinct blocks, right? Seems quite useful. Explain would also greatly benefit from tracking something like this. The buffer "hits" and "reads" can be very difficult to interpret. > This query is unwieldy when run against larger indexes, but that > shouldn't be necessary. As with pgbench_accounts_pkey, it's typical > for synthetically generated data to have a very consistent "nhblks", > regardless of the total amount of data. > > With your "uniform" test cases, I'd expect this query to show "nhtids > == nhblks" (or very close to it), which of course makes our ability to > eagerly read further leaf pages almost irrelevant. If there are > hundreds of distinct heap blocks on each leaf page, but > effective_io_concurrency is 16 (or even 64), there's little we can do > about it. > Right. >> I'm attaching two PDFs with results for eic=16 (linear and log-scale, to >> compare timings for quick queries). This shows that with restoring >> distance after reset, the simple patch is pretty much the same as the >> complex patch. >> >> The only data set where that's not the case is the "linear" data set, >> when everything is perfectly sequential. In this case the simple patch >> performs like "master" (i.e. no prefetching). I'm not sure why is that. > > Did you restore the distance for the "complex" patch, too? I think > that it might well matter there too. > No, I did not. I did consider it, but it seemed to me it can't really make a difference (for these data sets), because each leaf has ~300 items, and the patch limits the prefetch to 64 leafs. That means it can prefetch ~20k TIDs ahead, and each heap page has ~20 items. So this should be good enough for eic=1000. It should never hit stream reset. It'd be useful to show some prefetch info in explain, I guess. It should not be difficult to track how many times was the stream reset, the average prefetch distance, and perhaps even a histogram of distances. The simple patch tracks the average distance, at least. > Isn't the obvious explanation that the complex patch benefits from > being able to prefetch without being limited by index > characteristics/leaf page boundaries, while the simple patch doesn't? > That's a valid interpretation, yes. Although the benefit comes mostly >> Anyway, it seems to confirm most of the differences between the two >> patches is due to the "distance collapse". The impact of the resets in >> the first benchmarks surprised me quite a bit, but if we don't ramp up >> the distance that makes perfect sense. >> >> The issue probably affects other queries that do a lot of resets. Index >> scan prefetching just makes it very obvious. > > What is the difference between cases like "linear / eic=16 / sync" and > "linear_1 / eic=16 / sync"? > > One would imagine that these tests are very similar, based on the fact > that they have very similar names. But we see very different results > for each: with the former ("linear") test results, the "complex" patch > is 2x-4x faster than the "simple" patch. But, with the latter test > results ("linear_1", and other similar pairs of "linear_N" tests) the > advantage for the "complex" patch *completely* evaporates. I find that > very suspicious, and wonder if it might be due to a bug/inefficiency > that could easily be fixed (possibly an issue on the read stream side, > like the one you mentioned to Nazir just now). > Yes, there's some similarity. Attached is the script I use to create the tables and load the data. The "linear" is a table with a simple sequence of values (0 to 100k). More or less - the value is a floating point, and there are 10M rows. But you get the idea. The "linear_X" variants mean the value has a noise of X% of the range. So with "linear_1" you get the "linear" value, and then random(0,1000), with normal distribution. The "cyclic" data sets are similar, except that the "sequence" also wraps around 100x. There's nothing "special" about the particular values. I simply wanted different "levels" of noise, and 1, 10 and 25 seemed good. I initially had a couple higher values, but that was pretty close to "uniform". regards -- Tomas Vondra --------------NLGbl4p8e2a7300vLIUkmAWy Content-Type: application/sql; name="create-tables.sql" Content-Disposition: attachment; filename="create-tables.sql" Content-Transfer-Encoding: base64 Y3JlYXRlIHRhYmxlIHVuaWZvcm0gKGEgZG91YmxlIHByZWNpc2lvbiwgYiB0ZXh0KSB3aXRo IChmaWxsZmFjdG9yPTI1KTsKCgpjcmVhdGUgdGFibGUgbGluZWFyIChhIGRvdWJsZSBwcmVj aXNpb24sIGIgdGV4dCkgd2l0aCAoZmlsbGZhY3Rvcj0yNSk7CgpjcmVhdGUgdGFibGUgbGlu ZWFyXzEgKGEgZG91YmxlIHByZWNpc2lvbiwgYiB0ZXh0KSB3aXRoIChmaWxsZmFjdG9yPTI1 KTsKCmNyZWF0ZSB0YWJsZSBsaW5lYXJfMTAgKGEgZG91YmxlIHByZWNpc2lvbiwgYiB0ZXh0 KSB3aXRoIChmaWxsZmFjdG9yPTI1KTsKCmNyZWF0ZSB0YWJsZSBsaW5lYXJfMjUgKGEgZG91 YmxlIHByZWNpc2lvbiwgYiB0ZXh0KSB3aXRoIChmaWxsZmFjdG9yPTI1KTsKCgpjcmVhdGUg dGFibGUgY3ljbGljIChhIGRvdWJsZSBwcmVjaXNpb24sIGIgdGV4dCkgd2l0aCAoZmlsbGZh Y3Rvcj0yNSk7CgpjcmVhdGUgdGFibGUgY3ljbGljXzEgKGEgZG91YmxlIHByZWNpc2lvbiwg YiB0ZXh0KSB3aXRoIChmaWxsZmFjdG9yPTI1KTsKCmNyZWF0ZSB0YWJsZSBjeWNsaWNfMTAg KGEgZG91YmxlIHByZWNpc2lvbiwgYiB0ZXh0KSB3aXRoIChmaWxsZmFjdG9yPTI1KTsKCmNy ZWF0ZSB0YWJsZSBjeWNsaWNfMjUgKGEgZG91YmxlIHByZWNpc2lvbiwgYiB0ZXh0KSB3aXRo IChmaWxsZmFjdG9yPTI1KTsKCgp3aXRoIHggYXMgKHNlbGVjdCAxMDAwMDAgKiByYW5kb20o KSBBUyB4IGZyb20gZ2VuZXJhdGVfc2VyaWVzKDEsMTAwMDAwMDApIHMoaSkpCmluc2VydCBp bnRvIHVuaWZvcm0gc2VsZWN0IHgsIHNoYTI1Nih4Ojp0ZXh0OjpieXRlYSkgZnJvbSB4OwoK CndpdGggeCBhcyAoc2VsZWN0IDEwMDAwMCAqIChpOjpkb3VibGUgcHJlY2lzaW9uIC8gMTAw MDAwMDApIEFTIHggZnJvbSBnZW5lcmF0ZV9zZXJpZXMoMSwxMDAwMDAwMCkgcyhpKSkKaW5z ZXJ0IGludG8gbGluZWFyIHNlbGVjdCB4LCBzaGEyNTYoeDo6dGV4dDo6Ynl0ZWEpIGZyb20g eDsKCndpdGggeCBhcyAoc2VsZWN0IDEwMDAwMCAqIChpOjpkb3VibGUgcHJlY2lzaW9uIC8g MTAwMDAwMDApIEFTIHggZnJvbSBnZW5lcmF0ZV9zZXJpZXMoMSwxMDAwMDAwMCkgcyhpKSkK aW5zZXJ0IGludG8gbGluZWFyXzEgc2VsZWN0IHggKyByYW5kb21fbm9ybWFsKDAsIDEwMDAw MC8xMDApLCBzaGEyNTYoeDo6dGV4dDo6Ynl0ZWEpIGZyb20geDsKCndpdGggeCBhcyAoc2Vs ZWN0IDEwMDAwMCAqIChpOjpkb3VibGUgcHJlY2lzaW9uIC8gMTAwMDAwMDApIEFTIHggZnJv bSBnZW5lcmF0ZV9zZXJpZXMoMSwxMDAwMDAwMCkgcyhpKSkKaW5zZXJ0IGludG8gbGluZWFy XzEwIHNlbGVjdCB4ICsgcmFuZG9tX25vcm1hbCgwLCAxMCoxMDAwMDAvMTAwKSwgc2hhMjU2 KHg6OnRleHQ6OmJ5dGVhKSBmcm9tIHg7Cgp3aXRoIHggYXMgKHNlbGVjdCAxMDAwMDAgKiAo aTo6ZG91YmxlIHByZWNpc2lvbiAvIDEwMDAwMDAwKSBBUyB4IGZyb20gZ2VuZXJhdGVfc2Vy aWVzKDEsMTAwMDAwMDApIHMoaSkpCmluc2VydCBpbnRvIGxpbmVhcl8yNSBzZWxlY3QgeCAr IHJhbmRvbV9ub3JtYWwoMCwgMjUqMTAwMDAwLzEwMCksIHNoYTI1Nih4Ojp0ZXh0OjpieXRl YSkgZnJvbSB4OwoKCndpdGggeCBhcyAoc2VsZWN0IDEwMDAwMCAqIChtb2QoaSwoMTAwMDAw MDAvMTAwKSk6OmRvdWJsZSBwcmVjaXNpb24gLyAoMTAwMDAwMDAvMTAwKSkgQVMgeCBmcm9t IGdlbmVyYXRlX3NlcmllcygxLDEwMDAwMDAwKSBzKGkpKQppbnNlcnQgaW50byBjeWNsaWMg c2VsZWN0IHgsIHNoYTI1Nih4Ojp0ZXh0OjpieXRlYSkgZnJvbSB4OwoKd2l0aCB4IGFzIChz ZWxlY3QgMTAwMDAwICogKG1vZChpLCgxMDAwMDAwMC8xMDApKTo6ZG91YmxlIHByZWNpc2lv biAvICgxMDAwMDAwMC8xMDApKSBBUyB4IGZyb20gZ2VuZXJhdGVfc2VyaWVzKDEsMTAwMDAw MDApIHMoaSkpCmluc2VydCBpbnRvIGN5Y2xpY18xIHNlbGVjdCB4ICsgcmFuZG9tX25vcm1h bCgwLCAxKjEwMDAwMC8xMDApLCBzaGEyNTYoeDo6dGV4dDo6Ynl0ZWEpIGZyb20geDsKCndp dGggeCBhcyAoc2VsZWN0IDEwMDAwMCAqIChtb2QoaSwoMTAwMDAwMDAvMTAwKSk6OmRvdWJs ZSBwcmVjaXNpb24gLyAoMTAwMDAwMDAvMTAwKSkgQVMgeCBmcm9tIGdlbmVyYXRlX3Nlcmll cygxLDEwMDAwMDAwKSBzKGkpKQppbnNlcnQgaW50byBjeWNsaWNfMTAgc2VsZWN0IHggKyBy YW5kb21fbm9ybWFsKDAsIDEwKjEwMDAwMC8xMDApLCBzaGEyNTYoeDo6dGV4dDo6Ynl0ZWEp IGZyb20geDsKCndpdGggeCBhcyAoc2VsZWN0IDEwMDAwMCAqIChtb2QoaSwoMTAwMDAwMDAv MTAwKSk6OmRvdWJsZSBwcmVjaXNpb24gLyAoMTAwMDAwMDAvMTAwKSkgQVMgeCBmcm9tIGdl bmVyYXRlX3NlcmllcygxLDEwMDAwMDAwKSBzKGkpKQppbnNlcnQgaW50byBjeWNsaWNfMjUg c2VsZWN0IHggKyByYW5kb21fbm9ybWFsKDAsIDI1KjEwMDAwMC8xMDApLCBzaGEyNTYoeDo6 dGV4dDo6Ynl0ZWEpIGZyb20geDsKCgp2YWN1dW0gZnJlZXplOwoKCmNyZWF0ZSBpbmRleCBv biB1bmlmb3JtIChhKTsKCmNyZWF0ZSBpbmRleCBvbiBsaW5lYXIgKGEpOwoKY3JlYXRlIGlu ZGV4IG9uIGxpbmVhcl8xIChhKTsKCmNyZWF0ZSBpbmRleCBvbiBsaW5lYXJfMTAgKGEpOwoK Y3JlYXRlIGluZGV4IG9uIGxpbmVhcl8yNSAoYSk7CgoKY3JlYXRlIGluZGV4IG9uIGN5Y2xp YyAoYSk7CgpjcmVhdGUgaW5kZXggb24gY3ljbGljXzEgKGEpOwoKY3JlYXRlIGluZGV4IG9u IGN5Y2xpY18xMCAoYSk7CgpjcmVhdGUgaW5kZXggb24gY3ljbGljXzI1IChhKTsKCnZhY3V1 bSBhbmFseXplOwoKY2hlY2twb2ludDsK --------------NLGbl4p8e2a7300vLIUkmAWy--