Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ucps9-0006xw-KT for pgsql-hackers@arkaria.postgresql.org; Fri, 18 Jul 2025 18:31:53 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ucps6-00FLyY-LY for pgsql-hackers@arkaria.postgresql.org; Fri, 18 Jul 2025 18:31:51 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ucps6-00FLyP-6x for pgsql-hackers@lists.postgresql.org; Fri, 18 Jul 2025 18:31:50 +0000 Received: from relay3-d.mail.gandi.net ([2001:4b98:dc4:8::223]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1ucps3-00809l-3A for pgsql-hackers@lists.postgresql.org; Fri, 18 Jul 2025 18:31:49 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 0965E1F68A; Fri, 18 Jul 2025 18:31:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1752863503; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=iv0nEyIHNrgiI3tzw9GzG84LNK0jqFqMbF85v95TK+w=; b=MMbaCEAhS9ajlBZKHB70w6lCCmNmjiLlg7v45LWEJ5ujcTL6EKlw6rb2vUm1b1b9V/5IPz BXqktfTlj22jz5juoxT9A2XcdP6LnSE62kuCok6hhSwWA9qXK6muvGERnEgXN9+eWeEIdV BM3K2WrdtNbejOpuiSCKDGo5ZlfFU/Wv0a7vH/BTfXEI0UJa21mF0hlRaGjB5Z6QvNDb3l ybyyxilp6mraE4gNxrFom1itwBT99Wpuk+4jSlPylKS+GJFGy0mnSso8ZRzP9PyoB4Fivw 3Bp1VTg/Clny4URmqWfEBav2M7URnJcv4eJxW+WQ+XYi2hiciYBK4+ISLgyLiA== Content-Type: multipart/mixed; boundary="------------TcRbRqn0jfAQ9M9Kv5INsPdi" Message-ID: Date: Fri, 18 Jul 2025 20:31:37 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching From: Tomas Vondra To: Peter Geoghegan Cc: Andres Freund , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Thomas Munro , Konstantin Knizhnik , Dilip Kumar References: <57d0e292-73d5-4ab9-9855-110ee9cbd90a@vondra.me> <32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me> <64c8b824-6203-46a3-b045-5e95b796feee@vondra.me> <03dcc1a9-c5d0-4965-889c-684dc0a7580c@vondra.me> Content-Language: en-US In-Reply-To: <03dcc1a9-c5d0-4965-889c-684dc0a7580c@vondra.me> X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgdeigedulecutefuodetggdotefrodftvfcurfhrohhfihhlvgemucfitefpfffkpdcuggftfghnshhusghstghrihgsvgenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurheptgfkffggfgfuhffvvehfjgesmhdtreertddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepteehgedugeeigedviedtgeekteeitdehleevffeggffguedvudeuffeluedvveeunecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhnsggprhgtphhtthhopeelpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtoheprghnughrvghssegrnhgrrhgriigvlhdruggvpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhmpdhrtghpthhtohepphhgshhqlhdqhhgrtghkvghrsheslhhishhtshdrphhoshhtghhrvghsqhhlrdhorhhgpdhrtghpthhtohepghhkohhkohhlrghtohhssehprhhot hhonhhmrghilhdrtghomhdprhgtphhtthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehknhhiiihhnhhikhesghgrrhhrvghtrdhruh X-GND-Sasl: tomas@vondra.me List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------TcRbRqn0jfAQ9M9Kv5INsPdi Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Hi, I was wondering why the "simple" approach performs so much worse than the "complex" one on some of the data sets. The theory was that it's due to using read_stream_reset(), which resets the prefetch distance, and so we need to "ramp up" from scratch (distance=1) for every batch. Which for the correlated data sets is very often. So I decided to do some experiments, to see if this is really the case, and maybe see if read_stream_reset() could fix this in some way. First, I added an elog(LOG, "distance %d", stream->distance); at the beginning of read_stream_next_block() to see how the distance changes during the scan. Consider a query returning 2M rows from the "cyclic" table (the attached .sql creates/pupulates it): -- selects 20% rows SELECT * FROM cyclic WHERE a BETWEEN 0 AND 20000; With the "complex" patch, the CDF of the distance looks like this: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 25 | 0 | | 50 | 0 | | 75 | 0 | | 100 | 0 | | 125 | 0 | | 150 | 0 | | 175 | 0 | | 200 | 0 | | 225 | 0 | | 250 | 0 | | 275 | 99 | | 300 | 99 | +----------+-----+ That is, 99% of the distances is in the range [275, 300]. Note: This is much higher than the effective_io_concurrency value (16), which may be surprising. But the ReadStream uses that to limit the number of I/O requests, not as a limit of how far to look ahead. A lot of the blocks are in the cache, so it looks far ahead. But with the "simple" patch it looks like this: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 25 | 99 | | 50 | 99 | | 75 | 99 | | 100 | 99 | | 125 | 99 | | 150 | 99 | | 175 | 99 | | 200 | 99 | | 225 | 99 | | 250 | 99 | | 275 | 100 | | 300 | 100 | +----------+-----+ So 99% of the distances is in [0, 25]. A more detailed view on the first couple distances: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 1 | 99 | | 2 | 99 | | 3 | 99 | | 4 | 99 | ... So 99% of distances is 1. Well, that's not very far, it effectively means no prefetching (We still issue the fadvise, though, although a comment in read_stream.c suggests we won't. Possible bug?). This means *there's no ramp-up at all*. On the first leaf the distance grows to ~270, but after the stream gets reset it stays at 1 and never increases. That's ... not great? I'm not entirely sure I decided to hack the ReadStream a bit, so that it restores the last non-zero distance seen (i.e. right before reaching end of the stream). And with that I got this: +----------+-----+ | distance | pct | +----------+-----+ | 0 | 0 | | 25 | 38 | | 50 | 38 | | 75 | 38 | | 100 | 39 | | 125 | 42 | | 150 | 47 | | 175 | 47 | | 200 | 48 | | 225 | 49 | | 250 | 50 | | 275 | 100 | | 300 | 100 | +----------+-----+ Not as good as the "complex" patch, but much better than the original. And the performance got almost the same (for this one query). Perhaps the ReadStream should do something like this? Of course, the simple patch resets the stream very often, likely mcuh more often than anything else in the code. But wouldn't it be beneficial for streams reset because of a rescan? Possibly needs to be optional. regards -- Tomas Vondra --------------TcRbRqn0jfAQ9M9Kv5INsPdi Content-Type: text/x-patch; charset=UTF-8; name="0001-read_stream-restore-prefetch-distance-after-reset.patch" Content-Disposition: attachment; filename*0="0001-read_stream-restore-prefetch-distance-after-reset.patch" Content-Transfer-Encoding: base64 RnJvbSA4MTdkNWE3ZjdkMTlkMGQ5ZmU3NWJhMWJiMDBhODE2NGE0MzgwMjFhIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBUb21hcyBWb25kcmEgPHRvbWFzQHZvbmRyYS5tZT4K RGF0ZTogRnJpLCAxOCBKdWwgMjAyNSAxNjo0MTowMyArMDIwMApTdWJqZWN0OiBbUEFUQ0hd IHJlYWRfc3RyZWFtOiByZXN0b3JlIHByZWZldGNoIGRpc3RhbmNlIGFmdGVyIHJlc2V0Cgot LS0KIHNyYy9iYWNrZW5kL3N0b3JhZ2UvYWlvL3JlYWRfc3RyZWFtLmMgfCAxNCArKysrKysr KysrKystLQogMSBmaWxlIGNoYW5nZWQsIDEyIGluc2VydGlvbnMoKyksIDIgZGVsZXRpb25z KC0pCgpkaWZmIC0tZ2l0IGEvc3JjL2JhY2tlbmQvc3RvcmFnZS9haW8vcmVhZF9zdHJlYW0u YyBiL3NyYy9iYWNrZW5kL3N0b3JhZ2UvYWlvL3JlYWRfc3RyZWFtLmMKaW5kZXggMGU3ZjU1 NTdmNWMuLjc5Njc5MWY4NDI3IDEwMDY0NAotLS0gYS9zcmMvYmFja2VuZC9zdG9yYWdlL2Fp by9yZWFkX3N0cmVhbS5jCisrKyBiL3NyYy9iYWNrZW5kL3N0b3JhZ2UvYWlvL3JlYWRfc3Ry ZWFtLmMKQEAgLTk5LDYgKzk5LDcgQEAgc3RydWN0IFJlYWRTdHJlYW0KIAlpbnQxNgkJZm9y d2FyZGVkX2J1ZmZlcnM7CiAJaW50MTYJCXBpbm5lZF9idWZmZXJzOwogCWludDE2CQlkaXN0 YW5jZTsKKwlpbnQxNgkJZGlzdGFuY2Vfb2xkOwogCWludDE2CQlpbml0aWFsaXplZF9idWZm ZXJzOwogCWludAkJCXJlYWRfYnVmZmVyc19mbGFnczsKIAlib29sCQlzeW5jX21vZGU7CQkv KiB1c2luZyBpb19tZXRob2Q9c3luYyAqLwpAQCAtNDQzLDYgKzQ0NCw3IEBAIHJlYWRfc3Ry ZWFtX2xvb2tfYWhlYWQoUmVhZFN0cmVhbSAqc3RyZWFtKQogCQlpZiAoYmxvY2tudW0gPT0g SW52YWxpZEJsb2NrTnVtYmVyKQogCQl7CiAJCQkvKiBFbmQgb2Ygc3RyZWFtLiAqLworCQkJ c3RyZWFtLT5kaXN0YW5jZV9vbGQgPSBzdHJlYW0tPmRpc3RhbmNlOwogCQkJc3RyZWFtLT5k aXN0YW5jZSA9IDA7CiAJCQlicmVhazsKIAkJfQpAQCAtODQxLDYgKzg0Myw3IEBAIHJlYWRf c3RyZWFtX25leHRfYnVmZmVyKFJlYWRTdHJlYW0gKnN0cmVhbSwgdm9pZCAqKnBlcl9idWZm ZXJfZGF0YSkKIAkJZWxzZQogCQl7CiAJCQkvKiBObyBtb3JlIGJsb2NrcywgZW5kIG9mIHN0 cmVhbS4gKi8KKwkJCXN0cmVhbS0+ZGlzdGFuY2Vfb2xkID0gc3RyZWFtLT5kaXN0YW5jZTsK IAkJCXN0cmVhbS0+ZGlzdGFuY2UgPSAwOwogCQkJc3RyZWFtLT5vbGRlc3RfYnVmZmVyX2lu ZGV4ID0gc3RyZWFtLT5uZXh0X2J1ZmZlcl9pbmRleDsKIAkJCXN0cmVhbS0+cGlubmVkX2J1 ZmZlcnMgPSAwOwpAQCAtMTAxMiw2ICsxMDE1LDkgQEAgcmVhZF9zdHJlYW1fcmVzZXQoUmVh ZFN0cmVhbSAqc3RyZWFtKQogCWludDE2CQlpbmRleDsKIAlCdWZmZXIJCWJ1ZmZlcjsKIAor CS8qIHJlbWVtYmVyIHRoZSBvbGQgZGlzdGFuY2UgKGlmIHdlIHJlc2V0IGJlZm9yZSBlbmQg b2YgdGhlIHN0cmVhbSkgKi8KKwlzdHJlYW0tPmRpc3RhbmNlX29sZCA9IE1heChzdHJlYW0t PmRpc3RhbmNlLCBzdHJlYW0tPmRpc3RhbmNlX29sZCk7CisKIAkvKiBTdG9wIGxvb2tpbmcg YWhlYWQuICovCiAJc3RyZWFtLT5kaXN0YW5jZSA9IDA7CiAKQEAgLTEwNDQsOCArMTA1MCwx MiBAQCByZWFkX3N0cmVhbV9yZXNldChSZWFkU3RyZWFtICpzdHJlYW0pCiAJQXNzZXJ0KHN0 cmVhbS0+cGlubmVkX2J1ZmZlcnMgPT0gMCk7CiAJQXNzZXJ0KHN0cmVhbS0+aW9zX2luX3By b2dyZXNzID09IDApOwogCi0JLyogU3RhcnQgb2ZmIGFzc3VtaW5nIGRhdGEgaXMgY2FjaGVk LiAqLwotCXN0cmVhbS0+ZGlzdGFuY2UgPSAxOworCS8qCisJICogUmVzdG9yZSB0aGUgb2xk IGRpc3RhbmNlLCBpZiB3ZSBoYXZlIG9uZS4gT3RoZXJ3aXNlIHN0YXJ0IGFzc3VtaW5nCisJ ICogZGF0YSBpcyBjYWNoZWQuCisJICovCisJc3RyZWFtLT5kaXN0YW5jZSA9IE1heCgxLCBz dHJlYW0tPmRpc3RhbmNlX29sZCk7CisJc3RyZWFtLT5kaXN0YW5jZV9vbGQgPSAwOwogfQog CiAvKgotLSAKMi41MC4xCgo= --------------TcRbRqn0jfAQ9M9Kv5INsPdi Content-Type: application/sql; name="cyclic.sql" Content-Disposition: attachment; filename="cyclic.sql" Content-Transfer-Encoding: base64 Y3JlYXRlIHRhYmxlIGN5Y2xpYyAoYSBkb3VibGUgcHJlY2lzaW9uLCBiIHRleHQpIHdpdGgg KGZpbGxmYWN0b3I9MjUpOwoKd2l0aCB4IGFzIChzZWxlY3QgMTAwMDAwICogKG1vZChpLCgx MDAwMDAwMC8xMDApKTo6ZG91YmxlIHByZWNpc2lvbiAvICgxMDAwMDAwMC8xMDApKSBBUyB4 IGZyb20gZ2VuZXJhdGVfc2VyaWVzKDEsMTAwMDAwMDApIHMoaSkpCmluc2VydCBpbnRvIGN5 Y2xpYyBzZWxlY3QgeCwgc2hhMjU2KHg6OnRleHQ6OmJ5dGVhKSBmcm9tIHg7Cgp2YWN1dW0g ZnJlZXplIGN5Y2xpYzsKCmNyZWF0ZSBpbmRleCBvbiBjeWNsaWMgKGEpOwo= --------------TcRbRqn0jfAQ9M9Kv5INsPdi--