Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umAOX-00Ax9W-R3 for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Aug 2025 12:15:54 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1umAOW-00F3Fz-AW for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Aug 2025 12:15:52 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umAOV-00F3Fg-Qg for pgsql-hackers@lists.postgresql.org; Wed, 13 Aug 2025 12:15:52 +0000 Received: from relay15.mail.gandi.net ([2001:4b98:dc4:8::235]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1umAOS-000MOY-2K for pgsql-hackers@lists.postgresql.org; Wed, 13 Aug 2025 12:15:50 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 8A2F044292; Wed, 13 Aug 2025 12:15:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1755087343; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=pSyWH5mO2cUXEbcZw8rtbaWy/mLvePveQwvbpozeJg4=; b=Y9tHkLPU5x0oSIzmU/leo693QQAnnJuLwY91LcWReOpiTG+C5d7913Amkwi0urDDdrgjll 223A+W6nLq7ID6ooyGftVF+ys/zkrS7YHHPq9ECBJWG/g+XACiBj8DaKe4eazYQFJ4Z6kq dLnuRB71rwSfmK57n3t7J6Cxhi2Y8VvOn39M5wR3FC2xdRTQVLpdL113VPk5SdXbMBXR0t rUw896hFYOe8uCWQ69QdW99KkuxP5cNtZ3Eq9jykXXpxDhIAoTRl6BJmaj82B054DEZopq eUFBZlGTL3wOCWkLF5ToeB2wUeycZOmcZXxPvrKykx6wLVvEMvhph+WpezrvhQ== Content-Type: multipart/mixed; boundary="------------fKj9zxQvUJFymmq3hpBd5ALa" Message-ID: <6cb6109d-71d6-490c-8056-d8885081b008@vondra.me> Date: Wed, 13 Aug 2025 14:15:37 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Peter Geoghegan Cc: Thomas Munro , Andres Freund , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <8ed1d326-5c6e-476e-b3fd-30d3da210546@vondra.me> <152ea782-5bd4-4435-b021-0ab2da61e63d@vondra.me> <7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me> <2e63cadd-2a03-46b1-866e-7ea5d3ffd37f@vondra.me> <51b5f71b-5f19-4453-91ff-2b9f2a840c58@vondra.me> Content-Language: en-US From: Tomas Vondra In-Reply-To: X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgddufeekudekucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpegtkfffgggfuffvvehfhfgjsehmtderredtvdejnecuhfhrohhmpefvohhmrghsucggohhnughrrgcuoehtohhmrghssehvohhnughrrgdrmhgvqeenucggtffrrghtthgvrhhnpeekudfhgfelgefhjedtheelhffgudegjedufefftedutdehlefhjeeftdevieefieenucfkphepkeeirdegledrvdeftddrvddtieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeekiedrgeelrddvfedtrddvtdeipdhhvghloheplgdutddrudefjedrtddrvdgnpdhmrghilhhfrhhomhepthhomhgrshesvhhonhgurhgrrdhmvgdpnhgspghrtghpthhtohepuddtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepthhhohhmrghsrdhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtoheprghnughrvghssegrnhgrrhgriigvlhdruggvpdhrtghpthhtohepsgihrghvuhiikedusehgmhgrihhlrdgtohhmpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhmpdhrtghpthhto hepphhgshhqlhdqhhgrtghkvghrsheslhhishhtshdrphhoshhtghhrvghsqhhlrdhorhhgpdhrtghpthhtohepghhkohhkohhlrghtohhssehprhhothhonhhmrghilhdrtghomh List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------fKj9zxQvUJFymmq3hpBd5ALa Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 8/13/25 01:33, Peter Geoghegan wrote: > On Tue, Aug 12, 2025 at 7:10 PM Tomas Vondra wrote: >> Actually, this might be a consequence of how backwards scans work (at >> least in btree). I logged the block in index_scan_stream_read_next, and >> this is what I see in the forward scan (at the beginning): > > Just to be clear: you did disable deduplication and then reindex, > right? You're accounting for the known issue with posting list TIDs > returning TIDs in the wrong order, relative to the scan direction > (when the scan direction is backwards)? > > It won't be necessary to do this once I commit my patch that fixes the > issue directly, on the nbtree side, but for now deduplication messes > things up here. And so for now you have to work around it. > No, I forgot about that (and the the patch only applies to master). >> But with the backwards scan we apparently scan the values backwards, but >> then the blocks for each value are accessed in forward direction. So we >> do a couple blocks "forward" and then jump to the preceding value - but >> that's a couple blocks *back*. And that breaks the lastBlock check. > > I don't think that this should be happening. The read stream ought to > be seeing blocks in exactly the same order as everything else. > >> I believe this applies both to master and the prefetching, except that >> master doesn't have read stream - so it only does sync I/O. > > In what sense is it an issue on master? > > On master, we simply access the TIDs in whatever order amgettuple > returns TIDs in. That should always be scan order/index key space > order, where heap TID counts as a tie-breaker/affects the key space in > the presence of duplicates (at least once that issue with posting > lists is fixed, or once deduplication has been disabled in a way that > leaves no posting list TIDs around via a reindex). > > It is certainly not surprising that master does poorly on backwards > scans. And it isn't all that surprising that master does worse on > backwards scans when direct I/O is in use (per the explanation > Andres offered just now). But master should nevertheless always read > the TIDs in whatever order it gets them from amgettuple in. > > It sounds like amgetbatch doesn't really behave analogously to master > here, at least with backwards scans. It sounds like you're saying that > we *won't* feed TIDs heap block numbers to the read stream in exactly > scan order (when we happen to be scanning backwards) -- which seems > wrong to me. > > As you pointed out, a forwards scan of a DESC column index should feed > heap blocks to the read stream in a way that is very similar to an > equivalent backwards scan of a similar ASC column on the same table. > There might be some very minor differences, due to differences in the > precise leaf page boundaries among each of the indexes. But that > should hardly be noticeable at all. > I gave this another try, this time with disabled deduplication, and on master I also applied the patch (but now I realize that's probably unnecessary, right?). I did a couple more things for this experiment: 1) created a second table with an "inverse pattern" that's decreasing: create table t2 (like t) with (fillfactor = 20); insert into t2 select -a, b from t; create index idx2 on t2 (a); alter index idx2 set (deduplicate_items = false); reindex index idx2; The idea is that SELECT * FROM t WHERE (a BETWEEN x AND y) ORDER BY a ASC is the same "block pattern" as SELECT * FROM t2 WHERE (a BETWEEN -y AND -x) ORDER BY a DESC 2) added logging to heapam_index_fetch_tuple elog(LOG, "heapam_index_fetch_tuple block %u", ItemPointerGetBlockNumber(tid)); 3) disabled autovacuum (so that it doesn't trigger any logs) 4) python script that processes the block numbers and counts number of blocks, runs, forward/backward advances 5) bash script that runs 4 "equivalent" queries on t/t2, with ASC/DESC. And the results look like this (FWIW this is with io_method=sync): Q1: SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 Q2: SELECT * FROM t2 WHERE a BETWEEN -49103 AND -16336 master / buffered query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 575 1048576 57365 53648 3716 Q1 DESC 10245 1048576 57365 3716 53648 Q2 ASC 14819 1048576 86061 53293 32767 Q2 DESC 1063 1048576 86061 32767 53293 prefetch / buffered query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 701 1048576 57365 53648 3716 Q1 DESC 1805 1048576 57365 3716 53648 Q2 ASC 1221 1048576 86061 53293 32767 Q2 DESC 2101 1048576 86061 32767 53293 master / direct query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 6101 1048576 57365 53648 3716 Q1 DESC 12041 1048576 57365 3716 53648 Q2 ASC 14837 1048576 86061 53293 32767 Q2 DESC 14690 1048576 86061 32767 53293 prefetch / direct query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 1504 1048576 57365 53648 3716 Q1 DESC 9034 1048576 57365 3716 53648 Q2 ASC 6988 1048576 86061 53293 32767 Q2 DESC 8959 1048576 86061 32767 53293 The timings are from runs without the extra logging, but there's still quite a bit of run to run variation. But the differences are somewhat stable. Some observations: * The block stats are perfectly stable (for each query), both for each build and between builds. And also perfectly symmetrical between the ASC/DESC version of each query. The ASC does the same number of "forward" steps like DESC does "backward" steps. * There's a clear difference between Q1 and Q2, with Q2 having many more runs (and not as "nice" forward/backward steps). When I created the t2 data set, I expected Q1 ASC to behave the same as Q2 DESC, but it doesn't seem to work that way. Clearly, the "descending" pattern in t2 breaks the sequence of block numbers into many more runs. >> Could that hide the extra buffer accesses, somehow? > > I think that you meant to ask about *missing* buffer hits with the > patch, for the forwards scan. That doesn't agree with the backwards > scan with the patch, nor does it agree with master (with either the > forwards or backwards scan). Note that the heap accesses themselves > appear to have sane/consistent numbers, since we always see > "read=49933" as expected for those, for all 4 query executions that I > showed. > > The "missing buffer hits" issue seems like an issue with the > instrumentation itself. Possibly one that is totally unrelated to > everything else we're discussing. > Yes, I came to this conclusion too. The fact that the stats presented above are exactly the same for all the different cases (for each query) is a sign it's about the tracking. In fact, I believe this is about io_method. I initially didn't see the difference you described, and then I realized I set io_method=sync to make it easier to track the block access. And if I change io_method to worker, I get different stats, that also change between runs. With "sync" I always get this (after a restart): Buffers: shared hit=7435 read=52801 while with "worker" I get this: Buffers: shared hit=4879 read=52801 Buffers: shared hit=5151 read=52801 Buffers: shared hit=4978 read=52801 So not only it changes run to tun, it also does not add up to 60236. I vaguely recall I ran into this some time ago during AIO benchmarking, and IIRC it's due to how StartReadBuffersImpl() may behave differently depending on I/O started earlier. It only calls PinBufferForBlock() in some cases, and PinBufferForBlock() is what updates the hits. In any case, it seems to depend on io_method, and it's confusing. regards -- Tomas Vondra --------------fKj9zxQvUJFymmq3hpBd5ALa Content-Type: text/x-python; charset=UTF-8; name="stats.py" Content-Disposition: attachment; filename="stats.py" Content-Transfer-Encoding: base64 IyEvdXNyL2Jpbi9lbnYgcHl0aG9uCgppbXBvcnQgc3lzCgpmaXJzdF9ibG9jayA9IGludChz eXMuYXJndlsxXSkKbGFzdF9ibG9jayA9IGludChzeXMuYXJndlsyXSkKCm5ibG9ja3MgPSAw Cm5ydW5zID0gMApuZm9yd2FyZCA9IDAKbmJhY2t3YXJkID0gMAoKcHJldl9ibG9jayA9IDAK CmZvciBsaW5lIGluIHN5cy5zdGRpbjoKCgliID0gaW50KGxpbmUpCgoJaWYgYiA8IGZpcnN0 X2Jsb2NrIG9yIGIgPiBsYXN0X2Jsb2NrOgoJCWNvbnRpbnVlCgoJbmJsb2NrcyArPSAxCgoJ aWYgYiAhPSBwcmV2X2Jsb2NrOgoJCW5ydW5zICs9IDEKCglpZiBiID4gcHJldl9ibG9jayBh bmQgcHJldl9ibG9jayAhPSAwOgoJCW5mb3J3YXJkICs9IDEKCglpZiBiIDwgcHJldl9ibG9j ayBhbmQgcHJldl9ibG9jayAhPSAwOgoJCW5iYWNrd2FyZCArPSAxCgoJcHJldl9ibG9jayA9 IGIKCnByaW50IChuYmxvY2tzLCBucnVucywgbmZvcndhcmQsIG5iYWNrd2FyZCkK --------------fKj9zxQvUJFymmq3hpBd5ALa Content-Type: application/x-shellscript; name="explain-analyze.sh" Content-Disposition: attachment; filename="explain-analyze.sh" Content-Transfer-Encoding: base64 IyEvdXNyL2Jpbi9lbnYgYmFzaAoKZnVuY3Rpb24gZWEgewoKCWZvciByIGluICQoc2VxIDEg NSk7IGRvCgoJCXE9JDEKCgkJcm0gcGcubG9nOyBwZ19jdGwgLUQgdG1wL2RhdGEgLWwgcGcu bG9nIHJlc3RhcnQgPiAvZGV2L251bGwgMj4mMQoJCXN1ZG8gLi9kcm9wLWNhY2hlcy5zaAoK CQl0PSQocHNxbCB0ZXN0IC1jICIkcSIgfCBncmVwICdFeGVjdXRpb24gVGltZScgfCBhd2sg J3twcmludCAkM30nKQoKCQlzdGF0cz0kKGdyZXAgJ2hlYXBhbV9pbmRleF9mZXRjaF90dXBs ZSBibG9jaycgcGcubG9nIHwgYXdrICd7cHJpbnQgJDExfScgfCAuL3N0YXRzLnB5IDI0ODkx IDc0ODIzKQoKCQllY2hvICInJHEnIiAkdCAkc3RhdHMKCWRvbmUKfQoKCmVhICJFWFBMQUlO IChBTkFMWVpFLCBWRVJCT1NFLCBDT1NUUyBPRkYpIFNFTEVDVCAqIEZST00gdCBXSEVSRSBh IEJFVFdFRU4gMTYzMzYgQU5EIDQ5MTAzIE9SREVSIEJZIGEiCmVhICJFWFBMQUlOIChBTkFM WVpFLCBWRVJCT1NFLCBDT1NUUyBPRkYpIFNFTEVDVCAqIEZST00gdCBXSEVSRSBhIEJFVFdF RU4gMTYzMzYgQU5EIDQ5MTAzIE9SREVSIEJZIGEgREVTQyIKCmVhICJFWFBMQUlOIChBTkFM WVpFLCBWRVJCT1NFLCBDT1NUUyBPRkYpIFNFTEVDVCAqIEZST00gdDIgV0hFUkUgYSBCRVRX RUVOIC00OTEwMyBBTkQgLTE2MzM2IE9SREVSIEJZIGEiCmVhICJFWFBMQUlOIChBTkFMWVpF LCBWRVJCT1NFLCBDT1NUUyBPRkYpIFNFTEVDVCAqIEZST00gdDIgV0hFUkUgYSBCRVRXRUVO IC00OTEwMyBBTkQgLTE2MzM2IE9SREVSIEJZIGEgREVTQyIKCg== --------------fKj9zxQvUJFymmq3hpBd5ALa--