Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umDwo-00BuTu-8V for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Aug 2025 16:03:30 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1umDvl-00GaXi-5f for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Aug 2025 16:02:25 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umDvk-00GaXZ-NZ for pgsql-hackers@lists.postgresql.org; Wed, 13 Aug 2025 16:02:25 +0000 Received: from mail-wr1-x429.google.com ([2a00:1450:4864:20::429]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1umDvi-000Z4V-1r for pgsql-hackers@lists.postgresql.org; Wed, 13 Aug 2025 16:02:24 +0000 Received: by mail-wr1-x429.google.com with SMTP id ffacd0b85a97d-3b914186705so1161220f8f.0 for ; Wed, 13 Aug 2025 09:02:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1755100941; x=1755705741; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Gd/7GQ+IyXcgEOuJW4y+Gg4LBeV/3ndUPWXZsfdue7k=; b=JyCYgKKM9GUswIzSkuCftadXJYNm6fvH18iNRVPkNxNmRveCvMujlqas7I22hhmhxp sqKe0I8jY0Ucaq+feGGLYeCjU2Pyj0kpK0rhEY04x3NcRHIdhaLBxwkMv+qyfRQgRSM5 x9Npl9K4DqgamKox8oEO2p4hN3LDKTAO57e0mb59zWg8MVP5VVWVOG/+Kq4asX5dWVdK 0/68xVw4O7WCrrvmi2EeoX/EqKasatabzbj+en2R1+pkznWfXz/yu+9Z42xZvEXRulgE XrZsTRwJCWRRVS/OoJER9LgKvhbCQG6QzFz8RvD51jc2S3oVQe14tsSkE3LyftFiQRKn lYMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755100941; x=1755705741; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Gd/7GQ+IyXcgEOuJW4y+Gg4LBeV/3ndUPWXZsfdue7k=; b=tBrNW7SzmNSb59BRlFEx8owpXpgjd1QSZmlwH7s7PJK5XX+oxbC1ecs8+xSK/Fn76i 1cXG7wwbUrbbsWAo9udbeEiNLdmeaOiaxIKom5SE/wgOh/+Abc6iZsQ5QjVoxu0+bIWB stjPPtKzrptUXKVx3i8L3OuE43c3NfPgP2TzigErrtaE+Z0+SMa78hNwIS8bzf7aYLMk fqsinEBb0eS5tdyuIIpIQOawCK680+hGneRdXxaM0nsRu1uAuovdLxoE03NASa0f/uMe kDmIQPDRsyvELtlXL/24QrqfKsanTqo2ZvcJYt8guepdPcz3fLMxH2P1N/OI0fYv09xl uS9g== X-Forwarded-Encrypted: i=1; AJvYcCW/m36BAyjsZ2+Ik6LoKwWdwFMVDPEwFe636HUOtjp21DZaVpj8OXRuZmYKoAn9GBcYx/li1cC0rUXiQXsE@lists.postgresql.org X-Gm-Message-State: AOJu0YziumBV+/Sswvt6aNkbUNEfLu8qiB6NFC/t8Kt+KC30jn5OUeFj MN0wsB1bQiz4zKHhajw1XjwIC8x7HCtDzHCByfISCjit6INAy9YYeZVzNPehoAAfaFU+sa/iwuW bLHmV3sgoy2lNnjZJfarC5cClI5UsIjwxiL2ybcKcDw== X-Gm-Gg: ASbGncv1mgxp/HPY+WwAmZDnP0KEWD4jeUHelTL96rxpusV6O8J7jgnQP1O/W3Vx8dG BhtkaH0Gu9wF74EkxAaP3JndFn1LGfYzSGJC+3W+3BTBmNjlRBO3JN64S2S+bs5Et0RZuRrfX9Q kY4G2pmp//Pry4vGGzA2YKCiNxx45N8iZdJjXQSREm77VHlXRj0qYrGdeJY9D4byW47drL4qy+t ibLBqI= X-Google-Smtp-Source: AGHT+IGrUfE691cY8l6bcdtQlVdyTLRoC/Bf7dYhAhEgHDiPZLqhNUAx4+hmVHnXHGpgDfOy4P7+rk7FJm1oQVoQ9oU= X-Received: by 2002:a05:6000:310d:b0:3b7:8154:aa36 with SMTP id ffacd0b85a97d-3b917d24577mr3034890f8f.7.1755100941011; Wed, 13 Aug 2025 09:02:21 -0700 (PDT) MIME-Version: 1.0 References: <51b5f71b-5f19-4453-91ff-2b9f2a840c58@vondra.me> <6cb6109d-71d6-490c-8056-d8885081b008@vondra.me> In-Reply-To: From: Peter Geoghegan Date: Wed, 13 Aug 2025 12:01:52 -0400 X-Gm-Features: Ac12FXykbd15BjrUmojeBkUfkBFcYyrZZSIxJy7Zu_6foTDzp8z67nRxyluqZj0 Message-ID: Subject: Re: index prefetching To: Andres Freund Cc: Tomas Vondra , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, Aug 13, 2025 at 11:28=E2=80=AFAM Andres Freund = wrote: > > With "sync" I always get this (after a restart): > > > > Buffers: shared hit=3D7435 read=3D52801 > > > > while with "worker" I get this: > > > > Buffers: shared hit=3D4879 read=3D52801 > > Buffers: shared hit=3D5151 read=3D52801 > > Buffers: shared hit=3D4978 read=3D52801 > > > > So not only it changes run to tun, it also does not add up to 60236. > > This is reproducible on master? If so, how? AFAIK it is *not* reproducible on master. > If this happens only with the prefetching patch applied, is is possible t= hat > what happens here is that we occasionally re-request buffers that already= in > the process of being read in? That would only happen with a read stream a= nd > io_method !=3D sync (since with sync we won't read ahead). If we have to = start > reading in a buffer that's already undergoing IO we wait for the IO to > complete and count that access as a hit: This theory seems quite plausible to me. Though it is a bit surprising that I see incorrect buffer hit counts on the "good" forwards scan case, rather than on the "bad" backwards scan case. Here's what I mean by things being broken on the read stream side (at least with certain backwards scan cases): When I add instrumentation to the read stream side, by adding elog debug calls that show the blocknum seen by read_stream_get_block, I see out-of-order and repeated blocknums with the "bad" backwards scan case ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc"): ... NOTICE: index_scan_stream_read_next: index 1163 TID (25052,21) WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 WARNING: blocknum: 25052, 0x55614810efb0 WARNING: blocknum: 25052, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 1161 TID (25053,3) WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25053 WARNING: blocknum: 25053, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 1160 TID (25052,19) WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 WARNING: blocknum: 25052, 0x55614810efb0 WARNING: blocknum: 25052, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 1141 TID (25051,21) WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25051 WARNING: blocknum: 25051, 0x55614810efb0 ... Notice that we see the same blocknum twice in close succession. Also notice that we're passed 25052 and then subsequently passed 25053, only to be passed 25053 once more. OTOH, when I run the equivalent "good" backwards scan ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a"), the output looks just about perfect. I have to look around quite a bit longer before I can find repeated blocknum within successive read_stream_get_block calls: ... NOTICE: index_scan_stream_read_next: index 303 TID (74783,1) WARNING: prior lastBlock is 74782 for batchno 2862, new one: 74783 WARNING: blocknum: 74783, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 323 TID (74784,1) WARNING: prior lastBlock is 74783 for batchno 2862, new one: 74784 WARNING: blocknum: 74784, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 324 TID (74783,21) WARNING: prior lastBlock is 74784 for batchno 2862, new one: 74783 WARNING: blocknum: 74783, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 325 TID (74784,2) WARNING: prior lastBlock is 74783 for batchno 2862, new one: 74784 WARNING: blocknum: 74784, 0x55614810efb0 ... These out-of-order repeat requests are much rarer. And I *never* see identical requests in *immediate* succession, whereas those are common with the backwards scan case. I believe that the out-of-order repeat requests shown here are a legitimate consequence of the TIDs being slightly out of order in relatively few places (so the forwards scan case may well already be behaving exactly as I expect): pg@regression:5432 [2470184]=3D# select ctid, a from t where ctid between '(74783,1)' and '(74784,1)'; =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=90 =E2=94=82 ctid =E2=94=82 a =E2=94=82 =E2=94=9C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=BC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=A4 =E2=94=82 (74783,1) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,2) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,3) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,4) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,5) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,6) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,7) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,8) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,9) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,10) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,11) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,12) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,13) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,14) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,15) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,16) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,17) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,18) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,19) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,20) =E2=94=82 49,077 =E2=94=82 =E2=94=82 (74783,21) =E2=94=82 49,078 =E2=94=82 =E2=94=82 (74784,1) =E2=94=82 49,077 =E2=94=82 =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=B4=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=98 (22 rows) Bear in mind that EXPLAIN ANALYZE shows *identical* "Buffers:" details for each query on master. So I believe that I am completely justified in expecting the calls to read_stream_get_block for the backwards scan to use identical blocknums to the ones for the equivalent/good forwards scan (except that they should be in the exact opposite order). And yet that's not what I see. Maybe this is something to do with the read position and the stream position becoming mixed up? I find it odd that the relevant readstream callback, index_scan_stream_read_next, says "If the stream position is undefined, just use the read position". That's just a guess, though. This issue is tricky to debug. I'm not yet used to debugging problems such as these (though I'll probably become an expert on it in the months ahead). --=20 Peter Geoghegan