Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ueKeU-007w5H-GJ for pgsql-hackers@arkaria.postgresql.org; Tue, 22 Jul 2025 21:35:59 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ueKeT-00EHNg-IY for pgsql-hackers@arkaria.postgresql.org; Tue, 22 Jul 2025 21:35:57 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ueKeT-00EHNY-59 for pgsql-hackers@lists.postgresql.org; Tue, 22 Jul 2025 21:35:57 +0000 Received: from mail-wm1-x335.google.com ([2a00:1450:4864:20::335]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1ueKeQ-000IKm-0Y for pgsql-hackers@lists.postgresql.org; Tue, 22 Jul 2025 21:35:57 +0000 Received: by mail-wm1-x335.google.com with SMTP id 5b1f17b1804b1-451dbe494d6so65364765e9.1 for ; Tue, 22 Jul 2025 14:35:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1753220153; x=1753824953; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=2RtOay7uTQoeGjm9fBmsnYleBfhY1ee1SdvAXmw8/X8=; b=A4LfyDs1CtRpgvBwa36v4tmsI7iJ87QLDApPl4kXO8rHlHHXXKgJaklEiWRrk28oB/ csLBIuIlzvrfPfINkZjfamepzo4PR5cNXMvBN4wzPzvDmACmBy7VGRhAkbugrSud5xWU gBt2OVMkC6stqdfYVPe4PSH8VEXElk9s8oVlROTNknQhbKE0IMvIMoCUpOFH8m0fRkIZ CB9uuGyGPE+mT01yiWMNNcacEg0nti7sdqoiq2PvRJkr81RKYOpsuzLHrqzrVEFvn6LG 5RX9kQgAUBDTIhaqbhfy8RS6FT24tD5gG99irOaFUwJy1hD2FzadOx6HRwD1BrHB1KLy Tyag== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1753220153; x=1753824953; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=2RtOay7uTQoeGjm9fBmsnYleBfhY1ee1SdvAXmw8/X8=; b=ElUvGEoIRk+DfGFLm1XkpNFiKkaOzH7YvF+NcCeC4f05tmomM+/SPywhbvTRq8m/aR c3yXu4+5aa2FzB9uHc0sXMD9b2DKeC9O7L6bv93ZRY6HP3Lld0KPTRSa/uIE58kf4vfL 0MLWZDv1V2q/iq1Wo5lJQhvZrka2/hliKnmZYnoOfo/B7g0LN0w4oztPKyyR9dXhzJRw clQExqcIhOaa9EC49lPk5frnFMsGkhpAIySQWmaIVYUtnfPPgvA+FnQj1adH1Y1rKh0X zxuHVePjbTU8OP/eAv398500dzqc63Add/GN+0bBzZ0OgIHdO8EX7hl0TxIFDdIJ9wR3 S51A== X-Forwarded-Encrypted: i=1; AJvYcCV8NjMsFqBbHf+pnPIwAJR0O4PU7E3ocWZHISYM+t6NdYtWnwmNvMSHqB0VYhGFmJi/MYUEyOoytz3oyXpS@lists.postgresql.org X-Gm-Message-State: AOJu0YwDJRXz4+Pippt5KeQiuGNWf0C6GL5fcm4v02mvHPdemTN6Jl1j F65Ki12RPXLcga6TD2edtTY/gtW1dzx2R4y//zLWNMzV8KoYz5xTGQGmXAEPVbXNrZ3hBMk4M3X xjcOyPsvhiDvPCmTL4CtOW2HQFOVEPzDrg5n/BsdmCw== X-Gm-Gg: ASbGncs1/uMUxFhEbLexAwxBs0RMI+gTMIQc5kYtJIYXUp7kU4wX931o7FR8L7Gp93r vUrv144jtFucVQJLsMNwXlIgDcVVjl/AzIp41JddfIbUWJMb8ksgYiin5sUjvGXxUSEWB+UvdCC XBz7j17c4o/y1qHSkBuR6mHwqa1uEWsRl8ZxLu7zmRPX4QkhX+gzMgqUQfu1+zvnFnQSZpVukr3 MR0C8g= X-Google-Smtp-Source: AGHT+IHWTi1HwZb5JHKmVjbREKUi+OI9n4H4lMXlqwSZg2TTldhJjruzryDKY9mlBykRHFaf9N2sLZKYwCSRfqNXaGM= X-Received: by 2002:a05:600c:1c88:b0:43c:fdbe:4398 with SMTP id 5b1f17b1804b1-45868c8f822mr2428195e9.6.1753220152722; Tue, 22 Jul 2025 14:35:52 -0700 (PDT) MIME-Version: 1.0 References: <32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me> <64c8b824-6203-46a3-b045-5e95b796feee@vondra.me> <03dcc1a9-c5d0-4965-889c-684dc0a7580c@vondra.me> <23f490f4-8325-408c-91a0-a6757ab2441c@vondra.me> <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me> In-Reply-To: <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me> From: Peter Geoghegan Date: Tue, 22 Jul 2025 17:35:26 -0400 X-Gm-Features: Ac12FXyqQAlEotZliqZ57Gvtrtea-mEAhxnrnrnj2UqrQZ16NB6QJdRLNu_1SSQ Message-ID: Subject: Re: index prefetching To: Tomas Vondra Cc: Nazir Bilal Yavuz , Thomas Munro , Andres Freund , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Jul 22, 2025 at 4:50=E2=80=AFPM Tomas Vondra wrot= e: > > Obviously, whatever advantage that the "complex" patch has is bound to > > be limited to cases where index characteristics are naturally the > > limiting factor. For example, with the pgbench_accounts_pkey table > > there are only ever 6 distinct heap blocks on each leaf page. I bet > > that your "linear" test more or less looks like that, too. > > > > Yes. It's definitely true we could construct examples where the complex > patch beats the simple one for this reason. It's literally the only possible valid reason why the complex patch could w= in! The sole performance justification for the complex patch is that it can prevent the heap prefetching from getting bottlenecked on factors tied to physical index characteristics (when it's possible in principle to avoid getting bottlenecked in that way). Unsurprisingly, if you assume that that'll never happen, then yeah, the complex patch has no performance advantage over the simple one. I happen to think that that's a very unrealistic assumption. Most standard benchmarks have indexes that almost all look fairly similar to pgbench_accounts_pkey, from the point of view of "heap page blocks per leaf page". There are exceptions, of course (e.g., the TPC-C order table's primary key suffers from fragmentation). > And I believe some of those > examples could be quite realistic, even if not very common (like when > very few index tuples fit on a leaf page). I don't think cases like that matter very much at all. The only thing that *really* matters on the index AM side is the logical/physical correlation. Which your testing seems largely unconcerned with. > However, I'm not sure the pgbench example with only 6 heap blocks per > leaf is very significant. Sure, the simple patch can't prefetch TIDs > from the following leaf, but AFAICS the complex patch won't do that > either. Why not? > Not because it couldn't, but because with that many hits the > distance will drop to ~1 (or close to it). (It'll probably prefetch a > couple TIDs from the next leaf at the very end of the page, but I don't > think that matters overall.) Then why do your own test results continue to show such a big advantage for the complex patch, over the simple patch? > I'm not sure what prefetch distances will be sensible in queries that do > other stuff. The queries in the benchmark do just the index scan, but if > the query does something with the tuple (in the nodes on top), that > shortens the required prefetch distance. Of course, simple queries will > benefit from prefetching far ahead. Doing *no* prefetching will usually be the right thing to do. Does that make index prefetching pointless in general? > Thanks. I wonder how difficult would it be to add something like this to > pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and > count distinct blocks, right? Seems quite useful. I agree that that would be quite useful. > > Did you restore the distance for the "complex" patch, too? I think > > that it might well matter there too. > > > > No, I did not. I did consider it, but it seemed to me it can't really > make a difference (for these data sets), because each leaf has ~300 > items, and the patch limits the prefetch to 64 leafs. That means it can > prefetch ~20k TIDs ahead, and each heap page has ~20 items. So this > should be good enough for eic=3D1000. It should never hit stream reset. It looks like the complex patch can reset the read stream for a couple of reasons, which I don't fully understand right now. I'm mostly thinking of this stuff: /* * If we advanced to the next batch, release the batch we no * longer need. The positions is the "read" position, and we ca= n * compare it to firstBatch. */ if (pos->batch !=3D scan->batchState->firstBatch) { batch =3D INDEX_SCAN_BATCH(scan, scan->batchState->firstBat= ch); Assert(batch !=3D NULL); /* * XXX When advancing readPos, the streamPos may get behind= as * we're only advancing it when actually requesting heap * blocks. But we may not do that often enough - e.g. IOS m= ay * not need to access all-visible heap blocks, so the * read_next callback does not get invoked for a long time. * It's possible the stream gets so mucu behind the positio= n * gets invalid, as we already removed the batch. But that * means we don't need any heap blocks until the current re= ad * position - if we did, we would not be in this situation = (or * it's a sign of a bug, as those two places are expected t= o * be in sync). So if the streamPos still points at the bat= ch * we're about to free, just reset the position - we'll set= it * to readPos in the read_next callback later. * * XXX This can happen after the queue gets full, we "pause= " * the stream, and then reset it to continue. But I think t= hat * just increases the probability of hitting the issue, it'= s * just more chance to to not advance the streamPos, which * depends on when we try to fetch the first heap block aft= er * calling read_stream_reset(). */ if (scan->batchState->streamPos.batch =3D=3D scan->batchState->firstBatch) index_batch_pos_reset(scan, &scan->batchState->streamPo= s); > > Isn't the obvious explanation that the complex patch benefits from > > being able to prefetch without being limited by index > > characteristics/leaf page boundaries, while the simple patch doesn't? > > > > That's a valid interpretation, yes. Although the benefit comes mostly The benefit comes mostly from....? > Yes, there's some similarity. Attached is the script I use to create the > tables and load the data. Another issue with the testing that biases it against the complex patch: heap fill factor is set to only 25 (but you use the default index fill-factor). > The "linear" is a table with a simple sequence of values (0 to 100k). > More or less - the value is a floating point, and there are 10M rows. > But you get the idea. > > The "linear_X" variants mean the value has a noise of X% of the range. > So with "linear_1" you get the "linear" value, and then random(0,1000), > with normal distribution. I don't get why this is helpful to test, except perhaps as a general smoke = test. If I zoom into any given "linear_1" leaf page, I see TIDs that appear in an order that isn't technically uniformly random order, but is fairly close to it. At least in a practical sense. At least for the purposes of prefetching. For example: pg@regression:5432 [104789]=3D# select itemoffset, htid from bt_page_items('linear_1_a_idx', 4); =E2=94=8C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=AC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=90 =E2=94=82 itemoffset =E2=94=82 htid =E2=94=82 =E2=94=9C=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=BC=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=A4 =E2=94=82 1 =E2=94=82 =E2=88=85 =E2=94=82 =E2=94=82 2 =E2=94=82 (10,18) =E2=94=82 =E2=94=82 3 =E2=94=82 (463,9) =E2=94=82 =E2=94=82 4 =E2=94=82 (66,8) =E2=94=82 =E2=94=82 5 =E2=94=82 (79,9) =E2=94=82 =E2=94=82 6 =E2=94=82 (594,7) =E2=94=82 =E2=94=82 7 =E2=94=82 (289,13) =E2=94=82 =E2=94=82 8 =E2=94=82 (568,2) =E2=94=82 =E2=94=82 9 =E2=94=82 (237,2) =E2=94=82 =E2=94=82 10 =E2=94=82 (156,10) =E2=94=82 =E2=94=82 11 =E2=94=82 (432,9) =E2=94=82 =E2=94=82 12 =E2=94=82 (372,17) =E2=94=82 =E2=94=82 13 =E2=94=82 (554,6) =E2=94=82 =E2=94=82 14 =E2=94=82 (1698,11) =E2=94=82 =E2=94=82 15 =E2=94=82 (389,6) =E2=94=82 *** SNIP *** =E2=94=82 288 =E2=94=82 (1264,5) =E2=94=82 =E2=94=82 289 =E2=94=82 (738,16) =E2=94=82 =E2=94=82 290 =E2=94=82 (1143,3) =E2=94=82 =E2=94=82 291 =E2=94=82 (400,1) =E2=94=82 =E2=94=82 292 =E2=94=82 (1157,10) =E2=94=82 =E2=94=82 293 =E2=94=82 (266,2) =E2=94=82 =E2=94=82 294 =E2=94=82 (502,9) =E2=94=82 =E2=94=82 295 =E2=94=82 (85,15) =E2=94=82 =E2=94=82 296 =E2=94=82 (282,2) =E2=94=82 =E2=94=82 297 =E2=94=82 (453,5) =E2=94=82 =E2=94=82 298 =E2=94=82 (396,6) =E2=94=82 =E2=94=82 299 =E2=94=82 (267,18) =E2=94=82 =E2=94=82 300 =E2=94=82 (733,15) =E2=94=82 =E2=94=82 301 =E2=94=82 (108,8) =E2=94=82 =E2=94=82 302 =E2=94=82 (356,16) =E2=94=82 =E2=94=82 303 =E2=94=82 (235,10) =E2=94=82 =E2=94=82 304 =E2=94=82 (812,18) =E2=94=82 =E2=94=82 305 =E2=94=82 (675,1) =E2=94=82 =E2=94=82 306 =E2=94=82 (258,13) =E2=94=82 =E2=94=82 307 =E2=94=82 (1187,9) =E2=94=82 =E2=94=82 308 =E2=94=82 (185,2) =E2=94=82 =E2=94=82 309 =E2=94=82 (179,2) =E2=94=82 =E2=94=82 310 =E2=94=82 (951,2) =E2=94=82 =E2=94=94=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2= =94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=B4=E2=94=80=E2=94=80=E2=94= =80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80=E2=94=80= =E2=94=98 (310 rows) There's actually 55,556 heap blocks in total in the underlying table. So clearly there is some correlation here. Just not enough to ever matter very much to prefetching. Again, the sole test case that has that quality to it is the "linear" test case. --=20 Peter Geoghegan