Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ujPJu-007FBq-U1 for pgsql-hackers@arkaria.postgresql.org; Tue, 05 Aug 2025 21:35:43 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ujPJs-00CFRo-Jc for pgsql-hackers@arkaria.postgresql.org; Tue, 05 Aug 2025 21:35:40 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ujPJs-00CFKQ-2H for pgsql-hackers@lists.postgresql.org; Tue, 05 Aug 2025 21:35:40 +0000 Received: from mail-wr1-x429.google.com ([2a00:1450:4864:20::429]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1ujPJo-000vPl-2O for pgsql-hackers@lists.postgresql.org; Tue, 05 Aug 2025 21:35:39 +0000 Received: by mail-wr1-x429.google.com with SMTP id ffacd0b85a97d-3b7910123a0so5930652f8f.1 for ; Tue, 05 Aug 2025 14:35:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1754429735; x=1755034535; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=OROSOnP46B3nwRsXZMsJO+xoDa2rcXu2r5FE8QhkXwU=; b=y2pO5aCuoIQSJDo/rKgdS4LmUMzlmZ96F38JpFdvmQIzFbJDkHM502XdgwK6/6p1Ye VE3YoA2zpk6e/IGPoDaTlaaN4M+PuPxd3OoEd58tNntvoWoCIRSPa4489GJ0RHYzb8WQ xW+oC7Z+LOAoQQUj2J2L7Scytp8zmyxe8gc+VyMk8tUcCC1/SQpGoLXavBLHpNjUzodB f3wl1BwSnVUlUScqE40V3CphifDDEWYTi4BcaM6OgDCrnaaFZIjpAWtrRO+u+F5LVgvY UoayiSeUuPGmLzXTmywpl1QVbVYgkdMMRkACPNRkzowgghXe8dheTbO45HlqYiQ1Mteq c7XA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1754429735; x=1755034535; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=OROSOnP46B3nwRsXZMsJO+xoDa2rcXu2r5FE8QhkXwU=; b=WVUQ+43cW7ascVorwpRFS2N4GJ2puK3SXlvoJnSBfXmDZb765A5Uo2VkCEn6lodTY9 v9P1ztnCEA7NRQTQenoLQYBS3+e1Szd7bdylanbN/w22GG9BDADnIPlJ+LFcCR5OKaX1 raQyP1RGR84th9Yt39XaIjwdy0UTqspjEyyecup7bIXLEqraHsAqRiulqivKU+IB9Yzr 8P0xlyrKyhL22zbrT4IwjjQLfVops2qddupJsY9vAxfPKckTmN/x9ptnH/lGKc39geD0 0F1RcLOmX6K1H7dewIGQdqUdxuo9S+SanXxyuiwElZuuU/6uYc2cHphQXfo4xX97hGFt hIww== X-Forwarded-Encrypted: i=1; AJvYcCWUoBZKojGv3K3//vNEm8RFcC3pXA1CoNeA8qOdomRfFOCgjzN7JRpHpf9OltPy4TayZ341u0bgtzM2/BbS@lists.postgresql.org X-Gm-Message-State: AOJu0YzLnmljGiH3/IGyl/mqCRR797A6CfOIO6XLgE/YlK+lFia+BMnr F2WdMys+qE3RZnz8mVv/zrtfuUhob3MpYtDJsGFtOMSKoKVe+gBWHS0kOy0c+tb2jaWaQrm29mD 7Cwkqkw88C/Sw7HGZG3dSIgKt2IT14iSLmrhgVnfuVA== X-Gm-Gg: ASbGncv6705xRE5fb37oOvjMyoxjev2+MfUbLrNjKoCxkS79314xiMxCEdRk7Fh/ASX qTeRbH0mIoOXsjcVfy+p/mT2yuaCH39E//GrcrJJthOaHBwcMcbhWk87xiJMOwX2T7BDGPqRPzA 6MYdlnVvJsNFvXR1BX9JFvJ+j2rJtPqzhU1jR5ZbbtKwjujWtIWbjcNcixwG2H3NjAJdOcW6N/Z SL5G0Nl3BGXfEkiyg== X-Google-Smtp-Source: AGHT+IFa0NBGlqGGD7qb8uGUQHWeVpYmPNpERyU7YKyKjvND1ZvqhWmmQAZq+JURIWFlxwAewh+JTm80ez096xUpdO0= X-Received: by 2002:a5d:5d86:0:b0:3b7:8fc4:2f62 with SMTP id ffacd0b85a97d-3b8f41ac94fmr368732f8f.33.1754429734889; Tue, 05 Aug 2025 14:35:34 -0700 (PDT) MIME-Version: 1.0 References: <23f490f4-8325-408c-91a0-a6757ab2441c@vondra.me> <1bebfd1b-aea5-4d41-80a6-eae64b8f9eaf@vondra.me> <8ed1d326-5c6e-476e-b3fd-30d3da210546@vondra.me> <38b865bd-2ae9-4a94-a788-6e3dc99ccd70@vondra.me> <306fc8c0-c882-4602-86f5-a106b9ace603@vondra.me> <152ea782-5bd4-4435-b021-0ab2da61e63d@vondra.me> <7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me> In-Reply-To: <7c2f6350-6fca-4e39-b0a8-8ac735f5d58a@vondra.me> From: Peter Geoghegan Date: Tue, 5 Aug 2025 17:35:08 -0400 X-Gm-Features: Ac12FXw-jKRgvdHxxIjNEmNDxGc5Ke1CRCTm5dEt1o1RNLzJuiDdL5tbiF5jI6k Message-ID: Subject: Re: index prefetching To: Tomas Vondra Cc: Nazir Bilal Yavuz , Thomas Munro , Andres Freund , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Aug 5, 2025 at 4:56=E2=80=AFPM Tomas Vondra wrote= : > Probably. It was hard to predict which values will be interesting, maybe > we can pick some subset now. I'll start by just doing larger steps, I > think. Maybe increase by 4x rather than 2x, that'll reduce the number of > combinations a lot. Also, I plan to stick to fillfactor=3D20, it doesn't > seem to have a lot of impact anyway. I don't think that fillfactor matters all that much, either way. A low setting provides a simple way of simulating "wide" heap tuples, but that probably isn't going to make the crucial difference. It's not like the TPC-C index I used in my own recent testing (which showed that the complex patch was almost 3x faster than the simple patch) has all that strong of a pg_stats.correlation. You can probably come up with indexes/test cases where groups of related TIDs that each point to the same heap block appear together, even though in general the index tuple heap TIDs appear completely out of order. It probably isn't even that different to a simple pgbench_accounts_pkey from a prefetching POV, though, in spite of these rather conspicuous differences. In time we might find that just using pgbench_accounts_pkey directly works just as well for our purposes (unsure of that, but seems possible). > So what other index scan variations would you suggest to test? I can > imagine e.g. IN () conditions with variable list length, maybe > multi-column indexes, and/or skip scan cases. Any other ideas? The only thing that's really interesting about IN() conditions is that they provide an easy way to write a query that only returns a subset of all index tuples from every leaf page read. You can get a similar access pattern from other types of quals, but that's not quite as intuitive. I really don't think that IN() conditions are all that special. They're perfectly fine as a way of getting this general access pattern. I like to look for and debug "behavioral inconsistencies". For example, I have an open item in my notes (which I sent to you over IM a short while ago) about a backwards scan that is significantly slower than an "equivalent" forwards scan. This involves pgbench_accounts_pkey. It's quite likely that the underlying problem has nothing much to do with backwards scans. I suspect that the underlying problem is a more general one, that could also be seen with the right forwards scan test case. In general, it might make the most sense to look for pairs of similar-ish queries that are inconsistent in a way that doesn't make sense intuitively, in order to understand and fix the inconsistency. Since chances are that it's actually just some kind of performance bug that accidentally doesn't happen in only one variant of the query. I bet that there's at least a couple of not-that-noticeable performance bugs, for example due to some hard to pin down issue with prefetch distance getting out of hand. Possibly because the read stream doesn't get to see contiguous requests for TIDs that point to the same heap page, but does see it when things are slightly out of order. Two different queries that have approximately the same accesses should have approximately the same performance -- minor variations in leaf page layout or heap page layout or scan direction shouldn't be confounding. > FWIW I'm not planning to keep testing simple vs complex patches. We've > seen the complex patch can do much better in certain workloads cases, > the fact that we can discover more such cases does not change much. > > I'm much more interested in benchmarking master vs. complex patch. Great! > > It'd also be good to just not test "sync" anymore, at some point. And > > maybe to standardize on testing either "worker" or "io_uring" for most > > individual tests. There's just too many tests right now. > > > > Agreed. Might also make sense to standardize on direct I/O when testing the patch (but probably not when testing master). The fact that we can't get any OS readahead is likely to be useful. > > Andres recently told me that he isn't expecting to be able to simulate > > read-ahead with direct I/O. It seems possible that read-ahead > > eventually won't be used at all, which argues for the complex patch. > > > > True, the complex patch could prefetch the leaf pages. What I meant was that the complex patch can make up for the fact that direct I/O presumably won't ever have an equivalent to simple read-ahead. Just by having a very flexible prefetching implementation (and without any special sequential access heuristics ever being required). > > BTW, I experimented with using READ_STREAM_USE_BATCHING (not > > READ_STREAM_DEFAULT) in the complex patch. That's probably > > deadlock-prone, but I suspect that it works well enough to get a good > > sense of what is possible. What I saw (with that same TPC-C test > > query) was that "I/O Timings" was about 10x lower, even though the > > query runtime didn't change at all. This suggests to me that "I/O > > Timings" is an independently interesting measure: getting it lower > > might not visibly help when only one query runs, but it'll likely > > still lead to more efficient use of available I/O bandwidth in the > > aggregate (when many queries run at the same time). > > > > Interesting. Does that mean we should try enabling batching in some > cases? Or just that there's room for improvement? I don't know what it means myself. I never got as far as even starting to understand what it would take to make READ_STREAM_USE_BATCHING work. AFAIK it wouldn't be hard to make that work here at all, in which case we should definitely use it. OTOH, maybe it's really hard. I just don't know right now. > Could we do the next_block callbacks in a way that make deadlocks > impossible? > > I'm not that familiar with the batch mode - how would the deadlock even > happen in index scans? I have no idea. Maybe it's already safe. I didn't notice any problems (but didn't look for them, beyond running my tests plus the regression tests). > I think the only way is to try reworking some of the index AMs to use > the new interface. For some AMs (e.g. hash) it's going to be very > similar to what you did with btree, because it basically works like a > btree. For others (GiST/SP-GiST) it may be more work. The main difficulty with GiST may be that we may be obligated to fix existing (unfixed!) bugs that affect index-only scans. The master branch is subtly broken, but we can't in good conscience ignore those problems while making these kinds of changes. > It doesn't need to be committable, just good enough to be reasonably > certain it's possible. That's what I have in mind, too. If we have support for a second index AM, then we're much less likely to over-optimize for nbtree in a way that doesn't really make sense. > Understood, and I agree in principle. It's just that given the fuzziness > I find it hard how it should look like. I suspect that index AMs are much more similar for the purposes of prefetching than they are in other ways. -- Peter Geoghegan