Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w6FGb-0046Hf-36 for pgsql-hackers@arkaria.postgresql.org; Fri, 27 Mar 2026 22:02:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w6FGZ-00CLtd-1E for pgsql-hackers@arkaria.postgresql.org; Fri, 27 Mar 2026 22:02:55 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w6FGY-00CLtV-2g for pgsql-hackers@lists.postgresql.org; Fri, 27 Mar 2026 22:02:55 +0000 Received: from mail-wr1-x42f.google.com ([2a00:1450:4864:20::42f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w6FGV-00000001Oi1-2WDe for pgsql-hackers@lists.postgresql.org; Fri, 27 Mar 2026 22:02:54 +0000 Received: by mail-wr1-x42f.google.com with SMTP id ffacd0b85a97d-43b5bded412so1875220f8f.0 for ; Fri, 27 Mar 2026 15:02:51 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1774648970; cv=none; d=google.com; s=arc-20240605; b=BqyLrLvUQb/A8y9MAmMQEcz+jpxj3gsBGoOKP8WNeeGZsV57yDE8QGPl+bstkb01Yf juO3InnjQOpCI3w6JIY38aSWemZtmoqAVT8+BmNLHW3zj1/My5UNMXTD70a+lESENOtP wLa1KzI5xmhd3TrxbnpHUHfJPyJwtiSTMikyozM3AovIegnHUbQNohLE//1QzwdBMTGn j+InulNn6x2sMQwrC0WprOlYvmVlJcB/2pnFZsP8OKKmra7sRyc/4PQEsca3uu4eOEkP T3afqKnT2IsYgd+2/9h6M6+9T+T5BMh8SqtTFy1743OOrFwmDEBq4hyftSpLpN3z/QVt qTiQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=kxcuoXnFgaLDF1x/7FndJKAXg7Vim+VJxxTqfDoYIdY=; fh=d5oXY33XLWHEJIv9Iwr3jURDGvG8XjT00Cf3+uQMVjc=; b=i/vfVShTVEOgYeMI0zQGq926Z1r2lUh6deV1TIVFS8u6SAB6lWvSsCA6UobWFxkuAp JWkpHjXNNKTM1EWWRuNccj36Buo/cpou/dzL27LzT9giDkxtU/ojhlc4/Lios8iDmerD 09zWXDNUWhNEKS14wDxPP4UHr9scbimddP39miBizcXLcFIESOjRhEkioHUHPiAF0QGL 1U0tnXB96wkWDMtJYej+bywGDDbshL7Lwtt3pByBXv/MdaJWlenR8OVUQKxpcDWN8V/K BPUIwNOPnFlp2kb1RBI+LMBGeZlhF9yHWWDQarMvgkeVI4MOKSyYB7QWEXqFNnN3k6iY +g4g==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1774648970; x=1775253770; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=kxcuoXnFgaLDF1x/7FndJKAXg7Vim+VJxxTqfDoYIdY=; b=mYfTFS3egNNfa0v3JxbWPX4izBsqvpCbvwFVlWtW/X8PWWTBFXOJvFuBhJdX8qP8TF PNR9Eft3r2fZmHKN/Qp803DFu8jOGOp2kYH65suPsP+8r/wQSOr7bXaYyLlATkcq8KC0 hdS62uUBENf56OY/CMUS5Q7XsG23ZgJN1hTH4KIGdo2ntmxhyAH7VW4ULXTUNisi1ywJ 8pE31Ob63E4MuLVUewvHR0ftfo0MjnFEa/ktOBXNUvrdTe2z7Fx32b5DknLeYXQycEl1 wmYwEn/8vlrC93u0rcSoqp8OC3sOHiXbISpf6aKgmV1KoLvuD8d8dg5p9LWuJMJC93Xo trfQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1774648970; x=1775253770; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=kxcuoXnFgaLDF1x/7FndJKAXg7Vim+VJxxTqfDoYIdY=; b=VQ8WnR91UH7tucmBJKZddg0thQcfYBN9RrCQVupE03vePDYoR2uXv2LfU0auoThwof XSkOvv2ORAd0n43ptr+q9p8oWp3JhMOYyC8teq+dMA4W2QFv7cz1IpvDMjqcFL39fBsv 65Gz/SelhIzA50cu3DCfG07WMWWfgpeQ6HpaDkoyLQLi8xc3zqgQHdc7CCtynfC0jdbL yDquuQBLXOlDgTg1l8mrq+o7edOJaO5YKTlh4/x8yEQLSBUMOT9I//k5OLydcB70t8WI ox0YaAVwiBfhQ4NLSBoz+U8UrBZS+oB8P6lawsqwCA8Vy6grgQrIXxv+xt5n0k72mrbN GYlg== X-Forwarded-Encrypted: i=1; AJvYcCW0i3YhsDjeD/qt5lk/AVeWhEmB8Q7vBOJxZaLdBZ5VLJ/c9u+Op9vmFakYa9SPTBgb9YSkYFfjiZYKC3iE@lists.postgresql.org X-Gm-Message-State: AOJu0YzVRnufhPfGlu43rsfWUPKklH5Ph1tXtEQvxbRbq/+timVu+jLw I8x5sRo/dIRxEsMc93oS9bB78waJbOXDn3AleuPBIuyAFbiToBDR8NxU62m6ZF8Itk9C8tQ7ua+ q/ywZJKC3nwytjQYRa6Dmf225F6VFYocUslRbFTOKLA== X-Gm-Gg: ATEYQzwx9uLG4RlJUeYUpezr4YIs70a7j3iK2it4yekN/1mSfYA8VR3B7tg0IVvdAoL WiTIYeQ5Ta3zY4wJCLpGz03Hq9rKu+82K1C0xGksKddAjhjJs71xBuF+wH9pRn9tHkKIsCpWXjz D47vSIIKSABhDr+gqURICpuR8gyy1PCnL0rHfmN2t/Q0dbf9AnYyAN7tGja0D4iFmEchzknjhzL PMigPE88asHJLG1jybGA+8xedvfrsRvlNXtihyJDizVeTjQfR0ZvKh0dGLl/+tAfuBJ/y/u44zi ZuVK8j5nzHvZOMoEyGrZEE/5dl6zG5RrJJ2+I8UHWRMXZmNElIPzWA== X-Received: by 2002:a05:6000:40dd:b0:439:b8b2:fad0 with SMTP id ffacd0b85a97d-43b9e9da56amr6235382f8f.5.1774648969454; Fri, 27 Mar 2026 15:02:49 -0700 (PDT) MIME-Version: 1.0 References: <62qc7j3mvsyz6ucd7xh7pv7w3u7rhztevsmrzsig7fyzv6yvol@uoyjq4eelcnz> In-Reply-To: <62qc7j3mvsyz6ucd7xh7pv7w3u7rhztevsmrzsig7fyzv6yvol@uoyjq4eelcnz> From: Peter Geoghegan Date: Fri, 27 Mar 2026 18:02:22 -0400 X-Gm-Features: AQROBzBnLKALGmKc1DzKssv-FTlrUhlBTJVnHijd2vkwowrAizJzgahy25HW-cQ Message-ID: Subject: Re: index prefetching To: Andres Freund Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Mar 26, 2026 at 5:47=E2=80=AFPM Andres Freund wrote:> > I must admit I'm unsure how to evaluate the maximum number of batches. > > It can make sense to pursue diminishing returns. But up to what point, > > and according to what principle? > > I think the theoretical amount of required IO concurrency can be calculat= ed > based on the storage latency and IOPS. IIRC it is > > iops_qd1 =3D (1000 / latency_ms) > queue_depth =3D IOPS / iops_qd1 > queue_depth =3D IOPS / (1000 / latency_ms) > So, to be able to fully utilize current hardware with one query, we need = to be > able to reach queue depth in the low hundreds, in the case of striped che= ap > cloud SSDs. That's when a backend *just* does IO, nothing else. That sounds like a useful starting point. But empirically, as far as I can tell, the relationship between query latency and how close you are to fully saturating I/O is not linear, or anything like it. Maybe it's an S-curve? It seems that a fully I/O-bound query isn't remotely close to twice as fast as the same query when it is restricted to using only half the number of batches (half the number required to reach that saturation point). OTOH, using more batches than strictly necessary usually isn't much of a problem. So I don't think that we can rely on a precise formula, even if we're willing to make fixed assumptions about data layout (which we're not). If there were a good enough reason for index prefetching to use an unbounded number of batches, we could surely figure out a way to support that requirement. It'd be messy, and relatively hard to test. And I'd worry a bit about there being zero backstop for index-only scans. But it can be done. If we did things that way (which doesn't seem like a good idea right now), we wouldn't have to model I/O saturation at all. Which tbh makes me wonder if that kind of modelling has much practical use either way. > Something like an index scan, will have its own limit to how much it can > process in a second. If we can only do 100k IOPS while searching the inde= x, > fetching the heap tuples and processing them, we don't need to support th= e > queue depths to support doing 1M IOPS within one backend. > > That's something that can presumably be quite easily experimentally > ballparked: > > A fully cached, completely uncorrelated, index scan seems to be able to f= etch > about 1.5M page fetches on my ~6 YO server CPU with turbo boost disabled,= when > never looking at the results (i.e. using OFFSET) or immediately filtering= away > the row. > So I'd guess the limit on newer CPUs in SKUs optimized for clock > speed and boost enabled, is north of 2.5M pages/sec, higher than I'd have > thought! That's without doing any IO though. We've done good work on nbtree's ability to avoid provably unnecessary work in recent years; see _bt_set_startikey. What that means is that the majority of the index scans used to test the patch probably have _bt_readpage calls that spend most of their time simply collecting all of the TIDs from the leaf page, without any scan key overhead (barring an initial precheck within _bt_set_startikey once per _bt_readpage, to prove that the optimization is safe). With large posting list tuples, we'll do even less work, since they're just an array of ItemPointerData. > With correlated scans the limit is much lower, maybe 150k, just because > there's so many more tuples per page (and processing them trivially becom= es > the bottleneck). > > > So, to support actually utilizing the full IO IO capability, we need to a= llow > for enough batches to keep a few hundred IOs in flight at the very extrem= e > end. I'd assume you have a much better idea to how many batches that > translates to? I can give you a range. The problem is that it's a range starting from "absurdly optimistic" through to "absurdly pessimistic". Neither extreme is very unlikely (there's a wide natural variation in workloads), and it's hard to argue usefully about what will be true in most cases. In short, I can tell you plenty, but nothing that seems particularly useful for determining how many batches we should cap the ring buffer at. I don't think there's anything fundamentally objectionable about our deriving the current maximum of 64 through trial and error. I assume that INDEX_SCAN_MAX_BATCHES must be constrained to a low-ish power of two so that the ring buffer maintenance routines avoid DIV instructions (from the use of a modulo operator that the compiler cannot optimize into a bitwise AND). There just aren't that many integers that even qualify as candidates! I'm pretty sure that 32 is likely too low (though it's hard to tell with buffered I/O on a fast local SSD). 128 might still be too low in extreme corner cases involving high latency and few matches per batch (though I doubt it). 256 seems too implausibly high to ever make sense (but I've been wrong before). -- Peter Geoghegan