From: Melanie Plageman
Date: Fri, 3 Apr 2026 11:46:59 -0400
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
To: Andres Freund
Cc: pgsql-hackers@postgresql.org, Thomas Munro, Peter Geoghegan,
    Tomas Vondra, Nazir Bilal Yavuz

On Thu, Apr 2, 2026 at 11:47 AM Andres Freund wrote:
>
> What do you think about the updated patch to achieve that that I posted?

Here is some review of the earlier-posted 0005 and 0006.

concrete things:
-------
- I'd reorder the stream->distance == 0 check in
  read_stream_look_ahead() (and issue); I found myself asking why it
  wasn't first
- I agree with Bilal that

      if (stream->pinned_buffers + stream->pending_read_nblocks >=
          stream->max_pinned_buffers)
          return false;

  should be in the other commit, because it isn't required with the
  current code (because of how distance is set) and it is more
  confusing. Wasn't it an assert before?
- perhaps the new heuristic for allowing further look ahead if we are
  building an IO and it isn't big enough yet should be in its own
  commit
- why have read_stream_pause() save combine_distance if it isn't going
  to zero it out?
- I think the comments you added to the read stream struct for
  combine_distance and readahead_distance should indicate units (i.e.
  blocks)
- why not remove the combine_distance requirement from the fast path
  entry criteria (could save resume_combine_distance in the fast path
  and restore it after a miss)
- can we move the fast path to where we find out we got a hit?
- should_issue_now has a lot of overlap with should_read_ahead, which
  makes it confusing to figure out how they are different, but I think
  that is related to readahead_distance being in units of blocks and
  not IOs (which I talk about later)
- There are also some assorted typos and comment awkwardness that I
  assume you will clean up in the polish step (e.g. "if we have an the
  process" -> "if we are in the process", "oveflow" -> "overflow"; the
  NB comment still says stream->distance == 0 but should say
  stream->readahead_distance == 0)

some more ambiguous stuff:
-------
> In my experiments the batching was primarily useful to allow to reduce the
> command submission & interrupt overhead when interacting with storage, i.e.
> when actually doing IO, not just copying from the page cache.

It still seems to me that batching would help guarantee parallel
memcpys for workers when data is in the kernel buffer cache, because if
the IO is quick, the same worker may pick up the next IO.

---

You mentioned that we don't want to read too far ahead (including for a
single combined IO) in part because:

> The resowner and private refcount mechanisms take more CPU cycles if you have
> more buffers pinned

But I don't see how either distance is responding to this or
self-calibrating with this in mind.

---

>> I liked the idea of being more aggressive to do IO combining. What is
>> the reason for gradually increasing combine_distance, is it to not do
>> unnecessary IOs at the start?
>
> Yea.
> It'd perhaps not be too bad with the existing users, but it'd *really*
> hurt with index scan prefetching, because of query plans where we only consume
> part of the index scan (e.g. a nested loop antijoin).

You said we have to ramp up combine_distance because, in index
prefetching, reading ahead can hurt when we don't need all the blocks.
But if we ramp it up unconditionally (assuming we did IO), then I don't
see how this would solve the problem, as it will ramp up regardless
after just a few IOs anyway.

---

> I've been experimenting with going the other way, by having
> read_stream_should_look_ahead() do a check like

    /*
     * If we already have IO in flight, but are close enough to to the
     * distance limit that we would not start a fully sized IO, don't even
     * start a pending read until later.
     *
     * This avoids calling the call thing the next block callback in cases we
     * would not start the pending read anyway. For some users the work to
     * determine the next block is non-trivial, so we don't want to do so
     * earlier than necessary.
     *
     * A secondary benefit of this is that some callers use parallel workers
     * with each their own read stream to process a global list of blocks, and
     * only calling the next block callback when ready to actually issue IO
     * makes it more likely for one backend to get consecutive blocks.
     */
    if (stream->pinned_buffers > 0 &&
        stream->pending_read_nblocks == 0 &&
        stream->pinned_buffers + stream->combine_distance >= stream->readahead_distance)
        return false;

So, as you say, with, for example, a large io_combine_limit and a small
effective_io_concurrency, this would result in much lower IO
concurrency than we want. But I think this speaks to the central
tension I see with the new combine_distance and readahead_distance: it
seems like readahead_distance should now be in units of IO and
combine_distance in units of blocks. If that were the case, this
heuristic wouldn't have a downside.
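
The units tension can be sketched with made-up numbers. Everything
below is a simplified, hypothetical stand-in rather than the patch's
code: StreamSketch, defer_look_ahead_blocks() and
defer_look_ahead_ios() are invented names, and the IO-counted variant
is only one assumed way the check might look if readahead_distance
were expressed in IOs.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant read stream fields. */
typedef struct StreamSketch
{
    int pinned_buffers;         /* blocks currently pinned */
    int pending_read_nblocks;   /* blocks in the read being built */
    int combine_distance;       /* in blocks */
    int readahead_distance;     /* in blocks */
} StreamSketch;

/*
 * The quoted check, with both distances in blocks.  Returns true when
 * look-ahead would be deferred (the quoted code returns false from
 * read_stream_should_look_ahead() in that case).
 */
static bool
defer_look_ahead_blocks(const StreamSketch *s)
{
    return s->pinned_buffers > 0 &&
           s->pending_read_nblocks == 0 &&
           s->pinned_buffers + s->combine_distance >= s->readahead_distance;
}

/*
 * Assumed variant: the readahead budget is counted in IOs, so the size
 * of each combined IO no longer eats into the concurrency budget.
 */
static bool
defer_look_ahead_ios(int ios_in_flight, int readahead_ios)
{
    return ios_in_flight >= readahead_ios;
}
```

With combine_distance = 16 and readahead_distance = 24 (both in
blocks), a single 16-block IO in flight already makes
defer_look_ahead_blocks() return true, because 16 + 16 >= 24, so no
second IO is started until buffers are consumed. Counting IOs instead
(1 in flight against an assumed budget of 4) does not defer.
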
Obviously ramping readahead_distance up and down when it is in units of
IO becomes much more fraught though. But our criteria for wanting to
make bigger IOs are different from our criteria for wanting more IOs in
flight.

---

I think it is weird to have combine_distance only be relevant when
readahead_distance is low. You said:

> We could, but I don't think there would be a benefit in doing so. In my mind,
> what combine_distance is used for, is to control how far to look ahead to
> allow for IO combining when the readahead_distance is too low to allow for IO
> combining. But if pinned_buffers > 0, we already have another IO in flight,
> so the combine_distance mechanism doesn't have to kick in.

But it seems like bigger IOs are still beneficial for a completely
uncached workload. We may want to avoid reading ahead to get bigger IOs
when there is a chance we only need a few blocks. And we want to
parallelize copies across workers, so we want a combine_distance that
is small enough, relative to effective_io_concurrency, that we can
split IOs across workers some. But otherwise we want maximally sized
IOs.

In the medium term, perhaps we can try to use executor hints to handle
the former. And, assuming you think readahead_distance in IOs is a
non-starter, maybe we could use a heuristic that tries to balance IOs
by dividing readahead_distance by combine_distance when
combine_distance is sufficiently high, or something like that.

- Melanie