Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8oXM-000uDd-0W for pgsql-hackers@arkaria.postgresql.org; Sat, 04 Apr 2026 00:06:52 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w8oXJ-00Ed2u-1N for pgsql-hackers@arkaria.postgresql.org; Sat, 04 Apr 2026 00:06:49 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8oXI-00Ed2m-3A for pgsql-hackers@lists.postgresql.org; Sat, 04 Apr 2026 00:06:49 +0000 Received: from mail-ed1-x52c.google.com ([2a00:1450:4864:20::52c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w8oXG-00000000Sq9-2w20 for pgsql-hackers@postgresql.org; Sat, 04 Apr 2026 00:06:48 +0000 Received: by mail-ed1-x52c.google.com with SMTP id 4fb4d7f45d1cf-66a33f61d80so4100423a12.0 for ; Fri, 03 Apr 2026 17:06:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1775261204; cv=none; d=google.com; s=arc-20240605; b=goxpxxW+sOT343W96+pXh+aj1MUeUlsRlpbvK+mpkNYMdYql7D0+VFJAaCGi9cb5/H s9u+NT3NPrMEGqD2eswyawlNug3BAErF8S8lfRkpRsWAcdJO41hNh4iBcCwOEZcZWYow ekGap03goc9FB0YOQGkRI2d+SJPde8/NuA5zlumgHHNCbJuk4g3jNTBRrEoNnTQOwfv1 rHoZ5pab6qPeNXJbxE5345BlcRYXzHUD71KViRXj2aOnJqKLCXFyvAlhrvoXZo5BMkOn 1vSuUwkgVU3Gel/F8YYIZUo3427MFixK7H8kaqy/oa/GTURdF3uSoTRuJlQ7f/k9Diij Dfsw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=K2SoLGsQKB60VzIllI29GkYCCammTi88ENYRi/momFU=; fh=62GROJdbzln/QsGjE7a71YRuD5BIah4s1ki9oo1aJ7c=; b=JuvvSjwzcGNs1NeQ+e1r9inhWvhLvIA6fzDT/7EsQSWTusmKn/PaYskbhsmw6EOUGL kWx2I4lB5qaXeESwkcjPxwG5Sp5Q9986cCFSeNoe38V7kVKH3CxYZxQal8wli4O1rahJ bzS0Pn66dFhV9nZZsSIMXaVdEWGmXU8Hz7PTGNgLerJaxoqyA7hbudgMnFKK6+/LIvqQ KbypN0Oamq2dMAQBC7oRpKUuPT8XCP7BbQuGU3s4p1vTGGGuMvmNzs6ltPLIE6yRAwKP t40j8y5MHcdow95Duj1+D7VPLIXu0pRNkAHrppf8COFx6rYy7lYFj+EOyp3VOucnUFAC sxXQ==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1775261204; x=1775866004; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=K2SoLGsQKB60VzIllI29GkYCCammTi88ENYRi/momFU=; b=hOurPO6nW930Ba7uwZCjVH5K/BOzRO/AbJHiXfWY3ULeDLMww2YYvVumOeOBIpHZFm IByu8C5aJeTw3LvGT+/cGVeiRuVPtR89eEBKFrog45zKuV5fsfCX9aeKUbTgTGX53TXa y5pIFj27mTG1rr5OsgjWVTzEcjJynCvXo4jeEwL1KvHvfSivQMvmPfWZixZreTIey0B5 w2FfUK/WDbNMDUDxYCMg0qfWAJ1KjaUzkcxyHPx6mIlc0x1kq/RNgFhtPKUAnfBNdEzM cOMc3vsgGJkSmp7WI3Oc3+MbG6M0OC2gjchdgJS10n79d9Tk7zMTzscetWhrWNsetn/K BN4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1775261204; x=1775866004; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=K2SoLGsQKB60VzIllI29GkYCCammTi88ENYRi/momFU=; b=XbqJ08hac9Hho5G1euDaeM3aBEk7K/obwGtO7HP7fq3teyf7txqPWVRcjbBszSgLmB Hujho6fVvQv4Tj2Gpg23yJdS1uKW4kzoQDtAgPdKRtU/XT1xv5QthXiilSBO/NAMcEdm XKJomCi9kz+63q3zL4m64iOA8VFIuiB13kmaZ6gnM27XuFuOVFHpdDJsDZKvydWBf9KO IJBQs0pZVSR/+yA9FXO16yP/4NlzScKq+h3988WdhJ9GhC+JcZTNqiYrYnbFqDESvkfH GQAf6lT9t041N8evz46tBOGMUQwjEKdg7gjYhwCP9SAwKhohCyXOlT/SPdS2N4cEkFtn Hk8A== X-Gm-Message-State: AOJu0YyebRwvoX9v/zENQGLNyuWz/aWhV6Sv3MuxLgiUlWVrYOcDf4LZ K4nRDKqIZ+V82dH20CfxSpZOQH1TKhjsLB53b+M+3ZlUa534xHo+ZB3fPr2Xy0ITC++SrRQoKzu 4PQVIu+1vvNDVUM5POSgi8Qfy8qNGcOgC+XCd X-Gm-Gg: AeBDiesufJ+kJAIV0zMNpOfhsic6niomZ8pRsJhYwdobCZdK0ELUv1ZNT4c/ylmqG4G V+4Dl2T72gA8cKxrzihuA+aV66gt3bRXOZUoZ7vXSUAer8LAZbHOhovoh1tJX3CQWQaUXu6Wbco X1dAC6s+31Y46z1NEdcVj5MxuLeO8YPwOlsjczGgF+N6zuA6WrSMyE3xvYccXeJizALYOw2PbBy QRHnxn1qEJCGZx1ydNdm8fy7GTrJZnA0Jj1SFm4oteAg+oC6N6M+yntdgtEnlCfEdkL1uob9kpN NvkZVSz6s7pb7FsJfAE90PrNT1pKThR1VTGVtHUUEmw+FP5dVxP2hYIO5vo2QU89GrPgTvAapAi ONWuedu3j X-Received: by 2002:a05:6402:3228:b0:66e:5c60:761e with SMTP id 4fb4d7f45d1cf-66e5c607691mr1406924a12.0.1775261204118; Fri, 03 Apr 2026 17:06:44 -0700 (PDT) MIME-Version: 1.0 References: <24bjkmnkuapbs7wvcecvtrb3gvbrzg3extlkzpbg2f7dwt7h42@3e4vg6cd33iw> In-Reply-To: <24bjkmnkuapbs7wvcecvtrb3gvbrzg3extlkzpbg2f7dwt7h42@3e4vg6cd33iw> From: Melanie Plageman Date: Fri, 3 Apr 2026 20:06:32 -0400 X-Gm-Features: AQROBzCvg2Y9jYv9P6wdq0BEihuRNiJ4d2zhqa8uGpDNE57AYdM_1P3D6T6YNIM Message-ID: Subject: Re: AIO / read stream heuristics adjustments for index prefetching To: Andres Freund Cc: pgsql-hackers@postgresql.org, Thomas Munro , Peter Geoghegan , Tomas Vondra , Nazir Bilal Yavuz Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Fri, Apr 3, 2026 at 4:36=E2=80=AFPM Andres Freund w= rote: > > Attached are a revised set of commits. 0001 aio: io_uring: Trigger async processing for large IOs LGTM 0002 read_stream: Move logic about IO combining & issuing to helpers LGTM 0003 read stream: Split decision about look ahead for AIO and combining Stale comment reference to stream->distance around line 505 Typo above ReadStream->combine_distance: "The limits for read-ahead an combining" should be "and combining". typo in comment in read_stream_look_ahead() "even if we are reached the read-ahead limit" I think you meant "even if we have reached the read-ahead limit" I didn't really notice this before * doing full io_combine_limit sized reads. */ if (flags & READ_STREAM_FULL) - stream->distance =3D Min(max_pinned_buffers, stream->io_combine_lim= it); + { + stream->readahead_distance =3D Min(max_pinned_buffers, stream->io_combine_limit); + stream->combine_distance =3D Min(max_pinned_buffers, stream->io_combine_limit); + } that distance started out at io_combine_limit for READ_STREAM_FULL. That doesn't make that much sense to me, wouldn't we set readahead_distance to max_pinned_buffers() for READ_STREAM_FULL? + /* + * Allow looking further ahead if we are in the process of building a + * larger IO, the IO is not yet big enough and we don't yet have IO in + * flight. Note that this is allowed even if we are reached the + * read-ahead limit (but not the buffer pin limit, combine_distance is + * capped by it and we are checking for pinned_buffers =3D=3D 0). + * + * The reason this is restricted to not yet having an IO in flight is t= hat + * once we are actually reading ahead, we will not issue IOs before the= y + * have reached the full size (or can't be grown further). But we *have= * + * to issue an IO once pinned_buffers =3D=3D 0, otherwise there won't b= e a + * buffer to return to the caller. + * + * This is important for cases where either effective_io_concurrency is + * low or we never need to wait for IO and thus are not increasing the + * distance. Without this we would end up with lots of small IOs. + */ + if (stream->pending_read_nblocks > 0 && + stream->pinned_buffers =3D=3D 0 && + stream->pending_read_nblocks < stream->combine_distance) + return true; + I'd strongly recommend an oxford comma here for clarity: "larger IO, the IO is not yet big enough and we don't yet have IO in" -> "larger IO, the IO is not yet big enough, and we don't yet have IO in" when you say "combine_distance is capped by it" -- what is it here? Do you mean max_pinned_buffers? It is ambiguous like maybe you could mean the read-ahead limit you just mentioned Maybe rephrase that whole parenthetical like "Note that this can exceed the read-ahead limit. It cannot exceed the buffer pin limit because pinned_buffers =3D=3D 0 and combine_distance is capped by max_pinned_buffers." And then the next paragraph also is confusing * The reason this is restricted to not yet having an IO in flight is t= hat * once we are actually reading ahead, we will not issue IOs before the= y * have reached the full size (or can't be grown further). But we *have= * * to issue an IO once pinned_buffers =3D=3D 0, otherwise there won't b= e a * buffer to return to the caller. "restricted to not yet having an IO in flight is awkward" Maybe just say which part you are referring to (pinned_buffers =3D=3D 0) My understanding about the pinned_buffers =3D=3D 0 criteria, is that we specifically want to perform the look-ahead even if we would exceed readahead_distance when this is the first IO -- and we haven't submitted other I/Os yet. Like when eic is low but we still want to do read combining. But this comment makes it sound like it is a control flow issue (e.g. that there "won't be a buffer to return to the caller") I asked AI to come up with something better, and this is what it said: "The pinned_buffers =3D=3D 0 restriction is important: if we already have buffers to return to the caller, we can let the normal issue logic in read_stream_should_issue_now() decide when to submit. But if we have no pinned buffers, we must keep building the pending IO until it reaches combine_distance (or can't grow further), because otherwise we'd submit a small IO and have nothing to return while waiting for more." I'm not sure it's 100% correct, but maybe you can rework your version of the paragraph a bit. 0004 read_stream: Only increase read-ahead distance when waiting for IO As of this commit READ_STREAM_FULL will never actually get up to the max_pinned_buffers if needed_wait is never true. That's probably fine, but it is different than before. Anyway, LGTM. 0005 - Allow read_stream_reset() to not wait for IO completion I can't comment on whether or not doing this in resowner cleanup is a good architectural decision, as I don't know enough about that part of postgres. /* Stop looking ahead. */ - stream->readahead_distance =3D 0; - stream->combine_distance =3D 0; + stream->readahead_distance =3D -1; + stream->combine_distance =3D -1; I don't remember why we couldn't have a separate boolean field for end-of-stream/stream-paused/stream-reset-in-progress? Did you ever figure it out? Separately, maybe we should forbid calling pgaio_wref_abandon() on IOs that haven't been submitted yet. You can get a wref before the IO is submitted. And I feel like weird stuff could happen if we let people call pgaio_wref_abandon() on defined or staged IOs. For example, resowner cleanup may take a very long time because of submitting and waiting for a bunch of IOs. And IO handles are released before locks, and I think we don't want to spend a long time waiting cleaning up IO handles while holding locks. Then I was thinking maybe you _do_ want to allow calling pgaio_wref_abandon() on IOs that haven't been submitted because pgaio_io_release() only works on PGAIO_HS_HANDED_OUT IOs and not on defined or staged ones If you don't forbid calling pgaio_wref_abandon on unsubmitted IOs, then you'd probably want to restructure the code in ResourceOwnerReleaseInternal() to loop through and submit any unsubmitted IO first and then wait for IOs after. pgaio_wref_abandon() does not invalidate the wait ref (which seems fine to have the caller do it), but you should mention that and make it clear that it doesn't do anything to the wait ref itself. This isn't totally clear especially because your comment says * Declare that a wait reference to an IO, started by this backend, is not = of * interest to this backend anymore. really what it is doing is declaring that the caller is no longer interested in receiving the IO result and a weird future caller might want to abandon checking the result and still wait for completion separately on the wait ref, so making it sound like this function disables that isn't right either In pgaio_wref_abandon(), it seems like you could get rid of this guard if (ioh->report_return) for clarity. in the context of pgaio_wref_abandon() it shouldn't ever be NULL, right? Then it doesn't save you anything and implying it could be NULL here is confusing + HOLD_INTERRUPTS(); + + /* + * It is safe to perform this check before checking if the IO was recyc= led + * as the owner of an IO cannot change. + */ + if (!am_owner) + elog(ERROR, "only IOs owned by current backend can be abandoned"); + no reason for the error to be after the HOLD_INTERRUPTS(), right? I thought it made it harder to understand the comment about why we need to hold interrupts here and if it can just be moved above, then that's better In AbandonReadBuffers() header comment: "This needs to called" -> "This needs to be called" I do find it a little weird that if someone calls AbandonReadBuffers() and doesn't remember to release the buffer pin, it won't get released... and in our case, it doesn't just happen in the caller -- it happens in the caller's caller I wonder if we are going to want the option for someone to abandon all the buffers in read_stream_pause()? Right now if you call read_stream_reset() it overrides the resume_distance, so if you wanted to restart the stream abandoning all existing IOs but keep up the distance, you won't be able to do that. It will revert down to 1. @@ -538,7 +539,7 @@ read_stream_should_issue_now(ReadStream *stream) * If the callback has signaled end-of-stream, start the pending read * immediately. There is no further potential for IO combining. */ - if (stream->readahead_distance =3D=3D 0) + if (stream->readahead_distance <=3D 0) return true; I don't think you should get here when readahead_distance < 0. Maybe we should assert that? I didn't look at any of the test stuff yet. - Melanie