From: Melanie Plageman
Date: Fri, 3 Apr 2026 15:04:51 -0400
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
To: Andres Freund
Cc: pgsql-hackers@postgresql.org, Thomas Munro, Peter Geoghegan, Tomas Vondra, Nazir Bilal Yavuz

On Fri, Apr 3, 2026 at 1:30 PM Andres Freund wrote:
>
> > - why not remove the combine_distance requirement from the fast path
> > entry criteria (could save resume_combine_distance in the fast path
> > and restore it after a miss)
>
> Because entering the fast path prevents IO combining from happening. So it's
> absolutely crucial that we do *not* enter it while we would combine.

But if it is a buffer hit, obviously we can't do IO combining anyway,
or am I misunderstanding the fast path's common case?

> > > In my experiments the batching was primarily useful to allow to reduce the
> > > command submission & interrupt overhead when interacting with storage, i.e.
> > > when actually doing IO, not just copying from the page cache.
> >
> > It still seems to me that batching would help guarantee parallel
> > memcpys for workers when data is in the kernel buffer cache because if
> > the IO is quick, the same worker may pick up the next IO.
>
> When you say workers, do you mean for io_method=worker? Or the internal kernel
> threads of io_uring for async IOs?

I'm talking about io_method worker.

> > You mentioned that we don't want to read too far ahead (including for
> > a single combined IO) in part because:
> >
> > > The resowner and private refcount mechanisms take more CPU cycles if you have
> > > more buffers pinned
> >
> > But I don't see how either distance is responding to this or
> > self-calibrating with this in mind
>
> Using the minimal required distance to avoid needing to wait for IO completion
> is responding to that, no? Without these patches we read ahead as far as
> possible, even if all the data is in the page cache, which makes this issue
> way worse (without these patches it's a major source of regressions in the
> index prefetching patch).

But we aren't using the minimal distance to avoid needing to wait for
IO completion. We are also using a higher distance to try to get IO
combining and to allow for async copying into the kernel buffer cache,
etc. There are a lot of different considerations; it isn't just two
opposing forces. And I'd imagine that the relationship between the
number of buffers pinned and the CPU cycles required for
resowner/refcount isn't perfectly linear.

> > I think it is weird to have combine_distance only be relevant when
> > readahead_distance is low. You said:
> >
> > > We could, but I don't think there would be a benefit in doing so. In my mind,
> > > what combine_distance is used for, is to control how far to look ahead to
> > > allow for IO combining when the readahead_distance is too low to allow for IO
> > > combining.
> > > But if pinned_buffers > 0, we already have another IO in flight,
> > > so the combine_distance mechanism doesn't have to kick in.
> >
> > But it seems like for a completely uncached workload bigger IOs is
> > still beneficial.
>
> Massively so - the storage layer getting too small IOs really hurts.
>
> But, as the code stands, we *do* end up with large IOs in that case, because
> we will not issue the IO until it is "complete". If we need to do actual IO,
> the readahead_distance will be larger, and allow multiple full sized IOs to be
> issued.
>
> /*
>  * We don't start the pending read just because we've hit the distance limit,
>  * preferring to give it another chance to grow to full io_combine_limit size
>  * once more buffers have been consumed.  But this is not desirable in all
>  * situations - see below.
>  */
> static inline bool
> read_stream_should_issue_now(ReadStream *stream)
> {
> ...
>     /*
>      * If we've already reached io_combine_limit, there's no chance of growing
>      * the read further.
>      */
>     if (pending_read_nblocks >= stream->io_combine_limit)
>         return true;
>
>     /* same if capped not by io_combine_limit but combine_distance */
>     if (stream->combine_distance > 0 &&
>         pending_read_nblocks >= stream->combine_distance)
>         return true;
>
>     /*
>      * If we currently have no reads in flight or prepared, issue the IO once
>      * we are not looking ahead further. This ensures there's always at least
>      * one IO prepared.
>      */
>     if (stream->pinned_buffers == 0 &&
>         !read_stream_should_look_ahead(stream))
>         return true;
>
>     return false;
>
> So, unless we are not reading ahead, we are not issuing IOs until they are
> fully sized (or can't be combined, obviously).
>
> Am I misunderstanding what you're talking about?

I'm not saying that we don't do IO combining at high distances. What I
find confusing is that combine_distance controls how far we look ahead
when readahead_distance is low, but when readahead_distance is high it
controls when we issue the IO rather than how far we look ahead.

I don't think we should change course now, but I wanted to call out
that this felt a little uncomfortable to me.

- Melanie
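
For anyone who wants to poke at the cases discussed in this thread, here
is a minimal standalone sketch of the three checks quoted from
read_stream_should_issue_now(). It is only a model under stated
assumptions: StreamModel, model_should_issue_now() and the
should_look_ahead flag are hypothetical stand-ins, not PostgreSQL's
ReadStream or read_stream_should_look_ahead(), and the elided part of
the real function is not modeled.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified stand-in for the state the quoted checks read.
 * This is NOT PostgreSQL's ReadStream; field names follow the thread.
 */
typedef struct StreamModel
{
    int  pending_read_nblocks; /* blocks in the not-yet-issued read */
    int  io_combine_limit;     /* maximum size of one combined IO, in blocks */
    int  combine_distance;     /* 0 means no combine_distance cap */
    int  pinned_buffers;       /* buffers pinned by reads in flight or prepared */
    bool should_look_ahead;    /* assumed result of the look-ahead check */
} StreamModel;

/* Mirrors only the three conditions visible in the quoted snippet. */
static bool
model_should_issue_now(const StreamModel *s)
{
    /* The pending read cannot grow past io_combine_limit. */
    if (s->pending_read_nblocks >= s->io_combine_limit)
        return true;

    /* Same if the cap is combine_distance rather than io_combine_limit. */
    if (s->combine_distance > 0 &&
        s->pending_read_nblocks >= s->combine_distance)
        return true;

    /* Nothing in flight or prepared, and no further look-ahead: issue. */
    if (s->pinned_buffers == 0 && !s->should_look_ahead)
        return true;

    /* Otherwise keep the pending read open so it can still grow. */
    return false;
}
```

In this model, a pending read below both caps is held back whenever
pinned_buffers > 0, which is the "not issued until fully sized" behavior
described in the quoted reply. A nonzero combine_distance acts purely as
an issue threshold in the second check, which is the second role of
combine_distance pointed out above.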