From: Melanie Plageman
Date: Tue, 31 Mar 2026 16:59:14 -0400
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
To: Andres Freund
Cc: pgsql-hackers@postgresql.org, Thomas Munro, Peter Geoghegan, Tomas Vondra, Nazir Bilal Yavuz

On Tue, Mar 31, 2026 at 12:02 PM Andres Freund wrote:
>
> 0005+0006: Only increase distance when waiting for IO

In "aio: io_uring: Trigger async processing for large IOs" (0005), the
first sentence of the commit message is incomplete.

Is there any reason for both the io size and inflight IOs threshold to
be 4? If they should be the same, I think it would be better if this
was a macro.

This may not matter, but the old code checked in_flight_before > 5
before incrementing it for the current IO. The new code counts it after
pushing the current IO onto the submission list. So the new way is
slightly more aggressive.

0006 ("read_stream: Only increase distance when waiting for IO") looks
good to me from a code perspective. I don't yet have ideas for handling
potential parallel bitmapheapscan regressions.
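
To make the 0005 counting question concrete, here is a minimal C sketch.
It is not the code from method_io_uring.c: the function names, the
IO_ASYNC_THRESHOLD macro, and the comparison operators are all
assumptions made for illustration. It shows that, with the same
threshold, checking after the current IO has been counted triggers one
IO earlier than checking before, and that a single macro keeps the size
and in-flight thresholds tied together.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical shared threshold (name and value are assumptions).  Using
 * one macro for both the IO-size check and the in-flight check means the
 * two limits cannot drift apart by accident.
 */
#define IO_ASYNC_THRESHOLD 4

/* Old-style ordering: test the count before the current IO is added. */
static bool
force_async_count_before(int in_flight_before, int threshold)
{
	assert(in_flight_before >= 0);
	return in_flight_before > threshold;
}

/* New-style ordering: the current IO has already been counted. */
static bool
force_async_count_after(int in_flight_before, int threshold)
{
	int		in_flight = in_flight_before + 1;	/* includes current IO */

	assert(in_flight_before >= 0);
	return in_flight > threshold;
}

/*
 * Combined trigger driven by one macro.  The ">=" comparisons are an
 * assumption; the real patch may compare differently.
 */
static bool
should_force_async(int in_flight, int io_blocks)
{
	return in_flight >= IO_ASYNC_THRESHOLD ||
		io_blocks >= IO_ASYNC_THRESHOLD;
}
```

With a threshold of 5 and five IOs already in flight, the count-before
check returns false while the count-after check returns true, which is
the "slightly more aggressive" behavior described above.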

> Unfortunately with io_uring the situation is more complicated, because
> io_uring performs reads synchronously during submission if the data is in the
> kernel page cache.  This can reduce performance substantially compared to
> worker, because it prevents parallelizing the copy from the page cache.
> There is an existing heuristic for that in method_io_uring.c that adds a
> flag to the IO submissions forcing the IO to be processed asynchronously,
> allowing for parallelism.  Unfortunately the heuristic is triggered by the
> number of IOs in flight - which will never become big enough to trigger
> after using "needed to wait" to control how far to read ahead.
>
> So 0005 expands the io_uring heuristic to also trigger based on the sizes
> of IOs - but that's decidedly not perfect, we e.g. have some experiments
> showing it regressing some parallel bitmap heap scan cases.  It may be
> better to somehow tweak the logic to only trigger for worker.

Trigger which logic only for worker? Do you mean only increasing the
distance when waiting?

> As is this has another issue, which is that it prevents IO combining in
> situations where it shouldn't, because right now using the distance to
> control both.  See 0008 for an attempt at splitting those concerns.

Even if you can't combine into a single IO, it seems like a low
distance is problematic because it degrades batching and causes us to
have to call io_uring_enter for every block (I think). At least when I
was experimenting with this, the syscall overhead seemed
non-negligible. It's also true that this meant the memcpys couldn't be
parallelized, but system call overhead also seems to have been a
factor.

Setting aside more complicated prefetching systems, what it seems like
we are saying is that for all "miss" cases (not in shared buffers) a
distance above 1 is advantageous (unless we are only doing 1 IO).
I wonder if there is something hacky we can do, like not decaying the
distance below io_combine_limit if there has been a recent miss, or
growing it up to at least io_combine_limit if we aren't getting all
hits.

> 0007: Make read_stream_reset()/end() not wait for IO
>
> This is a quite experimental, not really correct as-is, patch to avoid
> unnecessarily waiting for in-flight IO when read_stream_reset() is done
> while there's in-flight IO.  This is useful for things like nestloop
> antijoins with quals on the inner side (without the qual we'd not trigger
> any readahead, as that's deferred in the index prefetching patch).
>
> As-is this will leave IOs visible in pg_aios for a while, potentially
> until the backends exit.  That's not right.

Separating the problems: the handle slot exhaustion seems like it could
be solved by having the backend process discard IOs when it needs a
handle and there isn't a free one. Or is that not work we want to do in
a hot path?

The pg_aios view problems seem solvable with a flag on the IO like
"DISCARDED".

But the buffers staying pinned is different. It seems like you'll need
the backend to process the discarded IOs at some point. Maybe it should
do that before idling waiting for input?

When discarding IOs, I don't really understand why the foreign IO path
doesn't just clear its own wait ref (not the buffer descriptor one) and
move on -- instead you have it wait.

I haven't finished reviewing 0008 yet.

> One thing that's really annoying around this is that we have no infrastructure
> for testing that the heuristics keep working. It's very easy to improve one
> thing while breaking something else, without noticing, because everything
> keeps working.

Agreed that something here would be useful.

- Melanie