From: Andres Freund <[email protected]>
To: Melanie Plageman <[email protected]>
Cc: [email protected]
Cc: Thomas Munro <[email protected]>
Cc: Peter Geoghegan <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Subject: Re: AIO / read stream heuristics adjustments for index prefetching
Date: Fri, 3 Apr 2026 19:10:48 -0400
Message-ID: <3gkuvs3lz3u3skuaxfkxnsysfqslf2srigl6546vhesekve6v2@va3r5esummvg> (raw)
In-Reply-To: <24bjkmnkuapbs7wvcecvtrb3gvbrzg3extlkzpbg2f7dwt7h42@3e4vg6cd33iw>
References: <f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu>
<CAAKRu_bfwBzg7=Zy88st6gBJf97Wkd3k=+m1ecApn=59SwmKSw@mail.gmail.com>
<pj4kgtdrevvkfbmlri6p27belctxru7ytyprcb6v74c7zbh3l6@m4dcu2rljedv>
<CAAKRu_b+tymZQc2vv__k8oPHsVQppg9QmaeQLiupZ3t2AO=z3A@mail.gmail.com>
<dj6njk2kfy4ztfduptrk5olv43j2zhebhbz62ip2y7gyeddirm@s2ywu64xxftw>
<CAAKRu_b24CSa1ja8yEu+5PnTpWcpL0EFdRL1TwQye7VffvYjJA@mail.gmail.com>
<24bjkmnkuapbs7wvcecvtrb3gvbrzg3extlkzpbg2f7dwt7h42@3e4vg6cd33iw>
Hi,
There are a bunch of heuristics mentioned in the following proposed commit:
On 2026-04-03 16:36:03 -0400, Andres Freund wrote:
> Subject: [PATCH v5 1/5] aio: io_uring: Trigger async processing for large IOs
>
> io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
> once the IO depth is a bit larger. That heuristic is important when doing
> buffered IO from the kernel page cache, to allow parallelizing of the memory
> copy, as otherwise io_method=io_uring would be a lot slower than
> io_method=worker in that case.
>
> An upcoming commit will make read_stream.c only increase the read-ahead
> distance if we needed to wait for IO to complete. If to-be-read data is in the
> kernel page cache, io_uring will synchronously execute IO, unless the IO is
> flagged as async. Therefore the aforementioned change in read_stream.c
> heuristic would lead to a substantial performance regression with io_uring
> when data is in the page cache, as we would never reach a deep enough queue to
> actually trigger the existing heuristic.
>
> Parallelizing the copy from the page cache is mainly important when doing a
> lot of IO, which commonly is only possible when doing largely sequential IO.
>
> The reason we don't just mark all io_uring IOs as asynchronous is that the
> dispatch to a kernel thread has overhead. This overhead is mostly noticeable
> with small random IOs with a low queue depth, as in that case the gain from
> parallelizing the memory copy is small and the latency cost high.
>
> The facts from the two prior paragraphs show a way out: Use the size of the IO
> in addition to the depth of the queue to trigger asynchronous processing.
>
> One might think that just using the IO size might be enough, but
> experimentation has shown that not to be the case - with deep look-ahead
> distances being able to parallelize the memory copy is important even with
> smaller IOs.
> +/*
> + * io_uring executes IO in process context if possible. That's generally good,
> + * as it reduces context switching. When performing a lot of buffered IO that
> + * means that copying between page cache and userspace memory happens in the
> + * foreground, as it can't be offloaded to DMA hardware as is possible when
> + * using direct IO. When executing a lot of buffered IO this causes io_uring
> + * to be slower than worker mode, as worker mode parallelizes the
> + * copying. io_uring can be told to offload work to worker threads instead.
> + *
> + * If the IOs are small, we only benefit from forcing things into the
> + * background if there is a lot of IO, as otherwise the overhead from context
> + * switching is higher than the gain.
> + *
> + * If IOs are large, there is benefit from asynchronous processing at lower
> + * queue depths, as IO latency is less of a crucial factor and parallelizing
> + * memory copies is more important. In addition, it is important to trigger
> + * asynchronous processing even at low queue depth, as with foreground
> + * processing we might never actually reach deep enough IO depths to trigger
> + * asynchronous processing, which in turn would deprive readahead control
> + * logic of information about whether a deeper look-ahead distance would be
> + * advantageous.
> + *
> + * We have done some basic benchmarking to validate the thresholds used, but
> + * it's quite plausible that there are better values.
Thought it'd be useful to actually have an email to point to in the above
comment, with details about what benchmark I ran.
Previously I'd just manually run fio with different options; I made it a bit
more systematic with the attached (only halfway hand-written) script.
I attached two different results, once when allowing access to multiple cores,
and once with a single core (simulating a very busy machine).
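The mail does not say how the single-core run was restricted. One possible way on Linux, shown here purely as an assumption, is to prefix the benchmark command with util-linux `taskset`. The helper name `pin_to_cpus` is made up for illustration.

```python
def pin_to_cpus(cmd, cpus):
    """Return cmd prefixed so that it only runs on the given CPUs.

    Assumes the util-linux `taskset` tool is installed.
    """
    cpu_list = ",".join(str(c) for c in cpus)
    return ["taskset", "--cpu-list", cpu_list, *cmd]


if __name__ == "__main__":
    # Hypothetical invocation: run the attached benchmark script on CPU 0 only.
    print(pin_to_cpus(["python3", "bench_async_uring.py"], [0]))
```

The returned list can be passed directly to `subprocess.run`.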
(nblocks is in multiples of 8KB)
Multi-core:
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2075 1.5802
1 1 1 1.0462 6.9652
1 2 0 4.1362 3.4533
1 2 1 1.9284 7.6040
1 4 0 4.0030 7.3720
1 4 1 4.2713 6.9086
1 8 0 4.1653 14.4072
1 8 1 4.3301 13.8365
1 16 0 4.1829 28.9216
1 16 1 4.3006 28.1261
1 32 0 4.0735 59.6232
1 32 1 4.3248 56.1614
I.e., at nblocks=1, there's pretty much no gain from async, and the latency
increases markedly at the low end and just about catches up at the high end.
Around an iodepth of 4 the loss from async is nonexistent or minimal.
2 1 0 5.7289 2.4261
2 1 1 1.8708 7.7466
2 2 0 5.7964 5.0144
2 2 1 3.3749 8.7417
2 4 0 5.8434 10.2023
2 4 1 7.9783 7.3977
2 8 0 5.8166 20.7226
2 8 1 8.2545 14.5431
2 16 0 5.8215 41.6613
2 16 1 8.2354 29.3879
2 32 0 5.6530 86.0286
2 32 1 8.3218 58.3826
With nblocks=2, there start to be gains at higher IO depths, but they're still
somewhat limited. Latency already starts to be better at iodepth 4.
4 1 0 7.4131 3.8807
4 1 1 3.2133 9.1827
4 2 0 7.3150 8.0854
4 2 1 5.4983 10.8039
4 4 0 7.2784 16.5097
4 4 1 11.2717 10.5699
4 8 0 7.2873 33.2331
4 8 1 16.6299 14.4164
4 16 0 7.1606 67.8777
4 16 1 16.9794 28.4981
4 32 0 6.2954 154.6834
4 32 1 16.3686 59.3610
With nblocks=4, async shows much more substantial gains. Latency of async at
the high end is also much improved.
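The bw and lat columns are roughly tied together by Little's law: with iod requests in flight, each nblocks * 8 KiB in size, throughput is about (bytes in flight) / (mean latency). The sketch below checks that against the nblocks=4, iod=32 rows above. The function name is hypothetical, and the relation is only approximate.

```python
BLOCK_BYTES = 8 * 1024  # nblocks is in multiples of 8 KiB


def implied_bw_gib_s(nblocks, iodepth, lat_usec):
    """Bandwidth implied by keeping iodepth IOs in flight at a given mean latency."""
    bytes_in_flight = iodepth * nblocks * BLOCK_BYTES
    return bytes_in_flight / (lat_usec * 1e-6) / 1024**3


if __name__ == "__main__":
    # (async flag, reported bw_gib_s, lat_usec) for nblocks=4, iod=32, copied from above
    for force_async, reported, lat in [(0, 6.2954, 154.6834), (1, 16.3686, 59.3610)]:
        implied = implied_bw_gib_s(4, 32, lat)
        print(f"async={force_async} reported={reported:.2f} implied={implied:.2f}")
```

For these two rows the implied and reported bandwidth agree to within about one percent, which suggests the latency improvement and the bandwidth gain at high depth are two views of the same effect.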
8 1 0 8.0403 7.3503
8 1 1 4.6038 12.7202
8 2 0 8.0052 14.9161
8 2 1 8.5176 13.9987
8 4 0 8.1519 29.6698
8 4 1 14.8211 16.1640
8 8 0 7.8525 61.8612
8 8 1 27.5860 17.4434
8 16 0 6.8887 141.3268
8 16 1 34.1307 28.3463
8 32 0 6.9031 282.2350
8 32 1 38.2430 50.7700
With nblocks=8, async is faster already at iodepth 2.
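To make the crossover points described so far easy to re-check, here is a small sketch that finds, for each nblocks, the lowest iodepth at which async bandwidth exceeds sync bandwidth. The bandwidth values are copied from the multi-core rows above (iod 1 through 8 only); the names `MULTI_CORE_BW` and `first_async_win` are made up for illustration.

```python
# MULTI_CORE_BW[nblocks][iodepth] = (sync bw_gib_s, async bw_gib_s)
MULTI_CORE_BW = {
    1: {1: (4.2075, 1.0462), 2: (4.1362, 1.9284), 4: (4.0030, 4.2713), 8: (4.1653, 4.3301)},
    2: {1: (5.7289, 1.8708), 2: (5.7964, 3.3749), 4: (5.8434, 7.9783), 8: (5.8166, 8.2545)},
    4: {1: (7.4131, 3.2133), 2: (7.3150, 5.4983), 4: (7.2784, 11.2717), 8: (7.2873, 16.6299)},
    8: {1: (8.0403, 4.6038), 2: (8.0052, 8.5176), 4: (8.1519, 14.8211), 8: (7.8525, 27.5860)},
}


def first_async_win(by_depth):
    """Return the lowest iodepth where async beats sync, or None if it never does."""
    for depth in sorted(by_depth):
        sync_bw, async_bw = by_depth[depth]
        if async_bw > sync_bw:
            return depth
    return None


if __name__ == "__main__":
    for nblocks, by_depth in MULTI_CORE_BW.items():
        print(f"nblocks={nblocks}: async first wins at iod={first_async_win(by_depth)}")
```

On this data the crossover is at iod=4 for nblocks 1, 2 and 4, and at iod=2 for nblocks=8.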
64 1 0 9.1983 52.6768
64 1 1 8.1505 59.5486
128 1 0 7.5442 128.8704
128 1 1 7.3481 132.2355
Somewhere between nblocks=64 and 128, we reach the point where there's
basically no loss at iodepth 1.
This seems to validate setting IOSQE_ASYNC around a block size of >= 4 and a
queue depth of > 4. I guess it could make sense to reduce it from > 4 to >= 4
based on these numbers, but I don't think it matters terribly.
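As a rough sketch of the kind of check being discussed: the thresholds (4 blocks, queue depth above 4) come from the paragraph above, but how the two conditions combine is not spelled out in this mail. The version below assumes either condition alone is enough, which is one reading of the quoted comment (large IOs go async even at low depth, small IOs only once the queue is deep). The function and constant names are hypothetical, and this is not the code from the patch.

```python
ASYNC_MIN_BLOCKS = 4     # "block size of >= 4" from the text above
ASYNC_MIN_IN_FLIGHT = 4  # "queue depth of > 4" from the text above


def wants_async(nblocks, in_flight):
    """Return True if an IO would be flagged for asynchronous processing.

    Assumption: a large IO or a deep queue is sufficient on its own.
    """
    return nblocks >= ASYNC_MIN_BLOCKS or in_flight > ASYNC_MIN_IN_FLIGHT


if __name__ == "__main__":
    for nblocks, in_flight in [(1, 1), (1, 8), (4, 1), (16, 16)]:
        print(f"nblocks={nblocks} in_flight={in_flight} -> {wants_async(nblocks, in_flight)}")
```

Changing `>` to `>=` in the depth comparison corresponds to the "> 4 to >= 4" adjustment mentioned above.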
Obviously with just one core there will only ever be a loss from doing an
asynchronous / concurrent copy from the page cache. But it's interesting to
see where the overhead of async starts to be less of a factor.
At iodepth 1 (worst case, a context switch for every IO):
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2324 1.5692
1 1 1 1.7883 3.9574
2.36x bw regression
2 1 0 5.7914 2.4004
2 1 1 2.9585 4.8417
1.96x bw regression
4 1 0 7.3171 3.9242
4 1 1 4.2450 6.8171
1.7x bw regression
8 1 0 8.1162 7.2674
8 1 1 5.7536 10.2948
1.4x bw regression
16 1 0 8.8559 13.5212
16 1 1 7.1163 16.8277
1.24x bw regression
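The regression factors here are sync bandwidth divided by async bandwidth. A minimal sketch that recomputes them from the single-core iodepth 1 rows (values copied from the table; the helper name is made up):

```python
# (nblocks, sync bw_gib_s, async bw_gib_s) at iod=1 from the single-core results
ONE_CORE_IOD1 = [
    (1, 4.2324, 1.7883),
    (2, 5.7914, 2.9585),
    (4, 7.3171, 4.2450),
    (8, 8.1162, 5.7536),
    (16, 8.8559, 7.1163),
]


def bw_regression(sync_bw, async_bw):
    """Factor by which forcing async reduces bandwidth (above 1.0 means slower)."""
    return sync_bw / async_bw


if __name__ == "__main__":
    for nblocks, sync_bw, async_bw in ONE_CORE_IOD1:
        print(f"nblocks={nblocks}: {bw_regression(sync_bw, async_bw):.2f}x bw regression")
```

The same helper applies unchanged to rows at any other iodepth.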
But the IO depth would not stay at 1 in the case of postgres with the proposed
changes, it'd ramp up due to needing to wait for the kernel to complete those
IOs asynchronously.
Therefore it is worth comparing against a deeper IO depth:
nblocks iod async bw_gib_s lat_usec
1 16 0 4.1094 29.4339
1 16 1 3.3922 35.7044
1.21x bw regression
2 16 0 5.8381 41.5402
2 16 1 4.8104 50.4571
1.21x bw regression
4 16 0 7.1204 68.2612
4 16 1 5.6479 86.0973
1.26x bw regression
8 16 0 7.0780 137.5520
8 16 1 6.1687 157.8805
1.14x bw regression
16 16 0 7.4523 261.4281
16 16 1 6.7192 290.0837
1.10x bw regression
This assumes a very extreme scenario (no cycles whatsoever available for
parallelism), so I'm just looking for the worst case regression here.
I don't think there are very clear indicators for what cutoffs to use in the
onecpu data. Clearly we shouldn't go for async for single block IOs, but we
aren't. With the default io_combine_limit=16 and effective_io_concurrency=16,
we'd end up with a 1.10x regression in the extreme case of only having a single
core available (but that one fully!) and doing nothing other than IO.
Seems ok to me.
I ran it on three other machines (newer workstation, laptop, old laptop) as
well, with similarly shaped results (although considerably higher & lower
throughputs across the board, depending on the machine).
Zen 4 Laptop:
nblocks iod async bw_gib_s lat_usec
1 1 0 6.0989 1.1408
1 1 1 1.4477 5.1246
1 2 0 6.9600 2.0827
1 2 1 2.8750 5.1711
1 4 0 7.0283 4.2307
1 4 1 8.9174 3.3169
Surprisingly bigger difference between sync/async at iod=1, but it's again
similar around iod=4.
4 1 0 14.5638 1.9616
4 1 1 5.1245 5.8016
4 2 0 14.8867 3.9607
4 2 1 12.1841 4.8662
4 4 0 14.8678 8.0764
4 4 1 21.5077 5.5417
Similar.
16 1 0 21.0754 5.5891
16 1 1 12.6180 9.4753
16 2 0 20.2770 11.8353
16 2 1 24.3277 9.8172
At nblocks=16, async already starts to be faster at iod=2.
Greetings,
Andres Freund
Attachments:
[text/x-python] bench_async_uring.py (3.0K, 2-bench_async_uring.py)
#!/usr/bin/env python3
import argparse
import json
import subprocess
import sys


def run_fio(directory, nblocks, iodepth, force_async,
            size,
            runtime):
    # nblocks is in multiples of 8 KiB
    bs = nblocks * 8 * 1024
    cmd = [
        "fio",
        f"--directory={directory}",
        f"--size={size}",
        "--name=read",
        "--invalidate=0",
        "--rw=read",
        "--direct=0",
        "--buffered=1",
        "--time_based=1",
        f"--runtime={runtime}",
        "--ioengine=io_uring",
        f"--iodepth={iodepth}",
        f"--force_async={force_async}",
        f"--bs={bs}",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def extract_metrics(data):
    """
    Extract bandwidth (GiB/s) and average latency (µs) from fio JSON.

    fio JSON reports bandwidth in KiB/s and latency in nanoseconds.
    """
    read_stats = data["jobs"][0]["read"]
    bw_kibs = read_stats["bw"]  # KiB/s
    bw_gibs = bw_kibs / 1024**2  # KiB/s → GiB/s
    lat_ns = read_stats["lat_ns"]["mean"]  # nanoseconds
    lat_usec = lat_ns / 1000.0  # → µs
    return bw_gibs, lat_usec


def main():
    parser = argparse.ArgumentParser(
        description="Run fio sequential read benchmarks across parameter combos."
    )
    parser.add_argument(
        "--directory", default="/srv/fio",
        help="fio test directory (default: /srv/fio)",
    )
    parser.add_argument(
        "--size", default="4GiB",
        help="fio file size (default: 4GiB)",
    )
    parser.add_argument(
        "--runtime", type=int, default=1,
        help="Seconds per test (default: 1)",
    )
    parser.add_argument(
        "--nblocks", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32, 64, 128],
        help="Block-count values to test (bs = nblocks * 8 KiB)",
    )
    parser.add_argument(
        "--iodepths", type=int, nargs="+",
        default=[1, 2, 4, 8, 16, 32],
        help="iodepth values to test",
    )
    args = parser.parse_args()

    print("nblocks\tiod\tasync\tbw_gib_s\tlat_usec")
    for nblocks in args.nblocks:
        for iodepth in args.iodepths:
            for force_async in [0, 1]:
                try:
                    data = run_fio(
                        directory=args.directory,
                        nblocks=nblocks,
                        iodepth=iodepth,
                        force_async=force_async,
                        size=args.size,
                        runtime=args.runtime,
                    )
                    bw_gibs, lat_usec = extract_metrics(data)
                    print(f"{nblocks}\t{iodepth}\t{force_async}\t{bw_gibs:.4f}\t{lat_usec:.4f}")
                    sys.stdout.flush()
                except subprocess.CalledProcessError as exc:
                    print(f"# ERROR nblocks={nblocks} iod={iodepth} async={force_async}: {exc}",
                          file=sys.stderr)
                    sys.exit(1)


if __name__ == "__main__":
    main()
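The script prints one tab-separated row per run. A small, hypothetical loader for that output could look like the sketch below; the `Row` type and `parse_results` name are not part of the attachment, and the sample input is two rows taken from the multi-core results.

```python
from typing import NamedTuple


class Row(NamedTuple):
    nblocks: int
    iod: int
    force_async: int
    bw_gib_s: float
    lat_usec: float


def parse_results(text):
    """Parse the benchmark output, skipping the header, blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()  # accepts tabs as well as runs of spaces
        if not fields or fields[0] == "nblocks" or fields[0].startswith("#"):
            continue
        rows.append(Row(int(fields[0]), int(fields[1]), int(fields[2]),
                        float(fields[3]), float(fields[4])))
    return rows


if __name__ == "__main__":
    sample = (
        "nblocks\tiod\tasync\tbw_gib_s\tlat_usec\n"
        "8\t32\t0\t6.9031\t282.2350\n"
        "8\t32\t1\t38.2430\t50.7700\n"
    )
    for row in parse_results(sample):
        print(row)
```

Splitting on arbitrary whitespace rather than only on tabs keeps the loader usable on copies of the results where the tabs were converted to spaces.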
[text/tab-separated-values] results_manycore.tsv (2.1K, 3-results_manycore.tsv)
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2075 1.5802
1 1 1 1.0462 6.9652
1 2 0 4.1362 3.4533
1 2 1 1.9284 7.6040
1 4 0 4.0030 7.3720
1 4 1 4.2713 6.9086
1 8 0 4.1653 14.4072
1 8 1 4.3301 13.8365
1 16 0 4.1829 28.9216
1 16 1 4.3006 28.1261
1 32 0 4.0735 59.6232
1 32 1 4.3248 56.1614
2 1 0 5.7289 2.4261
2 1 1 1.8708 7.7466
2 2 0 5.7964 5.0144
2 2 1 3.3749 8.7417
2 4 0 5.8434 10.2023
2 4 1 7.9783 7.3977
2 8 0 5.8166 20.7226
2 8 1 8.2545 14.5431
2 16 0 5.8215 41.6613
2 16 1 8.2354 29.3879
2 32 0 5.6530 86.0286
2 32 1 8.3218 58.3826
4 1 0 7.4131 3.8807
4 1 1 3.2133 9.1827
4 2 0 7.3150 8.0854
4 2 1 5.4983 10.8039
4 4 0 7.2784 16.5097
4 4 1 11.2717 10.5699
4 8 0 7.2873 33.2331
4 8 1 16.6299 14.4164
4 16 0 7.1606 67.8777
4 16 1 16.9794 28.4981
4 32 0 6.2954 154.6834
4 32 1 16.3686 59.3610
8 1 0 8.0403 7.3503
8 1 1 4.6038 12.7202
8 2 0 8.0052 14.9161
8 2 1 8.5176 13.9987
8 4 0 8.1519 29.6698
8 4 1 14.8211 16.1640
8 8 0 7.8525 61.8612
8 8 1 27.5860 17.4434
8 16 0 6.8887 141.3268
8 16 1 34.1307 28.3463
8 32 0 6.9031 282.2350
8 32 1 38.2430 50.7700
16 1 0 8.8933 13.4650
16 1 1 6.2728 18.9827
16 2 0 8.9076 27.1436
16 2 1 11.7220 20.5300
16 4 0 8.7233 55.6745
16 4 1 18.7624 25.7505
16 8 0 7.4987 129.8232
16 8 1 34.5575 27.9686
16 16 0 7.3837 263.8333
16 16 1 41.0612 47.2465
16 32 0 7.3259 531.7938
16 32 1 39.9109 97.4890
32 1 0 9.2683 26.0485
32 1 1 7.6606 31.5179
32 2 0 9.0020 53.9343
32 2 1 13.1658 36.4441
32 4 0 7.5486 128.9595
32 4 1 22.4968 43.0776
32 8 0 7.6493 254.6875
32 8 1 39.0059 49.6149
32 16 0 7.5547 515.7824
32 16 1 41.4617 93.7583
32 32 0 6.6947 1162.7013
32 32 1 36.8926 211.2120
64 1 0 9.1983 52.6768
64 1 1 8.1505 59.5486
64 2 0 7.6384 127.3833
64 2 1 13.9811 69.1716
64 4 0 7.5413 258.3375
64 4 1 25.5018 76.2306
64 8 0 7.3730 528.3678
64 8 1 41.5893 93.4699
64 16 0 6.7242 1157.6784
64 16 1 35.1358 221.7170
64 32 0 5.2491 2958.7474
64 32 1 29.6408 526.3284
128 1 0 7.5442 128.8704
128 1 1 7.3481 132.2355
128 2 0 7.5959 256.3192
128 2 1 14.3860 135.3077
128 4 0 7.5891 513.4345
128 4 1 26.2082 148.3515
128 8 0 6.7218 1158.2197
128 8 1 39.5513 196.9944
128 16 0 5.1950 2990.2587
128 16 1 28.7749 542.1595
128 32 0 4.8389 6388.4904
128 32 1 27.5857 1131.2934
[text/tab-separated-values] results_onecore.tsv (2.1K, 4-results_onecore.tsv)
download | inline:
nblocks iod async bw_gib_s lat_usec
1 1 0 4.2324 1.5692
1 1 1 1.7883 3.9574
1 2 0 4.0756 3.5073
1 2 1 2.0574 7.1336
1 4 0 4.0813 7.2245
1 4 1 2.5805 11.5444
1 8 0 4.1485 14.4645
1 8 1 3.1191 19.3035
1 16 0 4.1094 29.4339
1 16 1 3.3922 35.7044
1 32 0 4.1652 58.3173
1 32 1 3.5813 67.8628
2 1 0 5.7914 2.4004
2 1 1 2.9585 4.8417
2 2 0 5.8205 5.0033
2 2 1 3.2484 9.0866
2 4 0 5.8692 10.1507
2 4 1 4.1805 14.3272
2 8 0 5.8241 20.7047
2 8 1 4.5100 26.7865
2 16 0 5.8381 41.5402
2 16 1 4.8104 50.4571
2 32 0 5.7680 84.3214
2 32 1 4.8923 99.4498
4 1 0 7.3171 3.9242
4 1 1 4.2450 6.8171
4 2 0 7.3149 8.0876
4 2 1 4.6114 12.9498
4 4 0 7.3564 16.3417
4 4 1 5.2204 23.0800
4 8 0 7.3753 32.8332
4 8 1 5.5436 43.7378
4 16 0 7.1204 68.2612
4 16 1 5.6479 86.0973
4 32 0 6.2542 155.6801
4 32 1 5.4395 179.0695
8 1 0 8.1162 7.2674
8 1 1 5.7536 10.2948
8 2 0 8.1180 14.7826
8 2 1 5.7472 20.9051
8 4 0 8.0499 30.0124
8 4 1 6.2692 38.6020
8 8 0 7.9290 61.2775
8 8 1 6.3000 77.1385
8 16 0 7.0780 137.5520
8 16 1 6.1687 157.8805
8 32 0 6.9722 279.4301
8 32 1 6.2175 313.5523
16 1 0 8.8559 13.5212
16 1 1 7.1163 16.8277
16 2 0 8.8395 27.3402
16 2 1 7.1646 33.7138
16 4 0 8.6280 56.2576
16 4 1 6.8501 70.9189
16 8 0 7.5552 128.8521
16 8 1 6.5925 147.6890
16 16 0 7.4523 261.4281
16 16 1 6.7192 290.0837
16 32 0 7.3669 528.8536
16 32 1 6.7170 580.7891
32 1 0 9.2169 26.2060
32 1 1 8.1352 29.6627
32 2 0 8.9783 54.0723
32 2 1 7.4881 64.8356
32 4 0 7.7601 125.4378
32 4 1 7.0137 138.7156
32 8 0 7.7013 252.9703
32 8 1 7.0259 277.4011
32 16 0 7.6469 509.4715
32 16 1 7.0901 550.0196
32 32 0 6.7442 1154.2708
32 32 1 6.4801 1204.9072
64 1 0 9.1398 53.0560
64 1 1 8.4876 57.0746
64 2 0 7.7871 124.9611
64 2 1 7.2915 133.3649
64 4 0 7.7413 251.6239
64 4 1 7.2876 267.2993
64 8 0 7.7091 505.3741
64 8 1 7.1725 543.5188
64 16 0 6.8242 1140.7509
64 16 1 6.6159 1179.8322
64 32 0 5.1921 2991.8800
64 32 1 5.3160 2934.9295
128 1 0 7.7871 124.8521
128 1 1 7.4671 130.0443
128 2 0 7.6681 253.8789
128 2 1 7.3989 263.0230
128 4 0 7.6174 511.4664
128 4 1 7.3589 529.6611
128 8 0 6.8233 1141.0293
128 8 1 6.6625 1170.2110
128 16 0 5.1618 3009.7684
128 16 1 5.4797 2848.2109
128 32 0 4.8204 6413.3049
128 32 1 4.9751 6274.0151