Content-Type: multipart/mixed; boundary="------------wUA0OePheeTvJSrgUmSd5UxG"
Message-ID: <1c9302da-c834-4773-a527-1c1a7029c5a3@vondra.me>
Date: Thu, 28 Aug 2025 14:45:24 +0200
MIME-Version: 1.0
Subject: Re: index prefetching
From: Tomas Vondra
To: Andres Freund
Cc: Peter Geoghegan, Thomas Munro, Nazir Bilal Yavuz, Robert Haas, Melanie Plageman, PostgreSQL Hackers, Georgios, Konstantin Knizhnik, Dilip Kumar
References: <6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok> <5v2wuxg65l5e3s6uf373zskcqqoukmraxiucnvgn4t7b5cmeqx@5mhqsurdj6xn> <6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev> <0dd33755-cab8-49c8-b1ed-698732577fbb@vondra.me>
In-Reply-To: <0dd33755-cab8-49c8-b1ed-698732577fbb@vondra.me>

This is a multi-part message in MIME format.
--------------wUA0OePheeTvJSrgUmSd5UxG
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 8/26/25 17:06, Tomas Vondra wrote:
>
>
> On 8/26/25 01:48, Andres Freund wrote:
>> Hi,
>>
>> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote:
>>>
>>> ...
>>>
>>> I'm not sure what's causing this, but almost all regressions my script
>>> is finding look like this - always io_method=worker, with distance close
>>> to 2.0. Is this some inherent io_method=worker overhead?
>>
>> I think what you might be observing might be the inherent IPC / latency
>> overhead of the worker based approach. This is particularly pronounced if the
>> workers are idle (and the CPU they get scheduled on is clocked down). The
>> latency impact of that is small, but if you never actually get to do much
>> readahead it can be visible.
>>
>
> Yeah, that's quite possible. If I understand the mechanics of this, this
> can behave in a rather unexpected way - lowering the load (i.e. issuing
> fewer I/O requests) can make the workers "more idle" and therefore more
> likely to get suspended ...
>
> Is there a good way to measure if this is what's happening, and the
> impact? For example, it'd be interesting to know how long it took for a
> submitted process to get picked up by a worker. And % of time a worker
> spent handling I/O.
>

I kept thinking about this, and in the end I decided to try to measure
this IPC overhead. The backend/ioworker communicate by sending signals,
so I wrote a simple C program that does "signal echo" with two processes
(one fork). It works like this:

1) fork a child process
2) send a signal to the child
3) child notices the signal, sends a response signal back
4) after receiving response, go back to (2)

This happens until the requested number of signals is sent, and then it
prints stats like signals/second etc.

The C file is attached, I'm sure it's imperfect but it does the trick.

And the results mostly agree with the benchmark results from yesterday.
Which makes sense, because if the distance collapses to ~1, the AIO with
io_method=worker starts doing about the same thing for every block.

If I run the signal test on the ryzen machine, I get this:

-----------------------------------------------------------------------
root@ryzen:~# ./signal-echo 1000000
nmm_signals = 1000000
parent: sent 100000 signals in 196909 us (1.97)
...
parent: sent 1000000 signals in 1924263 us (1.92 us)
signals / sec = 519679.48
-----------------------------------------------------------------------

So it can do about 500k signals / second. This means that requesting
blocks one by one (with distance=1), a single worker can do about 4GB/s
(500k 8kB blocks per second), assuming there's no other work (no actual
I/O, no checksum checks, ...).

Consider the warm runs with 512MB shared buffers, which means there's no
I/O but the data needs to be copied from page cache (by the worker). An
EXPLAIN ANALYZE for the query says this:

  Buffers: shared hit=2573018 read=455610

That's 455610 blocks to read, mostly one by one. So a bit less than 1
second just for the IPC, but there's also the memcpy etc. An example
result from the benchmark looks like this:

  master:   967ms
  patched: 2353ms

So that's a ~1400ms difference. So a bit more, but in the right ballpark,
and the extra overhead could be due to AIO being more complex than sync
I/O, etc. Not sure.

The xeon can do ~190k signals/second, i.e. about 1/3 of ryzen, so the
index scan would spend ~3 seconds on the IPC. Timings for the same test
look like this:

  master:  3049ms
  patched: 9636ms

So, that's about 2x the expected difference. Not sure where the extra
overhead comes from, might be due to NUMA (which the ryzen does not
have).

So I think the IPC overhead with "worker" can be quite significant,
especially for cases with distance=1. I don't think it's a major issue
for PG18, because seq/bitmap scans are unlikely to collapse the distance
like this. And with larger distances the cost amortizes.
It's a much bigger issue for index prefetching, it seems.

This is for the "warm" runs with 512MB, with the basic prefetch patch.
I'm not sure it explains the overhead with the patches that increase the
prefetch distance (be it mine or Thomas' patch), or cold runs. The
regressions seem to be smaller in those cases, though.


regards

--
Tomas Vondra

--------------wUA0OePheeTvJSrgUmSd5UxG
Content-Type: text/x-csrc; charset=UTF-8; name="signal-echo.c"
Content-Disposition: attachment; filename="signal-echo.c"
Content-Transfer-Encoding: base64

I2luY2x1ZGUgPHVuaXN0ZC5oPgojaW5jbHVkZSA8c3RkaW8uaD4KI2luY2x1ZGUgPHN0ZGxp Yi5oPgojaW5jbHVkZSA8c2lnbmFsLmg+CiNpbmNsdWRlIDxzeXMvdGltZS5oPgoKc3RhdGlj IHZvbGF0aWxlIHNpZ19hdG9taWNfdCBzaWduYWxfcmVjZWl2ZWQgPSAwOwoKc3RhdGljIHZv aWQgaGFuZGxlX3NpZ25hbChpbnQgc2lnbm8pCnsKCXNpZ25hbF9yZWNlaXZlZCA9IDE7Cn0K CmludAptYWluKGludCBhcmdjLCBjaGFyICoqYXJndikKewoJaW50CW51bV9zaWduYWxzOwoJ aW50CXBhcmVudF9waWQgPSBnZXRwaWQoKTsKCWludAljaGlsZF9waWQ7CgoJaWYgKGFyZ2Mg IT0gMikKCXsKCQlwcmludGYoImludmFsaWQgbnVtYmVyIG9mIGFyZ3VtZW50c1xuIik7CgkJ ZXhpdCgxKTsKCX0KCglpZiAoc2lnbmFsKFNJR1VSRywgaGFuZGxlX3NpZ25hbCkgPT0gU0lH X0VSUikKCXsKCQlwcmludGYoImZhaWxlZCB0byBzZXQgc2lnbmFsXG4iKTsKCQlleGl0KDIp OwoJfQoKCW51bV9zaWduYWxzID0gYXRvaShhcmd2WzFdKTsKCglwcmludGYoIm5tbV9zaWdu YWxzID0gJWRcbiIsIG51bV9zaWduYWxzKTsKCgljaGlsZF9waWQgPSBmb3JrKCk7CgoJaWYg KGNoaWxkX3BpZCAhPSAwKQoJewoJCWludCBjbnQgPSAwOwoJCWludCB3YWl0aW5nID0gMDsK CgkJc3RydWN0IHRpbWV2YWwgdGltZV9zdGFydDsKCQlzdHJ1Y3QgdGltZXZhbCB0aW1lX2Vu ZDsKCgkJaW50IGR1cmF0aW9uOwoKCQkvKiBzbGVlcCBhIGJpdCwgc28gdGhhdCBjaGlsZCBz dGFydHMgKi8KCQl1c2xlZXAoMTAwMDApOwoKCQlnZXR0aW1lb2ZkYXkoJnRpbWVfc3RhcnQs IE5VTEwpOwoKCQl3aGlsZSAoY250IDwgbnVtX3NpZ25hbHMpCgkJewoJCQlpZiAod2FpdGlu ZyA9PSAwKQoJCQl7CgkJCQkvKiBzZW5kIHNpZ25hbCB0byBjaGlsZCAqLwoJCQkJd2FpdGlu ZyA9IDE7CgkJCQlraWxsKGNoaWxkX3BpZCwgU0lHVVJHKTsKCQkJfQoKCQkJLyogaGF2ZSBy ZWNlaXZlZCByZXNwb25zZT8gKi8KCQkJaWYgKHNpZ25hbF9yZWNlaXZlZCkKCQkJewoJCQkJ
c2lnbmFsX3JlY2VpdmVkID0gMDsKCQkJCXdhaXRpbmcgPSAwOwoJCQkJY250Kys7CgoJCQkJ aWYgKGNudCAlIDEwMDAwMCA9PSAwKQoJCQkJewoJCQkgICAgICAgICAgICAgICAgZ2V0dGlt ZW9mZGF5KCZ0aW1lX2VuZCwgTlVMTCk7CgogICAgICAgIAkJCSAgICAgICAgZHVyYXRpb24g PSAodGltZV9lbmQudHZfc2VjIC0gdGltZV9zdGFydC50dl9zZWMpICogMTAwMDAwMCArCgkJ CQkJCQkodGltZV9lbmQudHZfdXNlYyAtIHRpbWVfc3RhcnQudHZfdXNlYyk7CgoJCQkgICAg ICAgICAgICAgICAgcHJpbnRmKCJwYXJlbnQ6IHNlbnQgJWQgc2lnbmFscyBpbiAlZCB1cyAo JS4yZilcbiIsIGNudCwgZHVyYXRpb24sIChkdXJhdGlvbiAqIDEuMCAvIGNudCkpOwoJCQkJ fQoJCQl9CgkJfQoKCQlnZXR0aW1lb2ZkYXkoJnRpbWVfZW5kLCBOVUxMKTsKCgkJZHVyYXRp b24gPSAodGltZV9lbmQudHZfc2VjIC0gdGltZV9zdGFydC50dl9zZWMpICogMTAwMDAwMCAr ICh0aW1lX2VuZC50dl91c2VjIC0gdGltZV9zdGFydC50dl91c2VjKTsKCgkJcHJpbnRmKCJw YXJlbnQ6IHNlbnQgJWQgc2lnbmFscyBpbiAlZCB1cyAoJS4yZiB1cylcbiIsIGNudCwgZHVy YXRpb24sIGR1cmF0aW9uICogMS4wIC8gY250KTsKCgkJcHJpbnRmKCJzaWduYWxzIC8gc2Vj ID0gJS4yZlxuIiwgY250IC8gKGR1cmF0aW9uIC8gMTAwMDAwMC4wKSk7Cgl9CgllbHNlCgl7 CgkJLyogd2FpdCBmb3Igc2lnbmFsLCBzZW5kIHNpZ25hbCBiYWNrICovCgkJd2hpbGUgKDEp CgkJewoJCQlpZiAoc2lnbmFsX3JlY2VpdmVkKQoJCQl7CgkJCQlzaWduYWxfcmVjZWl2ZWQg PSAwOwoJCQkJa2lsbChwYXJlbnRfcGlkLCBTSUdVUkcpOwoJCQl9CgkJfQoJfQoKCXJldHVy biAwOwp9Cg== --------------wUA0OePheeTvJSrgUmSd5UxG--