Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umIsp-00DHgq-1w for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Aug 2025 21:19:43 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1umIsn-000OA4-II for pgsql-hackers@arkaria.postgresql.org; Wed, 13 Aug 2025 21:19:41 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umIsn-000O9t-8j for pgsql-hackers@lists.postgresql.org; Wed, 13 Aug 2025 21:19:41 +0000 Received: from relay6-d.mail.gandi.net ([2001:4b98:dc4:8::226]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1umIsk-000QYh-1c for pgsql-hackers@lists.postgresql.org; Wed, 13 Aug 2025 21:19:40 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id C0400442E0; Wed, 13 Aug 2025 21:19:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1755119975; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=LKbiUfzygX2dcadafQXhcfcRQn59yn4DJIZz5yLKUfQ=; b=fXx0VyVwBvowt5+EPxWterf4LIAfpIzsHrgEi0D1SSXqg63Bsi0PrlSShW8T8FHVAi2/Hv te6dPLzHEMZ8qaVN5pO4Ob0WWcm4MpZLRZXQ25RP+xaHAwmNom3b0mFm9ubKBL8LmZjgub ZyHvsvDAb/Vh4mP3zG0LLeloRuRpFtdwfd3FCsE5UYOFuTyqdthXtlcEFV93+942GOwAJM oFdrH7aT//NiOZRruBmsnBRV4MG5uMn6WRolry6oTo8tGvQiy5PyN2eS9n1CcZFjGs+EDA yfF3uryJTlQXKvjcy+RoQnLvqu63hDWZKWZJXeVdrrZcHjspPy8RGTw0TmZfZQ== Message-ID: Date: Wed, 13 Aug 2025 23:19:29 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Peter Geoghegan , Andres Freund Cc: Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <51b5f71b-5f19-4453-91ff-2b9f2a840c58@vondra.me> <6cb6109d-71d6-490c-8056-d8885081b008@vondra.me> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdefgddufeelvdekucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthekredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepuedvvdeifefffeekudeggfdtieeglefggeduheffveeihefggfehgfdvudetffevnecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhnsggprhgtphhtthhopedutddprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopegrnhgurhgvshesrghnrghrrgiivghlrdguvgdprhgtphhtthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopegshigrvhhuiiekudesghhmrghilhdrtghomhdprhgtphhtthhopehrohgsvghrthhmhhgrrghssehgmhgrihhlrdgtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghilhdrtghomhdprhgtphhtt hhopehpghhsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrghdprhgtphhtthhopehgkhhokhholhgrthhoshesphhrohhtohhnmhgrihhlrdgtohhm X-GND-Sasl: tomas@vondra.me List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 8/13/25 18:01, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 11:28 AM Andres Freund wrote: >>> With "sync" I always get this (after a restart): >>> >>> Buffers: shared hit=7435 read=52801 >>> >>> while with "worker" I get this: >>> >>> Buffers: shared hit=4879 read=52801 >>> Buffers: shared hit=5151 read=52801 >>> Buffers: shared hit=4978 read=52801 >>> >>> So not only it changes run to tun, it also does not add up to 60236. >> >> This is reproducible on master? If so, how? > > AFAIK it is *not* reproducible on master. > >> If this happens only with the prefetching patch applied, is is possible that >> what happens here is that we occasionally re-request buffers that already in >> the process of being read in? That would only happen with a read stream and >> io_method != sync (since with sync we won't read ahead). If we have to start >> reading in a buffer that's already undergoing IO we wait for the IO to >> complete and count that access as a hit: > > This theory seems quite plausible to me. Though it is a bit surprising > that I see incorrect buffer hit counts on the "good" forwards scan > case, rather than on the "bad" backwards scan case. > > Here's what I mean by things being broken on the read stream side (at > least with certain backwards scan cases): > > When I add instrumentation to the read stream side, by adding elog > debug calls that show the blocknum seen by read_stream_get_block, I > see out-of-order and repeated blocknums with the "bad" backwards scan > case ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a > desc"): > > ... > NOTICE: index_scan_stream_read_next: index 1163 TID (25052,21) > WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 > WARNING: blocknum: 25052, 0x55614810efb0 > WARNING: blocknum: 25052, 0x55614810efb0 > NOTICE: index_scan_stream_read_next: index 1161 TID (25053,3) > WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25053 > WARNING: blocknum: 25053, 0x55614810efb0 > NOTICE: index_scan_stream_read_next: index 1160 TID (25052,19) > WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 > WARNING: blocknum: 25052, 0x55614810efb0 > WARNING: blocknum: 25052, 0x55614810efb0 > NOTICE: index_scan_stream_read_next: index 1141 TID (25051,21) > WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25051 > WARNING: blocknum: 25051, 0x55614810efb0 > ... > > Notice that we see the same blocknum twice in close succession. Also > notice that we're passed 25052 and then subsequently passed 25053, > only to be passed 25053 once more. > I did investigate this, and I don't think there's anything broken in read_stream. It happens because ReadStream has a concept of "ungetting" a block, which can happen after hitting some I/O limits. In that case we "remember" the last block (in read_stream_look_ahead calls read_stream_unget_block), and we return it again. It may seem as if read_stream_get_block() produced the same block twice, but it's really just the block from the last round. All duplicates produced by read_stream_look_ahead were caused by this. I suspected it's a bug in lastBlock optimization, but that's not the case, it happens entirely within read_stream. And it's expected. It's also not very surprising this happens with backwards scans more. The I/O is apparently much slower (due to missing OS prefetch), so we're much more likely to hit the I/O limits (max_ios and various other limits in read_stream_start_pending_read). regards -- Tomas Vondra