From: Peter Geoghegan
Date: Tue, 24 Mar 2026 21:34:51 -0400
Subject: Re: index prefetching
To: Andres Freund
Cc: Tomas Vondra, Alexandre Felipe, Thomas Munro, Nazir Bilal Yavuz,
    Robert Haas, Melanie Plageman, PostgreSQL Hackers, Georgios,
    Konstantin Knizhnik, Dilip Kumar

On Tue, Mar 24, 2026 at 1:27 PM Andres Freund wrote:
> > But that means that it won't be triggered when we don't enter the "if
> > (hscan->xs_blk != ItemPointerGetBlockNumber(tid))" block that contains
> > all this code. Besides, it just doesn't seem possible that
> > heap_page_prune_opt would release its caller's pin.
>
> I was more concerned about read_stream_next_buffer() returning the wrong
> block, due to prefetching somehow "desynchronizing" with the scan position and
> catching that when it's clear that we just read a new block, rather than in a
> place where it could be either the continuation of a scan on the same page or
> a new page.

Then I don't follow.
The existing assertions will catch that (I should know, they've
failed enough times during development). Basically, I don't get the
concern about heap_page_prune_opt releasing its caller's pin. Even if
that happened, the existing assertions would still catch it.

> I think I had largely missed the "danger" of index only scans here. I think
> it'd be good to call that out more explicitly in these comments.

Will do.

> > > Does this only happen when paused?
> >
> > This "prefetchPos->valid = false" stuff is approximately the opposite
> > of pausing. Pausing resolves the problem of prefetchPos getting so far
> > ahead of scanPos that the batch ring buffer runs out of slots. Whereas
> > this prefetchPos invalidation code helps the read stream deal with
> > prefetchPos falling behind scanPos.
>
> Because I had somewhat missed the real cause of the problem - not calling the
> read stream code due to index only scans - I thought that somehow we could end
> up in this state due to not resuming prefetching before the scan position
> overtakes the prefetch position. But I don't think that actually happens.

Right, it can't happen. In any case the assertions we have are quite
effective at catching problems like that.

For example, if we don't resume prefetching and consume another batch,
there's an assertion for that. Actually, there's more than one.
There's a direct assertion, on the scan side. And the read stream
callback itself has a precondition assertion that the read stream is
not paused.

> > > Wonder if it's worth somehow asserting that after this the page is actually
> > > unguarded after the call.
> >
> > We used to, but the new layering forced me to remove it. Any ideas
> > about how to add it back?
>
> Adding an "isGuarded" field to IndexScanBatchData would be the easiest
> way. That way we can make assertions about the state without knowing anything
> about the internal mechanism of how guarding is implemented.
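
For illustration only (this is not code from the patch): a minimal,
standalone sketch of the kind of flag being suggested. DemoBatch and the
demo_batch_* functions are hypothetical stand-ins, not the real
IndexScanBatchData layout or API. The flag is maintained unconditionally,
and only the checks depend on assertions being enabled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for IndexScanBatchData. Only isGuarded matters
 * here; flagA/flagB are placeholders for booleans that already exist, so
 * the new field can share their padding.
 */
typedef struct DemoBatch
{
    bool        flagA;
    bool        flagB;
    bool        isGuarded;      /* true while the batch's page is guarded */
} DemoBatch;

/* Hypothetical: take the guard, however it is implemented, and record it. */
static void
demo_batch_guard(DemoBatch *batch)
{
    assert(!batch->isGuarded);
    /* mechanism-specific work (e.g. holding a pin) would go here */
    batch->isGuarded = true;
}

/* Hypothetical: drop the guard and record that the page is unguarded. */
static void
demo_batch_unguard(DemoBatch *batch)
{
    assert(batch->isGuarded);
    /* mechanism-specific release would go here */
    batch->isGuarded = false;
}
```

A caller that expects the page to be unguarded after some call can then
assert(!batch->isGuarded) without knowing how guarding works internally.
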
>
> I doubt setting/clearing that field even when assertions are disabled will be
> measurable, as long as you place it alongside the other booleans where there's
> padding space available.

I've prototyped that, and it works well. It'll be in v18.

> After replacing the pause with an error I found that it's surprisingly easy to
> hit on slow storage (or on fast storage if you set needed_wait=true in
> read_stream_next_buffer()). I've not done any performance validation on
> whether that means the limit is too low.

It's been a while since I last validated performance to justify the
current maximum number of batches. I used buffered I/O for that.

I'm sure that a higher maximum with very slow storage and a very high
effective_io_concurrency will provide some benefit. But perfectly
handling that isn't essential for the first committed version of index
prefetching.

I must admit I'm unsure how to evaluate the maximum number of batches.
It can make sense to pursue diminishing returns. But up to what point,
and according to what principle?

-- 
Peter Geoghegan