Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1raTu6-009vAg-62 for pgsql-hackers@arkaria.postgresql.org; Thu, 15 Feb 2024 05:03:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1raTu4-00AWqk-SR for pgsql-hackers@arkaria.postgresql.org; Thu, 15 Feb 2024 05:03:20 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1raTu3-00AWqc-Sf for pgsql-hackers@lists.postgresql.org; Thu, 15 Feb 2024 05:03:20 +0000 Received: from out5-smtp.messagingengine.com ([66.111.4.29]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1raTu0-006p7l-OY for pgsql-hackers@lists.postgresql.org; Thu, 15 Feb 2024 05:03:18 +0000 Received: from compute1.internal (compute1.nyi.internal [10.202.2.41]) by mailout.nyi.internal (Postfix) with ESMTP id C483B5C0079; Thu, 15 Feb 2024 00:03:13 -0500 (EST) Received: from mailfrontend1 ([10.202.2.162]) by compute1.internal (MEProxy); Thu, 15 Feb 2024 00:03:13 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1707973393; x=1708059793; bh=nbjAKuP4d8 pITTyDVI7JFfhccT4IJAj9jKI2L12fwik=; b=nRI/FaD4VF2EQnNVGGBQbs44eX lbXRO1lMLFgSGIzR4ary/yJyogmQow3BFfESVoiznShp9Kl0tLPOraDUsEdDuYuy ASOdPk5hn2HTeKn+Bqj3J7b5HW2Q4778OcMY5G2Jc87UGoJryUltfevOW6D7ibih 5M+BtaEu8jtJVTpSCLXBOxEsq/OcQy+I3iw0t5Wx7mLNkj6LJTY4Y9ajBiMPPbQA yykDFSKetk+SVBVjDbb7KF/SpOTqsY6xMC0qcgOQx662b2sVM0k3XvVp3XQne20A MbFNED5IHHPBAuBaleAe4eAHiBjr2XIJL+QRCGmTJbQQjAM6IGYtcxyTAfzQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s= fm3; t=1707973393; x=1708059793; bh=nbjAKuP4d8pITTyDVI7JFfhccT4I JAj9jKI2L12fwik=; b=V+NFxsWVk5iaPFLy/8L+skhqF3W/xBVU/TLqNU1DJJJd O32jg4uf8SmzforP49zNkAaO1wViX+bWQSRQcMgQ0LodDXlccEVxPBQpmAAeEMrj Hg6opFPP50ngV45iimkhFAyTulrEctBuwwsrSrDeWoznFLOTEmBCNmnME8B55RKJ jrrh5JivNfPftcrqBJWWykPcXFcGzdtsW4zXHF8YB+qfzJvm3ZFknTd12d3PTwZk 0Udi3ty9ADqExbMhJ4xfAbQp4PCuYX/FVJLluo11z3IhJjoL8qpsp6JuIabpMbLd k6c/ivQ0FtqgSuiuwzsCnSXOJYIyf23Uk2ODhzborg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvledrudelgddvgecutefuodetggdotefrodftvf curfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfghnecu uegrihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenuc fjughrpeffhffvvefukfhfgggtuggjsehttdertddttddvnecuhfhrohhmpeetnhgurhgv shcuhfhrvghunhguuceorghnughrvghssegrnhgrrhgriigvlhdruggvqeenucggtffrrg htthgvrhhnpedvffefvefhteevffegieetfefhtddvffejvefhueetgeeludehteevudei tedtudenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpe grnhgurhgvshesrghnrghrrgiivghlrdguvg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 15 Feb 2024 00:03:13 -0500 (EST) Date: Wed, 14 Feb 2024 21:03:11 -0800 From: Andres Freund To: Robert Haas Cc: Tomas Vondra , Melanie Plageman , PostgreSQL Hackers , Georgios , Thomas Munro , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: <20240215050311.5zdmzvz2zckg3s6g@awork3.anarazel.de> References: <482ec3ff-52ad-415d-96fd-f3832a894023@enterprisedb.com> <56176b8d-956c-487e-ab09-310db4581c07@enterprisedb.com> <4867452a-b853-4813-a6da-9bb06a336f8b@enterprisedb.com> <5fc8f150-f350-4cb3-b6b7-ffcc21ed7bc2@enterprisedb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2024-02-15 09:59:27 +0530, Robert Haas wrote: > I would have thought that the way this prefetching would work is that > we would bring pages into shared_buffers sooner than we currently do, > but not actually pin them until we're ready to use them, so that it's > possible they might be evicted again before we get around to them, if > we prefetch too far and the system is too busy. The issue here is that we need to read index leaf pages (synchronously for now!) to get the tids to do readahead of table data. What you describe is done for the table data (IMO not a good idea medium term [1]), but the problem at hand is that once we've done readahead for all the tids on one index page, we can't do more readahead without looking at the next index leaf page. Obviously that would lead to a sawtooth like IO pattern, where you'd regularly have to wait for IO for the first tuples referenced by an index leaf page. However, if we want to issue table readahead for tids on the neighboring index leaf page, we'll - as the patch stands - not hold a pin on the "current" index leaf page. Which makes index prefetching as currently implemented incompatible with kill_prior_tuple, as that requires the index leaf page pin being held. > Alternately, it also seems OK to read those later pages and pin them right > away, as long as (1) we don't also give up pins that we would have held in > the absence of prefetching and (2) we have some mechanism for limiting the > number of extra pins that we're holding to a reasonable number given the > size of shared_buffers. FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently used to limit how many buffers a backend may pin for bulk relation extension. Greetings, Andres Freund [1] The main reasons that I think that just doing readahead without keeping a pin is a bad idea, at least medium term, are: a) To do AIO you need to hold a pin on the page while the IO is in progress, as the target buffer contents will be modified at some moment you don't control, so that buffer should better not be replaced while IO is in progress. So at the very least you need to hold a pin until the IO is over. b) If you do not keep a pin until you actually use the page, you need to either do another buffer lookup (expensive!) or you need to remember the buffer id and revalidate that it's still pointing to the same block (cheaper, but still not cheap). That's not just bad because it's slow in an absolute sense, more importantly it increases the potential performance downside of doing readahead for fully cached workloads, because you don't gain anything, but pay the price of two lookups/revalidation. Note that these reasons really just apply to cases where we read ahead because we are quite certain we'll need exactly those blocks (leaving errors or queries ending early aside), not for "heuristic" prefetching. If we e.g. were to issue prefetch requests for neighboring index pages while descending during an ordered index scan, without checking that we'll need those, it'd make sense to just do a "throway" prefetch request.