Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rGJKj-00Eeog-4Q for pgsql-hackers@arkaria.postgresql.org; Thu, 21 Dec 2023 13:43:29 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rGJKh-009R4A-OT for pgsql-hackers@arkaria.postgresql.org; Thu, 21 Dec 2023 13:43:27 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rGJKh-009R42-Ee for pgsql-hackers@lists.postgresql.org; Thu, 21 Dec 2023 13:43:27 +0000 Received: from wout5-smtp.messagingengine.com ([64.147.123.21]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rGJKe-00BVSW-Js for pgsql-hackers@lists.postgresql.org; Thu, 21 Dec 2023 13:43:26 +0000 Received: from compute4.internal (compute4.nyi.internal [10.202.2.44]) by mailout.west.internal (Postfix) with ESMTP id D6B3B3200A62; Thu, 21 Dec 2023 08:43:22 -0500 (EST) Received: from mailfrontend1 ([10.202.2.162]) by compute4.internal (MEProxy); Thu, 21 Dec 2023 08:43:23 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm2; t=1703166202; x=1703252602; bh=AFTZ5oyZ+e 7GctAY/PFnLHE6jbqWfBmb2yaV5mSs0bY=; b=dG8HeIjUfArDchSMyclVqqhb8R P9KN1x2VfXtt0ybhVUyGQzJ6hy2eDurp/aAfo3rJRopKj4XTuz4GXITxIDAumLFc e4S4SinvEhRw+A+/d57zCGYXxY60Nb/F6AC/Br/F6lmC4ieKnvZI9P09TgkmdMgg JiONCRyiseJdC5phv+EG1BQl/Hz4FCsaW6kEHplOmxWZs23RHV0HyIhe6K0Pzw/b g/YVyWkNHhoCgnpGdzfYw7DYISdTvyM6RFBx79QRdGuckSywfVzIs1ZDk5urcms5 nctMfEiZZSsMXx2FLYHU9KTgPezuYCNfX2H/lvsGmevlJU0N3OOoMKC2tOQg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s= fm2; t=1703166202; x=1703252602; bh=AFTZ5oyZ+e7GctAY/PFnLHE6jbqW fBmb2yaV5mSs0bY=; b=gn72eEiPk/dYE7YZ7AEB0hElrICSJgPP7i5fJMBpqPcq KAfp5RmLPIZ5D/XmqCtPlVfPZ+Yxxv+vKbbn/X2R/gRABEh+blAEd0YLb1/BxMDP Ntz6Mg3ZIztECylYO11CkPIJy4yqnbTSOk40Xp7ES6iLdJm+TgmVvTrSeKrnrac0 PYbWZs7AL9EGG3US7bdlCh5PLD2EkZrLEuajYrK2Mtoih/1YCiVbuprvm9eAFlhl AxSftj1fnOt2rcL/BlVzNY9PzBXBksEV19kD1BUCCjpxJHhMWDHKlgXY1Aw7EjoT ElfQFVaBMp83lEgrXA2tNTgQ44p30+5/Lc3ML5N9UQ== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvkedrvdduhedgfedtucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfqfgfvpdfurfetoffkrfgpnffqhgen uceurghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmne cujfgurhepfffhvfevuffkfhggtggujgesthdtredttddtvdenucfhrhhomheptehnughr vghsucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrf grthhtvghrnhepvdfffeevhfetveffgeeiteefhfdtvdffjeevhfeuteegleduheetvedu ieettddunecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomh eprghnughrvghssegrnhgrrhgriigvlhdruggv X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 21 Dec 2023 08:43:21 -0500 (EST) Date: Thu, 21 Dec 2023 05:43:14 -0800 From: Andres Freund To: Tomas Vondra Cc: Robert Haas , PostgreSQL Hackers , Georgios Subject: Re: index prefetching Message-ID: <20231221134314.wf2rs62d37u62j7t@alap3.anarazel.de> References: <8c86c3a6-074e-6c88-3e7e-9452b6a37b9b@enterprisedb.com> <3cd40425-965a-5ce1-1af3-d51971c44b93@enterprisedb.com> <8ec36f51-b863-60e3-20e2-b9c981c5ce5e@enterprisedb.com> <06bb7d02-2c44-3062-731e-a735ba13da7e@enterprisedb.com> <367160ea-b1ed-4481-e804-bca509128878@enterprisedb.com> <280dc83c-a16f-4424-1319-95e7e3f798bd@enterprisedb.com> <98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote: > You're right a lot of this is a guesswork. I don't think we can do much > better, because it depends on stuff that's out of our control - each OS > may do things differently, or perhaps it's just configured differently. > > But I don't think this is really a serious issue - all the read-ahead > implementations need to work about the same, because they are meant to > work in a transparent way. > > So it's about deciding at which point we think this is a sequential > pattern. Yes, the OS may use a slightly different threshold, but the > exact value does not really matter - in the worst case we prefetch a > couple more/fewer blocks. > > The OS read-ahead can't really prefetch anything except sequential > cases, so the whole question is "When does the access pattern get > sequential enough?". I don't think there's a perfect answer, and I don't > think we need a perfect one - we just need to be reasonably close. For the streaming read interface (initially backed by fadvise, to then be replaced by AIO) we found that it's clearly necessary to avoid fadvises in cases of actual sequential IO - the overhead otherwise leads to easily reproducible regressions. So I don't think we have much choice. > Also, while I don't want to lazily dismiss valid cases that might be > affected by this, I think that sequential access for index paths is not > that common (with the exception of clustered indexes). I think sequential access is common in other cases as well. There's lots of indexes where heap tids are almost perfectly correlated with index entries, consider insert only insert-only tables and serial PKs or inserted_at timestamp columns. Even leaving those aside, for indexes with many entries for the same key, we sort by tid these days, which will also result in "runs" of sequential access. > Obviously, the latter case has much more severe impact, but it depends > on the exact workload / access pattern etc. The only "perfect" solution > would be to actually check the page cache, but well - that seems to be > fairly expensive. > What I was envisioning was something self-tuning, based on the I/O we > may do later. If the prefetcher decides to prefetch something, but finds > it's already in cache, we'd increase the distance, to remember more > blocks. Likewise, if a block is not prefetched but then requires I/O > later, decrease the distance. That'd make it adaptive, but I don't think > we actually have the info about I/O. How would the prefetcher know that hte data wasn't in cache? > Alternatively, I was thinking about moving the prefetches into a > separate worker process (or multiple workers), so we'd just queue the > request and all the overhead would be done by the worker. The main > problem is the overhead of calling posix_fadvise() for blocks that are > already in memory, and this would just move it to a separate backend. I > wonder if that might even make the custom cache unnecessary / optional. The AIO patchset provides this. > AFAICS this seems similar to some of the AIO patch, I wonder what that > plans to do. I need to check. Yes, most of this exists there. The difference that with the AIO you don't need to prefetch, as you can just initiate the IO for real, and wait for it to complete. Greetings, Andres Freund