Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umf9n-001bQz-Du for pgsql-hackers@arkaria.postgresql.org; Thu, 14 Aug 2025 21:06:43 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1umf9i-009mZ5-La for pgsql-hackers@arkaria.postgresql.org; Thu, 14 Aug 2025 21:06:38 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umf9i-009mYr-7u for pgsql-hackers@lists.postgresql.org; Thu, 14 Aug 2025 21:06:38 +0000 Received: from mail-wr1-x42d.google.com ([2a00:1450:4864:20::42d]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1umf9f-000nMl-2e for pgsql-hackers@lists.postgresql.org; Thu, 14 Aug 2025 21:06:37 +0000 Received: by mail-wr1-x42d.google.com with SMTP id ffacd0b85a97d-3b9e414252dso722548f8f.3 for ; Thu, 14 Aug 2025 14:06:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1755205594; x=1755810394; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=7nyZcVeOt5nR38i/DhMOCO7K65CYWZeN3Fxalmhjn28=; b=Kp109AEru3Q3oDHJeCw+hJBvh3pS6MCYH/C1T6b3o5aNurmg/05OX+fAJMN1oxMmoy HfTfkLKkAROZgIL+J5emcqAig6iMiCeeAcX4VDDQlmgC1QGIRk+Dt6OHV221mw8wurxc mKL9ec8ISNrIjZX8BIzdOExB3foCM+899NyMV4bTACCo0Ry3zpsriomagkP3/xnIlj4q q5GME/FdQyKvtPfQu8NgtLLcot3KYlbchs3gZMuJqQyD71OFJXSZCfWCBxYobCP9kFX3 YbUh1GLEhM2aEoGaSIBiMmUCv0fv01J4K5DgGFJQwC6kUFN9I85ayrgR04P+MlBkJEDf k06A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755205594; x=1755810394; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=7nyZcVeOt5nR38i/DhMOCO7K65CYWZeN3Fxalmhjn28=; b=lib+I9R7R5dSVJ1TTlDxhLjFH1ft7urdTLSa5JFfgWas6ioFlmmlGTIw3J+BYvWAqr y8L++fPtAPnUHsvSMkaRLdBql02mkiiO+eS41XMYnxhXRu66DHRGE1zpjE+5RLPawAT/ Hky29xnDoo6IVq8z8RuGsZyEko1zn+Ox5VJ7ek09q+HH5r12oNoqLfLq0Xi7cSy3iUXu MsDH6f1qWQno6wOTDtQzPMTKEeYPx+sCXvozo9goiFf7PzEiRP3MzzG19aD4ZtRB7pHO Aios0YqNBL4vSCL3/Dq6zkluAp5K9J4u5mmEziG2+6ytEd4Mp1p4Cz2uytJT+F8ukppn dQtg== X-Forwarded-Encrypted: i=1; AJvYcCW5Yrb6VswGe8oBMZdkt2ct4hneO5Cwcs6hAKl4ZM5HbqTqiDRx5EUmAOgVBrSWvj3jOcnKPVrNYTKHWchl@lists.postgresql.org X-Gm-Message-State: AOJu0YzHXDS1a+3OhqL6hxvAUOjyV4VSZJztEEcpmtHuyNgAJD/5oghR Xspfv1w28Oa2wDvjQXvCnOmzuYY73JQRjEmbW7Esw3t35seCIaLxk+R73+3+qbSrTk5YcELavU9 V49TK88aqDlENHwXTRO6nO4SM4GeA9sMMtBJ/TtgN9g== X-Gm-Gg: ASbGnctIorhWCR07ZpFkDZfJbJgNlQlGoN/6HB2HYsOmrROaLZ0s7b0j5VeoZdmgfi1 fa8J8cIce1Corntk4/BsK4Sa+eMuztosBgq4Ky6qeAuzi0Yi8hOUHtFLKMyVHXqjjt5YoZbiZkb N9/MSx0cetCF1D0jZO//uvkV5frcEH5hdC/tm/G3K0Rqc5MgIBNlCxY2IDX322KZ0hzaGTuCe5q lnAQEXT8puBXWMR2SiTL5PQIUkO X-Google-Smtp-Source: AGHT+IGz951e9O+28hs54NqTXsgXwlhtpr27H/8z6zbq64V9xiW3Ju2iEKDCOuGuZd0lPOuupqdSKACi7LSo7ZSkwS4= X-Received: by 2002:a05:6000:2089:b0:3a4:eed9:755d with SMTP id ffacd0b85a97d-3b9edfd196emr3656601f8f.3.1755205594241; Thu, 14 Aug 2025 14:06:34 -0700 (PDT) MIME-Version: 1.0 References: <7c308720-5cd4-4508-8cc0-6e55a6352731@vondra.me> <6wyxbnry2unm3kbcu2sabhzhs7baoedlg77xqm42chpofjq45g@igst42zpl7ok> In-Reply-To: From: Peter Geoghegan Date: Thu, 14 Aug 2025 17:06:07 -0400 X-Gm-Features: Ac12FXyjR4MH8aJLRSsDbH_g1LGndvAI7L1DXNxbD8CORBfWmRQDjCMs7mbmaEo Message-ID: Subject: Re: index prefetching To: Andres Freund Cc: Tomas Vondra , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Aug 14, 2025 at 4:44=E2=80=AFPM Andres Freund = wrote: > Interesting. In the sequential case I see some waits that are not attribu= ted > in explain, due to the waits happening within WaitIO(), not WaitReadBuffe= rs(). > Which indicates that the read stream is trying to re-read a buffer that > previously started being read. I *knew* that something had to be up here. Thanks for your help with debugg= ing! > read_stream_start_pending_read() > -> StartReadBuffers() > -> AsyncReadBuffers() > -> ReadBuffersCanStartIO() > -> StartBufferIO() > -> WaitIO() > > There are far fewer cases of this in the random case. Index tuples with TIDs that are slightly out of order are very normal. Even for *perfectly* sequential inserts, the FSM tends to use the last piece of free space on a heap page some time after the heap page initially becomes "almost full". I recently described this to Tomas on this thread [1]. > From what I can tell the sequential case so often will re-read a buffer t= hat > it is already in the process of reading - and thus wait for that IO befor= e > continuing - that we don't actually keep enough IO in flight. Oops. There is an existing stop-gap mechanism in the patch that is supposed to deal with this problem. index_scan_stream_read_next, which is the read stream callback, has logic that is supposed to suppress duplicate block requests. But that's obviously not totally effective, since it only remembers the very last heap block request. If this same mechanism remembered (say) the last 2 heap blocks it requested, that might be enough to totally fix this particular problem. This isn't a serious proposal, but it'll be simple enough to implement. Hopefully when I do that (which I plan to soon) it'll fully validate your theory. > We can optimize that by deferring the StartBufferIO() if we're encounteri= ng a > buffer that is undergoing IO, at the cost of some complexity. I'm not su= re > real-world queries will often encounter the pattern of the same block bei= ng > read in by a read stream multiple times in close proximity sufficiently o= ften > to make that worth it. We definitely need to be prepared for duplicate prefetch requests in the context of index scans. I'm far from sure how sophisticated that actually needs to be. Obviously the design choices in this area are far from settled right now. [1] DC1G2PKUO9CI.3MK1L3YBZ2V3T@bowt.ie --=20 Peter Geoghegan