Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1raFQX-008oSU-8Q for pgsql-hackers@arkaria.postgresql.org; Wed, 14 Feb 2024 13:35:53 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1raFPV-005zc4-3W for pgsql-hackers@arkaria.postgresql.org; Wed, 14 Feb 2024 13:34:49 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1raFPU-005zbw-Kl for pgsql-hackers@lists.postgresql.org; Wed, 14 Feb 2024 13:34:48 +0000 Received: from mail-wm1-x333.google.com ([2a00:1450:4864:20::333]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1raFPQ-006iPr-OO for pgsql-hackers@lists.postgresql.org; Wed, 14 Feb 2024 13:34:46 +0000 Received: by mail-wm1-x333.google.com with SMTP id 5b1f17b1804b1-411d3c75026so9190505e9.0 for ; Wed, 14 Feb 2024 05:34:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1707917682; x=1708522482; darn=lists.postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=rH/1jtbExn0E6/m7DFnB5/A44HI5qC2eEHu9fhJrJtY=; b=ZiwbRc5eJRVHW8ukrRFkJaQ4V7jt9VNVptpxg4VSmthgKLtbFn38J29UzGabzg1vPX defc3KwEza/MQLw3KKfFz+W9RYoOWPBxu6Dv0bUL6vQOIP8gxQSxhDlNAUwn7yvVhP2t jrhCvRcEzmyg/SZDYisiHqZOzZBcltl0wjkj2wYU1enhLVqe61V7tvITUoZtMrQ+WSc5 nABJ2C7Mfw3KHbQTr7eNWHHeudswZ1Jc50Q63Z8XpWoJjdaTeuDhggXSvovftR5twxN4 TKqkIaQpNXcfEj6AZwrsSxpKH8KnLd/V12rvPRphVeNGvgrXfZ7D4TpKq2vQsp57RzzR MqGA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1707917682; x=1708522482; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=rH/1jtbExn0E6/m7DFnB5/A44HI5qC2eEHu9fhJrJtY=; b=FPkrJJt1TC+NVFwt7yqXC5fJRMf9CTH6C9O0WV0eGgtK6EU4WUQaEmMmKs8bOuDr3k 0XJVmhU6Ew/pN/JOKk3WpWgodbLz0XdQ05frRCdyIkGPB8mMJ4ZGdhVsMKIfssdFpghE dKLRBjqHMnUD9eLY0zdeg2f7IRlzV3LF8o/tWy28DfZoXYiQ1y9EFT3IqEskXj1pAK3W bzFbj6v7YNdTbs2x/gVKPN/XFPctjmXBY9R+mNajC2CtxUXF9e4Hgg92ASIdVn4tOi6X hbNYLLVnFvMqw/hVV2gS/v5Q2Jg4SD7fS0A5D/0Koq+g9H+tZChWOPCusNMEpNL75gvb x3yg== X-Forwarded-Encrypted: i=1; AJvYcCVgDFfJBv4ODGwQBeW4rIyuGFPWfZgqOhpMkTFfhzLDLtJ7m147306DaMGjTYCi5nr4fsrdAeZSM+Y6cnjQtAjFqb/SBba265mKt6y37ZEIHbHs X-Gm-Message-State: AOJu0YyJK+w2g9c6QcmrvX3S6/GW8hKBlwri0AKrQYj/G8NlFdhxmL2t TaD4JFdw440bT1v/0jl+R61nXSU2WkS0ZoVO9HfJyq38FP+R9dbc+t9x22k3Ng== X-Google-Smtp-Source: AGHT+IHug0KWhAIoJ0o7qPk2HDLG2mkpsrqumHnr800zv30Th4pP6FlKGPCxeylvCAjPqAgCHDEWYw== X-Received: by 2002:a05:600c:19c9:b0:410:6efd:3169 with SMTP id u9-20020a05600c19c900b004106efd3169mr2362665wmq.41.1707917682180; Wed, 14 Feb 2024 05:34:42 -0800 (PST) X-Forwarded-Encrypted: i=1; AJvYcCUN2kWZjRIaW3oNadocV/d1QClvUaT1LvEw8BMTx79ZLu1j8CT9V7YBlBKR8fAk87Gkq+St0+QBFNbYbxWlRwDuBT1e2U88bT1jBk7RR3KH5SFPj1+mjY6R/k6j7oHt8BDc25pAo7Y5rM3AumtAmlg7ecTZfoHKGOW4jlL2iPBsGpl6NaQZR3m98MFrxur3md7oBs90ZkshsP9FQoZlJ/PcODgOitvLGOM1zLrQFpxgGz7QcboRkP9IZTkRnvlOGKNdXFUs/IeUD3Q4HEOGhMlwPlr9PYJCreVYpe8= Received: from [10.137.0.18] (ip-86-49-229-30.bb.vodafone.cz. [86.49.229.30]) by smtp.gmail.com with ESMTPSA id bx23-20020a5d5b17000000b0033cf29a67c7sm738876wrb.7.2024.02.14.05.34.41 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 14 Feb 2024 05:34:41 -0800 (PST) Message-ID: Date: Wed, 14 Feb 2024 14:34:40 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Peter Geoghegan Cc: Melanie Plageman , Robert Haas , Andres Freund , PostgreSQL Hackers , Georgios , Thomas Munro , Konstantin Knizhnik , Dilip Kumar References: <8ec36f51-b863-60e3-20e2-b9c981c5ce5e@enterprisedb.com> <98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com> <20231221134314.wf2rs62d37u62j7t@alap3.anarazel.de> <20231221154352.ijtg6wloa3nowivh@alap3.anarazel.de> <482ec3ff-52ad-415d-96fd-f3832a894023@enterprisedb.com> <56176b8d-956c-487e-ab09-310db4581c07@enterprisedb.com> <4867452a-b853-4813-a6da-9bb06a336f8b@enterprisedb.com> <4f5f16ef-df1e-4e09-9b3f-2e0961ab5117@enterprisedb.com> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 2/13/24 20:54, Peter Geoghegan wrote: > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > wrote: >> On 2/7/24 22:48, Melanie Plageman wrote: >> I admit I haven't thought about kill_prior_tuple until you pointed out. >> Yeah, prefetching separates (de-synchronizes) the two scans (index and >> heap) in a way that prevents this optimization. Or at least makes it >> much more complex :-( > > Another thing that argues against doing this is that we might not need > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > We could end up scanning a whole extra leaf page (including all of its > tuples) for want of the ability to "push down" a LIMIT to the index AM > (that's not what happens right now, but it isn't really needed at all > right now). > I'm not quite sure I understand what is "this" that you argue against. Are you saying we should not separate the two scans? If yes, is there a better way to do this? The LIMIT problem is not very clear to me either. Yes, if we get close to the end of the leaf page, we may need to visit the next leaf page. But that's kinda the whole point of prefetching - reading stuff ahead, and reading too far ahead is an inherent risk. Isn't that a problem we have even without LIMIT? The prefetch distance ramp up is meant to limit the impact. > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). > Good point. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). > In control how? What would be the control flow - what part would be managed by the index AM? I initially did the prefetching entirely in each index AM, but it was suggested doing this in the executor would be better. So I gradually moved it to executor. But the idea to combine this with the streaming read API seems as a move from executor back to the lower levels ... and now you're suggesting to make the index AM responsible for this again. I'm not saying any of those layering options is wrong, but it's not clear to me which is the right one. > There doesn't necessarily need to be much code duplication to make > that work. Offhand I suspect it would be kind of similar to how > deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets > by with generic logic implemented by > index_compute_xid_horizon_for_tuples -- that's all that we need to > determine a snapshotConflictHorizon value for recovery conflict > purposes. Note that index_compute_xid_horizon_for_tuples() reads > *index* pages, despite not being aware of the caller's index AM and > index tuple format. > > (The only reason why nbtree needs a custom solution is because it has > posting list tuples to worry about, unlike GiST and unlike Hash, which > consistently use unadorned generic IndexTuple structs with heap TID > represented in the standard/generic way only. While these concepts > probably all originated in nbtree, they're still not nbtree > implementation details.) > I haven't looked at the details, but I agree the LP_DEAD deletion seems like a sensible inspiration. >>> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps >>> there is an easier way to fix this, as I don't think the mvcc test >>> failed on Tomas' version. >>> >> >> I kinda doubt it worked correctly, considering I simply ignored the >> optimization. It's far more likely it just worked by luck. > > The test that did fail will have only revealed that the > kill_prior_tuple wasn't operating as expected -- which isn't the same > thing as giving wrong answers. > Possible. But AFAIK it did fail for Melanie, and I don't have a very good explanation for the difference in behavior. > Note that there are various ways that concurrent TID recycling might > prevent _bt_killitems() from setting LP_DEAD bits. It's totally > unsurprising that breaking kill_prior_tuple in some way could be > missed. Andres wrote the MVCC test in question precisely because > certain aspects of kill_prior_tuple were broken for months without > anybody noticing. > > [1] https://www.postgresql.org/docs/devel/index-locking.html Yeah. There's clearly plenty of space for subtle issues. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company