Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vyFcv-00HZ8K-0g for pgsql-hackers@arkaria.postgresql.org; Thu, 05 Mar 2026 20:48:57 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vyFct-001MWJ-18 for pgsql-hackers@arkaria.postgresql.org; Thu, 05 Mar 2026 20:48:55 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vyFZp-001JWm-1W for pgsql-hackers@lists.postgresql.org; Thu, 05 Mar 2026 20:45:46 +0000 Received: from mail-ed1-x52c.google.com ([2a00:1450:4864:20::52c]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vyFZn-000000014Q7-2vIY for pgsql-hackers@postgresql.org; Thu, 05 Mar 2026 20:45:45 +0000 Received: by mail-ed1-x52c.google.com with SMTP id 4fb4d7f45d1cf-660d77cacc2so5903894a12.1 for ; Thu, 05 Mar 2026 12:45:43 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1772743543; cv=none; d=google.com; s=arc-20240605; b=M+4uRoIhIeZGfI5AkzXreKvP7rMNkqXqe1VR83h2HmekTPKILuutb2FltZqHTGwMda t0cltSF2g2sjccOt0gldc7hgF08Gvapph6jgU3XNGc+Y1QG/nN8GYldzT6ZLDzRfOBoe MvFbXc8mNmLm9OEyr9vJu8Py+arua0TOlBWGr0S24nmSY0Mh+8qCmufUdo67PqExIx+g IyhZEQYEPfqhV+MjSgFHmQ2PKP93NjO8qNGN1Kyvj777JQkJdtqpSRcVNzmoNCic/VW/ 2MejpLweQp4uhZs2wMQ3iuakbQyXkEM3kgfRmYPTgJafhJ+M53pRwfbNsUCxQNtesGlm K+Zg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=e4agvXP1Arm60aoiJq/5lViJwtPGupVNivwxfviL/Gk=; fh=2WaUsHAVYblLasocSngqY+6iYqrirKIfX9PNyLonYbU=; b=Sjy3dqEu/ruoPJFMIuueKEy+goQsSTUAJjbWmdxN0G2/W7XKTpRncgwa6UiFVhed5z YuPBY7llBUf2z4XVw/p19TXBDUVbULLdJLNuT/DnUNJeeD3VQbhz7xjWkZC6lg9ZHgFv WjkoG6AEo5Nmu7SQPJPUKzLM6azaCl+/Fqojtw1AKBaEH6KoDEt7ObOhsscALOqKrqol Nsr9suij++vG70PxD6RxoMRGAbEWJXRILBy8MWL6YkSbrQUwl5BrC83Aeob2zU7r17oa t56PsJ7cz+BExhRKaFfNUgBelbZypnS7tLlrxBJ6l8HdMn5Fu4VQHesFJRGiSboz0ZQU Ltcw==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1772743543; x=1773348343; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=e4agvXP1Arm60aoiJq/5lViJwtPGupVNivwxfviL/Gk=; b=ndJpQW07ZeHkwLbGPUGLD8AibYHWQvPD+StaIJM2ccQ7qRWt9jare9U2sV2h7fDCBc SZKhkcww+qJfF/P/W9uU+C85uRNjc9TzdmTFnDCewqRyWwS8QpGymvA3cno8xljZeolh KBdr/qRmG5zUoyEatHM9fv1JtjElWUwKnleeirYlOy63uIzKoNUVIvemrMW2PaI6S9x5 0A0z9aXvGTKh2aXH6ALLQ5yj7ccwg8qD0urP7ARQme25EXqfg+zOXREXbOxyMsQ/hzyL 35QzqjVpPtMqHi3FNhaAJ7cftWFd7bqJSa6XseDxRPHJ862c/KpAyztmGjSc795h2XWg D5DA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1772743543; x=1773348343; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=e4agvXP1Arm60aoiJq/5lViJwtPGupVNivwxfviL/Gk=; b=wY57/RsAsch1lgdjo6sQtSkfFDHfdM8mgdfvVi+/W4hI78k4Q2bOLdZKqXexGBa376 7jP70y469wgm/iJHvXq9o5Jl8mKRG6U8QEwlXLCFUtR+gn9+jOindL0sceFBDzB3/C5C GA41Ba8k0Cbn7JQaVVnLTQItoz4MTWN1HatpZZzTkKb8dUyLUWXPQFSIh2duyJAGdK4j IAXVt7csI6UmjRAq8e2m3TJ3ca/5y0TaQK8oWXYrDz4S4xjXxp/hjIruKy+O+L3nlWzM C1cVf3CyNbpTqz2SvEQU5oZjxGBLczfJSQEi2LQLyM+gxXCzGIy0ckjPGyOyvmOPKYwV ELtw== X-Forwarded-Encrypted: i=1; AJvYcCWsB+/IcfVeNy5ntNBfrxUZPYzB8ejcFoJ4j1UZnIL1Z4nzpCGGCQFg0OeHoYuFwMPshZ5JMmr9ykwM9AsH@postgresql.org X-Gm-Message-State: AOJu0YxYeGXeJBnSWcR89iRllCg8bgwOfuJg+TRMPP3rTqBBspLcqvHX dvbZgQBLh7VYvl9vPEiChyf8HoZet1uN3O2clDDJ1BH7sGAUJiYgADmYZxnFT5hnBUGwHF1Vlvs M3eO2QYYKkd8md1a5QjYA+5NCfShHwPE= X-Gm-Gg: ATEYQzy6oXVTZy0hO7sMPJwnBkvyFHXrqZQwUH27WbkfOhbNEf15mBfysUohQYKzoki XaZ/L7d0aI/yKlr2IJU0Y8nSe2/fDGaMAR4cyE8aybfSpOSiqUtL0PhX2vR5YcXmQQ2Iih8kxbM szsz6SAwt+FKeT/1h3W2vxea58S8fAEVPWevHS8STK42jUFiS+44Lv8GVUmf9CMUjIHlAXrkEy4 OzX/BS2rOyBN0N+V5yvbsqS48RwuiF8rwGq2+yXR7vrkfq6rQWqFAaYgmgM0CpFydD2c+0s9RTO 0j4FilKTvRwAJJJbzGo5ZQOv0fc2+kZNLAlQpw== X-Received: by 2002:a05:6402:27cb:b0:658:1aed:476c with SMTP id 4fb4d7f45d1cf-6614256bd0amr2397227a12.1.1772743542730; Thu, 05 Mar 2026 12:45:42 -0800 (PST) MIME-Version: 1.0 References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> In-Reply-To: From: Madhav Madhusoodanan Date: Fri, 6 Mar 2026 02:15:31 +0530 X-Gm-Features: AaiRm51emVUFDE02B_M_MkAl74LhO1OU_R-gfKHYcd5vG3ImPfJhJXMZ9y9N2U0 Message-ID: Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat To: Matthias van de Meent Cc: Kirk Wolak , Andrey Borodin , pgsql-hackers , Nikolay Samokhvalov Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sun, Mar 1, 2026 at 4:15=E2=80=AFPM Madhav Madhusoodanan wrote: > > On Fri, Feb 27, 2026 at 5:37=E2=80=AFPM Matthias van de Meent > wrote: > > > > That is one part of the picture (the merging part), but it's missing a > > lot of details: > > - How do concurrent workloads (index scans, backwards index scans, > > index insertions) detect and recover from concurrent merges when they > > step through pages? > > - Do you have a theory or proof of correctness for the above? > > - Can this scheme be implemented without adding significant overhead > > to current workloads that don't benefit significantly from the new > > feature? > > > > One example for a problem with the given flow: where (and how) do you > > update the sibling pointers in pages A (to N from B) and D (to N from > > C)? > > You haven't explicitly locked pages A and D by the time you get to > > step 4, even though locking pages is critical to guarantee correct and > > safe operations. Simply locking them in step 4 isn't sufficient, as > > another process may have locked page A in preparation for a split, > > which would then be waiting to get a lock on page B. If you've already > > locked page B before locking page A, you'll have a deadlock. > > > > Additionally, how does this work when you find out in step 3 that your > > pages B and C each are linked to from a different parent page? You'd > > have to update both parent pages, but that would also require locking > > more than just the one parent page per level that currently gets > > locked during page deletion. > > Ohh I see, I haven't included the dynamics of such flows too. I might > have to think this through. > Thank you for pointing it out! > > Just to understand what we can control (and to help internalize the > constraints so that I can think through a strategy): to handle > transient merge states, are we intending to update core? Or is the > bloat reduction functionality intended to be abstracted into an > extension? > > I'm coming up with ways that might potentially update index amcheck > and the like. > > Thanks in advance! > > Madhav Some parts of the merge flow that I am coming up with are as follows (assuming tuples from index page B are migrated rightwards into C): 1. Leaving B's tuples as it is even after merge, to remove the possible risk of scans "skipping over" tuples. Essentially, the tuples then would be "copied" into C. 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say, MERGE_SRC and MERGE_DEST respectively) before the actual merge process, then marking the pages with another flag upon completion (MERGE_COMPLETE) so that other processes can handle transient merge states. 3. For example, scans that reach page B post-merge (MERGE_SRC + MERGE_COMPLETE) would be made to skip to the page to its right. 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B)= . Note: The underlying assumption is that B and C both are children of a single parent page P. For a start I'm picturing this flow only for sibling nodes, not for cousin nodes (B and C having different parent nodes). Would the above approach be alright? These are not exhaustive steps, I'm only raising these lines of thought so that I can fast-fail them (if there are any issues). I'd be happy to provide a doc of the exact flow I'm picturing (how to ensure TIDs are read exactly-once, how the merge process would hold up with other concurrent processes, etc), incase my thoughts seem promising. Thanks in advance! Madhav