Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1urNMH-0058Iy-98 for pgsql-hackers@arkaria.postgresql.org; Wed, 27 Aug 2025 21:07:06 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1urNME-00HW2H-W8 for pgsql-hackers@arkaria.postgresql.org; Wed, 27 Aug 2025 21:07:03 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1urNME-00HW0w-JN for pgsql-hackers@lists.postgresql.org; Wed, 27 Aug 2025 21:07:03 +0000 Received: from mail-wr1-x429.google.com ([2a00:1450:4864:20::429]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1urNMA-002Cmy-0E for pgsql-hackers@postgresql.org; Wed, 27 Aug 2025 21:07:00 +0000 Received: by mail-wr1-x429.google.com with SMTP id ffacd0b85a97d-3c79f0a606eso172043f8f.0 for ; Wed, 27 Aug 2025 14:06:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20230601.gappssmtp.com; s=20230601; t=1756328815; x=1756933615; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=m2CTFql+hswR4JiEeXcLTKhCAzD6qlE6rU1pcDjI0j0=; b=TTwXve5s/hUAoM3KJr3SJjpgIVHwqHhPCJ9DBfnpXTvScALMRM4+2DF7WHD2zX+vik edtDnavCWLg0IT29omdV4WLDccb3e+esvIo8JcUogA47F6Y3F4htPxIfHb15ituV3aX9 dnvyMR5fERqUe1Nvx7KjsYYWsrfXtGxAq98janjBdtOdgIxV6Bcmia2EBRC5Nsc3dFlF hkeYeg2nhgyH46ygQtTkLFK4a8K1cZiIFRGt0iiOt8y0thl6iuVz9vJm0vac0jnm1z94 fEJg5Wu2jSOkd74046zCEAeLrKAgAn2HwsDVsADK1HC8OjOFbLVqKLCF2TxaKzfqvUUt zX9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1756328815; x=1756933615; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=m2CTFql+hswR4JiEeXcLTKhCAzD6qlE6rU1pcDjI0j0=; b=s1t3GjIhBnuNpnSo55J57voN8A3kxfaKXq50MUElAPfyPlhL05syRZtOgW3XHPZ3uW xnmjA8/85ycT9lp4NFvIKUZkHVIty33Y39JsiL0fbChhJRIwFqrqd03LVhsDzTXeC2tW VJwHHRrKiYk5J0xAFu70giPDuhgrQnE80eU8p7qR4XOPIaVBlQUyQNKatP348KWBpcvB eHNiWYGTjZssCTJft/HcsFp7vC+8edZsVx0KXXR23BUHgUDNukaAhMZw21CJlrLtddUc diZqilCVV14ABQ/C46zp2koitrrxS7RMR/25q9tbckncvNGakfTXkw5xirMJ0DJv4uiO rjPw== X-Gm-Message-State: AOJu0YxITUKGdio7VUhrkeQlz8xhfws6nQC8K/ne6xinkhubuLbo24Jj rW9UtbQo4pemawnZhHZvLeWsoyt3IV+/7pVlWWDrZiW5KAxL13d00tZAquS2XIin7PJ1rmOjlnd FHkdN+u72yDPd9mj/nv/x+Fos5qoMCYBLnjQ1XdU5kQ== X-Gm-Gg: ASbGncu0qO77niKRNmEiex8J2BYcr5y5+w/8nKzzGf4Z/v7M289t/mcacFSw4rWZWvu DL+o6pgDKPZbGkeVDyWfZHfaAUFVTquzBtpOt8VljuNyKXBh0ykuLcRfmCWtOMhjqWPQh08CGU8 NkanEl8eWuJ7/UgnPT8gmvNgAiIAw8c+Zqk39JQoPhzZdcP4jiiXKMDBZ06z2YS4OZCko5agtIS X5seDA= X-Google-Smtp-Source: AGHT+IFBS7BTXNa8iVBJwtpRlSeL8DY0C2rXI9znLCt/0Hh9kDSgx0qRkEKn1dIz08xSA2GfjmKM/JXF/DB/VDSkrOQ= X-Received: by 2002:a05:6000:401f:b0:3cb:50f6:35da with SMTP id ffacd0b85a97d-3cb50f639f1mr5214579f8f.29.1756328815165; Wed, 27 Aug 2025 14:06:55 -0700 (PDT) MIME-Version: 1.0 References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> In-Reply-To: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> From: Peter Geoghegan Date: Wed, 27 Aug 2025 17:06:28 -0400 X-Gm-Features: Ac12FXwl8g4NyiQbKT_eeMNytJMil0oc2KOovlAJW3FL636-bT6CcbLFiY0PsxI Message-ID: Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat To: Andrey Borodin Cc: pgsql-hackers , Kirk Wolak , Nikolay Samokhvalov Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Aug 26, 2025 at 5:40=E2=80=AFAM Andrey Borodin wrote: > *** Correctness: > > The implementation reuses existing locking, critical sections and WAL log= ging infrastructure. To handle cross-page dependencies during WAL replay, w= hen tuples are merged, the right sibling buffer is registered with `REGBUF_= FORCE_IMAGE`, this is a temporary workaround. I don't think that you can reuse existing locking for this. It needs to be totally reevaluated in light of the way that you've changed page deletion. In general, page deletion ends up being very complicated with the Lehman & Yao design -- even with the existing restrictions. It's a natural downside of the very permissive locking rules: there is no need for most searchers and inserters to ever hold more than a single buffer lock at a time. Searchers generally hold *no* locks when descending the index at points where they're "between levels". Almost anything can happen in the meantime -- including page deletion (it's just that the deleted page cannot be concurrently recycled right away, which is kind of a separate process). That original block number reference still has to work for these searchers, no matter what. It must at least not give them an irredeemably bad picture of what's going on; they might have to move right to deal with concurrent page deletion and page splits, but that's it. If you merge together underful pages like this, you change the key space boundaries between pages. I don't see a way to make that safe/an atomic operation, except by adding significantly more locking in the critical path of searchers. > *** Current Status & Questions: > > The patch successfully reduces index bloat and handles basic scenarios, b= ut we've identified some areas that need community input: Have you benchmarked it? I wouldn't assume that free-at-empty actually is worse than merging underfull pages. It's complicated, but see this paper for some counter-arguments: https://www.sciencedirect.com/science/article/pii/002200009390020W This old Dr. Dobbs article from the same authors gives a lighter summary of the same ideas: https://web.archive.org/web/20200811014238/https://www.drdobbs.com/reexamin= ing-b-trees/184408694?pgno=3D3 In any case, I believe that this optimization isn't widely implemented by other major RDBMSs, despite being a textbook technique that was known about and described early in the history of B-Trees. --=20 Peter Geoghegan