Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uquPG-00F25T-4F for pgsql-hackers@arkaria.postgresql.org; Tue, 26 Aug 2025 14:12:15 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uquPF-006qRl-Jc for pgsql-hackers@arkaria.postgresql.org; Tue, 26 Aug 2025 14:12:14 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uquPF-006qRd-6X for pgsql-hackers@lists.postgresql.org; Tue, 26 Aug 2025 14:12:13 +0000 Received: from mail-wm1-x32e.google.com ([2a00:1450:4864:20::32e]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uquPD-001qqU-10 for pgsql-hackers@postgresql.org; Tue, 26 Aug 2025 14:12:12 +0000 Received: by mail-wm1-x32e.google.com with SMTP id 5b1f17b1804b1-45b55ed86b9so20326795e9.0 for ; Tue, 26 Aug 2025 07:12:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1756217530; x=1756822330; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=DUmUAdrLjg6RWQXTL55OSA+Bc4NXWYAxpdfnT13dBjo=; b=KfSUjhQunJJYIW7Ga1IyyockfOi3blMnBrGPgVoU0w/sXNYhr4EflbteT0hFeAs+ZX ptgtaxdnIgytp8bl7Ticad80ZD3jWpbT5Jh+rvNacSJ+NVoMqUH7BhTNgZX3ZSfj3uJ+ /HkGpd2hbJ8ImDrUrWvQOrtQMe5L0ty4QBGkd8pRpXnegO25QDkJvDHG1ao2O9XSF6iV TMHooWxPFergfMNIvz+1Nlkj05F4pDwBeyEeIvU2dSo7mdGYaCijJQ7W9FOj8Oj/8g7/ bmlWwFTDyxfB5EAcGDp1+p1VqI7c43OI2P1ycGiKHsN/4v445ZIWiK8guuJ1ZYBVxddX oVUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1756217530; x=1756822330; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=DUmUAdrLjg6RWQXTL55OSA+Bc4NXWYAxpdfnT13dBjo=; b=RYsENB4kNq8Y+h6+UpO4aroBO5uhv6WW8RQgROgp2DBz6A1KInhw1dBcoBn1dKdeNB +BdfJhbmTPW1XTK11I0Zy2Dd24Z7oKZlKSk9eRF/h4kssf5WIQVxgBysRMfeYwnqqL+N hDi6HjP7UBJBejvvdNH+Fx83Ur6YkIWVNoOwZR4F+lCHRUTCVRqREfMSJZGQ2ZHOMFbf R4T2QtsIugXCIT7eG7MDdLdD+CMRPj3GvQ05+40jCy6pugN7V7mwgw4+4CjcvyLDf8FY pvA2HxT+RIOcfzpSvkNgEw7BD/036+IMd1BFub+xd/pJDNMDfWoS2JknTNiJDrxhf6uQ kcAA== X-Forwarded-Encrypted: i=1; AJvYcCXMvBxkpqPQsGAApN3ZCYgJi8OYf6Y+N1XPqGoyzjcgq5Z73TPDWkq+ldpGUSeVSKNrRST8qfFBkGgsMt08@postgresql.org X-Gm-Message-State: AOJu0YzVEWNKUS5ac9Glhk0lZhkwL9BWeOxhthcXk3i61RF7p+je9mIF KX3Dp0IKJLZrUq9x6YJ6rMvi4Hofq4U9Q2rz3IJlfZF594NonpE+l/swm8wwtVcjUQe9NWHRDsf +Md66hCcXEgSH+r/t4M4VSI/40W9A7Do= X-Gm-Gg: ASbGncuiaVLfjISyXnc0PvY2RwyhCNwQXJwhJcBcdlst6cD5p9n7uwbYFpI9JzvhAhn xochxtgujxmj6++MIPLXJXRBeTriLZkdAalxgMJ4oClV8bwnPM1zcmTBJ11vwtVVrEPpzrtMLhK S6bsCbK0gqVRFMTvzPT8CcBlta3ecVmCl0D2hblZfgTa/zAjxpKiU8k1Mf3EeR6epN71EgTxAfe FtOblEA X-Google-Smtp-Source: AGHT+IGJYYuJ1L9GNjw5Qf7/mHkVMttEGTn7NZsKVSsKJRsodrdbSF/OnxMV4hi6Y2JcE6JcvJKvBAbdV3dDO9E9e80= X-Received: by 2002:a05:600c:3b25:b0:459:da89:b06 with SMTP id 5b1f17b1804b1-45b517b008dmr199625255e9.16.1756217530346; Tue, 26 Aug 2025 07:12:10 -0700 (PDT) MIME-Version: 1.0 References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> In-Reply-To: From: Kirk Wolak Date: Tue, 26 Aug 2025 10:11:43 -0400 X-Gm-Features: Ac12FXzheoN_IicmiIMJKwnUeLeaE2qYbNQamKyTtNbUy6eYFN2qx1g2amACp1A Message-ID: Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat To: Matthias van de Meent Cc: Andrey Borodin , pgsql-hackers , Nikolay Samokhvalov Content-Type: multipart/alternative; boundary="0000000000000b7350063d4542ea" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000000b7350063d4542ea Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, Aug 26, 2025 at 6:33=E2=80=AFAM Matthias van de Meent < boekewurm+postgres@gmail.com> wrote: > On Tue, 26 Aug 2025 at 11:40, Andrey Borodin wrote= : > > > > Hi hackers, > > > > Together with Kirk and Nik we spent several online hacking sessions to > tackle index bloat problems [0,1,2]. As a result we concluded that B-tree > index page merge should help to keep an index from pressuring > shared_buffers. > > > > We are proposing a feature to automatically merge nearly-empty B-tree > leaf pages during VACUUM operations to reduce index bloat. This addresses= a > common issue where deleted tuples leave behind sparsely populated pages > that traditional page deletion cannot handle because they're not complete= ly > empty. > > > ... > I'm fairly sure there is a correctness issue here; I don't think you > correctly detect the two following cases: > > 1.) a page (P0) is scanned by a scan, finishes processing the results, > and releases its pin. It prepares to scan the next page of the scan > (P1). > 2.) a page (A) is split, with new right sibling page B, > 3.) and the newly created page B is merged into its right sibling C, > 4.) the scan continues on to the next page > > For backward scans, if page A is the same page as the one identified > with P1, the scan won't notice that tuples from P1 (aka A) have been > moved through B to P0 (C), causing the scan to skip processing for > those tuples. > For forward scans, if page A is the same page as the one identified > with P0, the scan won't notice that tuples from P0 (A) have been moved > through B to P1 (C), causing the scan to process those tuples twice in > the same scan, potentially duplicating results. > > NB: Currently, the only way for "merge" to happen is when the index > page is completely empty. This guarantees that there is no movement of > scan-visible tuples to pages we've already visited/are about to visit. > This invariant is used extensively to limit lock and pin coupling (and > thus: improve performance) in index scans; see e.g. in 1bd4bc85. This > patch will invalidate that invariant, and therefore it will require > (significantly) more work in the scan code (incl. nbtsearch.c) to > guarantee exactly-once results + no false negatives. > > Kind regards, > > Matthias van de Meent > Databricks > This was one of our concerns. We will review the patch mentioned. I do have a question, one of the IDEAS we discussed was to ADD a new page that combined the 2 pages. Making the deletion "feel" like a page split. This has the advantage of leaving the original 2 pages alone for anyone who is currently traversing. And like the page split, updating the links around while marking the pages for the new path. The downside to this approach is that we are "adding 1 page to then mark 2 pages as unused". Could you comment on this secondary approach? Thanks in advance! Kirk --0000000000000b7350063d4542ea Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Tue, Aug 26, 2025 at 6:33=E2=80=AFAM M= atthias van de Meent <= boekewurm+postgres@gmail.com> wrote:
On= Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> Together with Kirk and Nik we spent several online hacking sessions to= tackle index bloat problems [0,1,2]. As a result we concluded that B-tree = index page merge should help to keep an index from pressuring shared_buffer= s.
>
> We are proposing a feature to automatically merge nearly-empty B-tree = leaf pages during VACUUM operations to reduce index bloat. This addresses a= common issue where deleted tuples leave behind sparsely populated pages th= at traditional page deletion cannot handle because they're not complete= ly empty.
>
...
I'm fairly sure there is a correctness issue here; I don't think yo= u
correctly detect the two following cases:

1.) a page (P0) is scanned by a scan, finishes processing the results,
and releases its pin. It prepares to scan the next page of the scan
(P1).
2.) a page (A) is split, with new right sibling page B,
3.) and the newly created page B is merged into its right sibling C,
4.) the scan continues on to the next page

For backward scans, if page A is the same page as the one identified
with P1, the scan won't notice that tuples from P1 (aka A) have been moved through B to P0 (C), causing the scan to skip processing for
those tuples.
For forward scans, if page A is the same page as the one identified
with P0, the scan won't notice that tuples from P0 (A) have been moved<= br> through B to P1 (C), causing the scan to process those tuples twice in
the same scan, potentially duplicating results.

NB: Currently, the only way for "merge" to happen is when the ind= ex
page is completely empty. This guarantees that there is no movement of
scan-visible tuples to pages we've already visited/are about to visit.<= br> This invariant is used extensively to limit lock and pin coupling (and
thus: improve performance) in index scans; see e.g. in 1bd4bc85. This
patch will invalidate that invariant, and therefore it will require
(significantly) more work in the scan code (incl. nbtsearch.c) to
guarantee exactly-once results + no false negatives.

Kind regards,

Matthias van de Meent
Databricks

This was one of our concerns= .=C2=A0 We will review the patch mentioned.
I do have a question,= one of the IDEAS we discussed was to ADD a new page that combined the 2 pa= ges.
Making the deletion "feel" like a page split.

This has the advantage of leaving the original 2 pages= alone for anyone who is currently traversing.
And like the page = split, updating the links around while marking the pages for the new path.<= /div>

The downside to this approach is that we are "= ;adding 1 page to then mark 2 pages as unused".

Could you comment on this secondary approach?

T= hanks in advance!

Kirk
=C2=A0
--0000000000000b7350063d4542ea--