MIME-Version: 1.0
References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru>
In-Reply-To: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru>
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 26 Aug 2025 12:33:26 +0200
Message-ID: 
 <CAEze2Wg2a8LQDRocVPa7Df2qXQLrTz-xSu1k3xP3x_6ABVo1Jw@mail.gmail.com>
Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat
To: Andrey Borodin <x4mmm@yandex-team.ru>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>,
 Kirk Wolak <wolakk@gmail.com>,
	Nikolay Samokhvalov <nik@postgres.ai>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAEze2Wg2a8LQDRocVPa7Df2qXQLrTz-xSu1k3xP3x_6ABVo1Jw%40mail.gmail.com>
Precedence: bulk

On Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> Together with Kirk and Nik we spent several online hacking sessions to ta=
ckle index bloat problems [0,1,2]. As a result we concluded that B-tree ind=
ex page merge should help to keep an index from pressuring shared_buffers.
>
> We are proposing a feature to automatically merge nearly-empty B-tree lea=
f pages during VACUUM operations to reduce index bloat. This addresses a co=
mmon issue where deleted tuples leave behind sparsely populated pages that =
traditional page deletion cannot handle because they're not completely empt=
y.
>
> *** Implementation Overview:
>
> The patch adds a new `mergefactor` reloption (default 5%, range 0-50%) th=
at defines when a page becomes a candidate for merging. During vacuum, when=
 a page exceeds this threshold (e.g., 95% empty with default settings), we =
attempt to move the remaining tuples to the right sibling page and then del=
ete the source page using existing page deletion mechanisms.
>
> Key changes:
> - New `mergefactor` reloption for configurable merge thresholds
> - Detection logic in `btvacuumpage()` to identify merge candidates
> - Tuple movement implementation in `_bt_unlink_halfdead_page()`
> - WAL logging enhancements to handle cross-page dependencies during repla=
y
>
> The last part needs further improvements (it's simply REGBUF_FORCE_IMAGE =
for now), but I want to start a discussion and ask for known problems of th=
e approach.
>
> *** Correctness:
>
> The implementation reuses existing locking, critical sections and WAL log=
ging infrastructure. To handle cross-page dependencies during WAL replay, w=
hen tuples are merged, the right sibling buffer is registered with `REGBUF_=
FORCE_IMAGE`, this is a temporary workaround.
>

> 1. Backward Scan Correctness: The primary concern is ensuring backward sc=
ans work correctly when pages are being merged concurrently. Since we maint=
ain the same locking protocol as existing page deletion, I believe this sho=
uld be safe, but would appreciate expert review of the approach.

I'm fairly sure there is a correctness issue here; I don't think you
correctly detect the two following cases:

1.) a page (P0) is scanned by a scan, finishes processing the results,
and releases its pin. It prepares to scan the next page of the scan
(P1).
2.) a page (A) is split, with new right sibling page B,
3.) and the newly created page B is merged into its right sibling C,
4.) the scan continues on to the next page

For backward scans, if page A is the same page as the one identified
with P1, the scan won't notice that tuples from P1 (aka A) have been
moved through B to P0 (C), causing the scan to skip processing for
those tuples.
For forward scans, if page A is the same page as the one identified
with P0, the scan won't notice that tuples from P0 (A) have been moved
through B to P1 (C), causing the scan to process those tuples twice in
the same scan, potentially duplicating results.

NB: Currently, the only way for "merge" to happen is when the index
page is completely empty. This guarantees that there is no movement of
scan-visible tuples to pages we've already visited/are about to visit.
This invariant is used extensively to limit lock and pin coupling (and
thus: improve performance) in index scans; see e.g. in 1bd4bc85. This
patch will invalidate that invariant, and therefore it will require
(significantly) more work in the scan code (incl. nbtsearch.c) to
guarantee exactly-once results + no false negatives.

Kind regards,

Matthias van de Meent
Databricks