Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uqqzo-00E7ro-N0 for pgsql-hackers@arkaria.postgresql.org; Tue, 26 Aug 2025 10:33:46 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uqqzn-004k10-MA for pgsql-hackers@arkaria.postgresql.org; Tue, 26 Aug 2025 10:33:44 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uqqzn-004k0r-9u for pgsql-hackers@lists.postgresql.org; Tue, 26 Aug 2025 10:33:43 +0000 Received: from mail-lj1-x22f.google.com ([2a00:1450:4864:20::22f]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uqqzl-001w7w-0t for pgsql-hackers@postgresql.org; Tue, 26 Aug 2025 10:33:43 +0000 Received: by mail-lj1-x22f.google.com with SMTP id 38308e7fff4ca-33667661214so23192051fa.2 for ; Tue, 26 Aug 2025 03:33:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1756204420; x=1756809220; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Do7Lnpd33EaaHr6dlwdQliQ9nON7BVqQRs4tZYrmvGc=; b=X28YLCLVxUsBeB0/85b+N30nMGEjvUH/GT+61vcXZxJUesC5eIplxloTEqF4ok3Zwh KRoFpCXCwDnubUHQACwUdeVIUyfFaiNlBA4ehwJsMsKOGE/TRtomHU0G7BNctTH+R/qi 7tab5XOG5qh/PJOJsFy7rtVhn1Mh9p2HEdjJc/SHZuUexmWJlFJb03AHgju+EZpwREXi 6G73q8cAxVljbsyQ3qeFBG5uSWUddc+kqn6xQq2y7cDOd60Wlwzb4SRh2SdSBD6TG9FH +Q9cco53NSb2Sqc7YM3G0OvezXDeB4BvCREpPhu+9N++ZcpiBRRoGgIpxmMqgm8+7tnN fCMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1756204420; x=1756809220; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Do7Lnpd33EaaHr6dlwdQliQ9nON7BVqQRs4tZYrmvGc=; b=f4pR+fwVsEi/yDJFu+p16E2j/uhWzuE8tqsJ7tgqmxawDZsYKok88K8QAhyxTfoYob JNatQTIEuY9UV7Q1FCLTqWGaDazIjda1EI+lj6vfXXQHLCOz4bOG1bfimbHmzWr6ZW5h OGIkTWoqAAB9ohZCHkR5mEIxBrqED7Pq5jXi5b4GeovXGfzUtS7k9psZB0yk9yQ9c6zk gTy2yXaxfI8NRj0Df38klXhuyEyRXteuOD8czqizHktwmYj6IaANd1KHspQ0AqFWaSyR 57HuEnexE9bmFePtkgc3xcPd/gdFhCb9mtQu2ZKadzGAeY8MBHQC7nNUOibVNHN6HTWa jKWQ== X-Gm-Message-State: AOJu0Yyg16F/EMpxHmJwAWkxX24WZpUW7oLGr4JOiBReFgOQ4mG6KCcH XTcyT0TzM1X+lp52yz45ppmqS3KWKCmcsl+SsUEnPQA0BwU8Mui2k+7hAv09aYVHBrLvnaR0v0I KeFyzQxkAUpmuTv4Yt82Nu7s5kdRvWOk= X-Gm-Gg: ASbGnct+QiYIFlz4VNSewB9OhCHqlwqjxlzekVBPBO+3X813hhngwHI4LHcSi9fTuu+ +tnpigc4qXnfeJ1j7ZpIrw8B5Y3xlojVpj3xIGJl5BAIEFYYXOv+ypULOC6R91iW53fpU9CnCfH Ukhuwys3lEhZ1YqAuOQeUQxNR2cS3+wCYIppqI/Gy/G9YFzAG/Habs0gBcgKrVSu4FIl9+TLyUn zuwZOmbMASDO1H/ X-Google-Smtp-Source: AGHT+IHIT/rJfC090BRKXDPDhTbrxmP/HoIVfFJ8PlloV6S+xeD9tef6pveNAFxGxiNU7yYG7dDsjddroBLdn/QkJ4k= X-Received: by 2002:a05:651c:20cb:20b0:32a:6aa0:217b with SMTP id 38308e7fff4ca-33650f35b7bmr30701381fa.25.1756204419196; Tue, 26 Aug 2025 03:33:39 -0700 (PDT) MIME-Version: 1.0 References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> In-Reply-To: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> From: Matthias van de Meent Date: Tue, 26 Aug 2025 12:33:26 +0200 X-Gm-Features: Ac12FXwOYz4ZIBgCaXPgIu5feY0Zp82e_F4e6_ysfvpk5zjgvpIKazgKo5dM7C0 Message-ID: Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat To: Andrey Borodin Cc: pgsql-hackers , Kirk Wolak , Nikolay Samokhvalov Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, 26 Aug 2025 at 11:40, Andrey Borodin wrote: > > Hi hackers, > > Together with Kirk and Nik we spent several online hacking sessions to ta= ckle index bloat problems [0,1,2]. As a result we concluded that B-tree ind= ex page merge should help to keep an index from pressuring shared_buffers. > > We are proposing a feature to automatically merge nearly-empty B-tree lea= f pages during VACUUM operations to reduce index bloat. This addresses a co= mmon issue where deleted tuples leave behind sparsely populated pages that = traditional page deletion cannot handle because they're not completely empt= y. > > *** Implementation Overview: > > The patch adds a new `mergefactor` reloption (default 5%, range 0-50%) th= at defines when a page becomes a candidate for merging. During vacuum, when= a page exceeds this threshold (e.g., 95% empty with default settings), we = attempt to move the remaining tuples to the right sibling page and then del= ete the source page using existing page deletion mechanisms. > > Key changes: > - New `mergefactor` reloption for configurable merge thresholds > - Detection logic in `btvacuumpage()` to identify merge candidates > - Tuple movement implementation in `_bt_unlink_halfdead_page()` > - WAL logging enhancements to handle cross-page dependencies during repla= y > > The last part needs further improvements (it's simply REGBUF_FORCE_IMAGE = for now), but I want to start a discussion and ask for known problems of th= e approach. > > *** Correctness: > > The implementation reuses existing locking, critical sections and WAL log= ging infrastructure. To handle cross-page dependencies during WAL replay, w= hen tuples are merged, the right sibling buffer is registered with `REGBUF_= FORCE_IMAGE`, this is a temporary workaround. > > 1. Backward Scan Correctness: The primary concern is ensuring backward sc= ans work correctly when pages are being merged concurrently. Since we maint= ain the same locking protocol as existing page deletion, I believe this sho= uld be safe, but would appreciate expert review of the approach. I'm fairly sure there is a correctness issue here; I don't think you correctly detect the two following cases: 1.) a page (P0) is scanned by a scan, finishes processing the results, and releases its pin. It prepares to scan the next page of the scan (P1). 2.) a page (A) is split, with new right sibling page B, 3.) and the newly created page B is merged into its right sibling C, 4.) the scan continues on to the next page For backward scans, if page A is the same page as the one identified with P1, the scan won't notice that tuples from P1 (aka A) have been moved through B to P0 (C), causing the scan to skip processing for those tuples. For forward scans, if page A is the same page as the one identified with P0, the scan won't notice that tuples from P0 (A) have been moved through B to P1 (C), causing the scan to process those tuples twice in the same scan, potentially duplicating results. NB: Currently, the only way for "merge" to happen is when the index page is completely empty. This guarantees that there is no movement of scan-visible tuples to pages we've already visited/are about to visit. This invariant is used extensively to limit lock and pin coupling (and thus: improve performance) in index scans; see e.g. in 1bd4bc85. This patch will invalidate that invariant, and therefore it will require (significantly) more work in the scan code (incl. nbtsearch.c) to guarantee exactly-once results + no false negatives. Kind regards, Matthias van de Meent Databricks