Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vIn97-00EZlh-0Z for pgsql-hackers@arkaria.postgresql.org; Tue, 11 Nov 2025 12:06:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vIn94-005t8f-0N for pgsql-hackers@arkaria.postgresql.org; Tue, 11 Nov 2025 12:06:46 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vIn93-005t8W-2b for pgsql-hackers@lists.postgresql.org; Tue, 11 Nov 2025 12:06:45 +0000 Received: from mail-lj1-x22b.google.com ([2a00:1450:4864:20::22b]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vIn91-0079Xe-1w for pgsql-hackers@postgresql.org; Tue, 11 Nov 2025 12:06:45 +0000 Received: by mail-lj1-x22b.google.com with SMTP id 38308e7fff4ca-36295d53a10so31451641fa.0 for ; Tue, 11 Nov 2025 04:06:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1762862802; x=1763467602; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=xX9FQeREfrY2m6AqPwCXmrphR0E7ZUw78P3v5vlimhw=; b=Vanvq3vn210/2jccsF7jv1FUWKULx4yui5CCcgxTDfFONaBIB5hGdbKSDMRbGMlOhm /8PA/hE0FT+CYVT++PlBY8bkzp5GZVE1zo2jjmvG2S2olhs1mDE+kSD3s6EjwgkLXgRk 4CS3fPHH8H+YuA92oYsyzJzwVtCe4TiJqnmGCKL4rerFz3fXeF6/Aosm3sRhTSGKAFLB zz5o08fsQimfX7bBsDJkntFOlVSEPazQcRb/FghHXRzXwhuh84Slf3Om6XIxB1EIgO3E /7NRtNuE5RweaewNE5cGKtoqg9KfCYrKzPtU3YQT+b/+qEzKSGXwj/bW2w0IcA1XpJYg QIJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762862802; x=1763467602; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=xX9FQeREfrY2m6AqPwCXmrphR0E7ZUw78P3v5vlimhw=; b=SDvYO4UV4ajzgF4vlcfWCD0k8nMtLzC5qjORbI3irWJsdQHDgIJZW6XVIGIz6hjdo1 ImkqlpylImm3ldeUFoC0zgep3dJFhesUuWZ3bjE2uAHJW8f0T9wMToXRa9yoEUfheXX5 cUNbsY0dtfQTUPJS83uQMKOuBUA2C8U52YJVQd7H/j0XBn+tf4apWLQoA1YCMj0pZlKG l1J/rsvCqkewVxhvqEyUQe124sm+7WcPQVWh9Ibf0ngyZSGiL2TzrQiR3WmS2Pcyeg1P QFtLhz1rDxT/lk+XoWWuqlSHTT5lII2N74BBSDO/SlfRME1Q7hKh7EOPSdZWV+BIKxeZ KHEw== X-Forwarded-Encrypted: i=1; AJvYcCXRe6K52orJF3Ghyt3/TsiErhbvzpCYcfXjheCTYxqkXaJTiwVsfT63LAdy3/ufiEnli1TYGfjGEmDdDYIn@postgresql.org X-Gm-Message-State: AOJu0YxgTJSXaM3ZMPi8NV8VHRVZXn1ELCbxcAP0wnJnr19DIulwDo2V W1/9LqAOB6nYcjIFUL95EJUW+cyaBcCQ57ET2daAc6xHPt7mN09tndkYxy9dI/iOV8UC0DJqKYq n+0rptzylu+hn+4/dM1yJAwJe25ODNNY= X-Gm-Gg: ASbGncuh3t1eMru7o8ed3ciuJeF+MWbSmZaBy2S3fVru4JfmRJsADjEFhWCr+6Us5yl CqGl8dUwFnEVBElw05R40RHEf3Nro8o3E2fBLEOrbUwYXxp0TOjkpGh+ycLrZkpeHlEYICys9l3 q2iAnpnkRxT6Kst2M+GxHn6ue3PGwuD0Dwe7O/6i2nrNOEfj1oeE9UuqZDfyxDeVtQ0aFbIrs91 icIWbDQ4L13Mnoobmu8vXwsF5KQmm2dl1prEvENd3CKpCmVap1K0ZHv7u2ka48ov7tZidxUAJ3H njaaBFtCSEv0G/9FQFwD5Osl/IiPIj0v5uY2LjTnK5zzuki87vOAggfzY4Ydxo9Lsn3C8/ZXX0V 41xk= X-Google-Smtp-Source: AGHT+IHqhRuM1ehrxHNRCn8FqNsqYoPevt1Gq87PP1/8pXcGl69iy97xPrxf7b2+zbW8wNKAA/ZQ8rk7LCETEHwou2Y= X-Received: by 2002:a2e:7c13:0:b0:37a:39c1:721e with SMTP id 38308e7fff4ca-37a7b197cc7mr23423071fa.17.1762862802200; Tue, 11 Nov 2025 04:06:42 -0800 (PST) MIME-Version: 1.0 References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru> <3F0FC1DC-3653-4F61-A46F-46FF81C14915@yandex-team.ru> In-Reply-To: From: Matthias van de Meent Date: Tue, 11 Nov 2025 13:06:30 +0100 X-Gm-Features: AWmQ_bnq4Dl13OlAEi6H2wa-F8_iqp_dTfq2UVtMXYzn-lXFA4LkHeBS7zUW4cc Message-ID: Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat To: Andrey Borodin Cc: Peter Geoghegan , pgsql-hackers , Kirk Wolak , Nikolay Samokhvalov Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sun, 31 Aug 2025 at 14:15, Andrey Borodin wrote: > > On 29 Aug 2025, at 13:39, Andrey Borodin wrote: > > > > I think to establish baseline for locking correctness we are going to s= tart from writing index scan tests, that fail with proposed merge patch and= pass on current HEAD. I want to observe that forward scan is showing dupli= cates and backward scan misses tuples. > > Well, that was unexpectedly easy. See patch 0001. It brings a test where = we create sparse tree, and injection point that will wait on a scan steppin= g into some middle leaf page. > Then the test invokes vacuum. There are ~35 leaf pages, most of them will= be merged into just a few pages. > As expected, both scans produce incorrect results. > t/008_btree_merge_scan_correctness.pl .. 1/? > # Failed test 'Forward scan returns correct count' > # at t/008_btree_merge_scan_correctness.pl line 132. > # got: '364' > # expected: '250' > > # Failed test 'Backward scan returns correct count' > # at t/008_btree_merge_scan_correctness.pl line 133. > # got: '142' > # expected: '250' > # Looks like you failed 2 tests of 2. > > > > From that we will try to design locking that does not affect performanc= e significantly, but allows to merge pages. Perhaps, we can design a way to= switch new index scans to "safe mode" during index vacuum and waiting for = existing scans to complete. > > What if we just abort a scan, that stepped on the page where tuples were = moved out? I don't think that's viable nor valid. We can't let vacuum processes abort active scans; it'd go against the principles of vacuum being a background process; and it'd be a recipe for disaster when it triggers in catalog scans. It'd probably also fail to detect the case where a non-backward index scan's just-accessed data was merged to the right, onto the page that the scan will access next (or, a backward scan's just-accessed data merged to the left). I think the issue isn't _just_ what to do when we detect that a page was merged (it's relatively easy to keep track of what data was present on a merged-now-empty page), but rather how we detect that pages got merged, and how we then work to return to a valid scan position. Preferably, we'd have a way that guarantees that a scan can not fail to notice that a keyspace with live tuples was concurrently merged onto a page, what keyspace this was, and whether that change was relevant to the currently active index scan; all whilst still obeying the other nbtree invariants that scans currently rely on. > I've prototype this approach, please see patch 0002. Maybe in future we w= ill improve locking protocol if we will observe high error rates. > Unfortunately, this approach leads to default mergefactor 0 instead of 5%= . > > What do you think? Should we add this to CF or the idea is too wild for a= review? Aborting queries to be able to merge pages is not a viable approach; -1 on that. It would make running VACUUM concurrently with other workloads unsafe, and that's not acceptable. Kind regards, Matthias van de Meent Databricks (https://databricks.com)