Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vE3xj-0044Vx-Ks for pgsql-hackers@arkaria.postgresql.org; Wed, 29 Oct 2025 11:03:31 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1vE3xi-0006jF-Jl for pgsql-hackers@arkaria.postgresql.org; Wed, 29 Oct 2025 11:03:29 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vE3xi-0006j5-6b for pgsql-hackers@lists.postgresql.org; Wed, 29 Oct 2025 11:03:29 +0000 Received: from mail-qt1-x82d.google.com ([2607:f8b0:4864:20::82d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vE3xe-004N2R-2b for pgsql-hackers@lists.postgresql.org; Wed, 29 Oct 2025 11:03:28 +0000 Received: by mail-qt1-x82d.google.com with SMTP id d75a77b69052e-4ed0f3d4611so17357471cf.3 for ; Wed, 29 Oct 2025 04:03:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761735806; x=1762340606; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=qLST4nbO7XQJlo58ilzjIY0Qa7R+94Om/iHFzEEXBS4=; b=Y51fat5bJJzlrYnxYjojg0V90HRbAgpPnQOvsQYEfENyBkDJIvzn6j4/OWeLiFEM+1 lRIr9/QERfnWjTRonztnRyu6F/2EioN1pY69P8PjuxSJqqVqeuby/bgi84t/aVwENvQK eQjHjyYoBuxdTUaA5zFp3T/SWXz193pidS+J6ajJafbuCj4UqdDuLRNeQy0IIMcxlMCL Bm7gCRzVTgEIOJXHSHcorB1rDOmLjEHHPtiysq6AdhjsqF3272rkfNY5ffwPHPFpy9Y5 23dS6Lo02MlTBrnlPwfFLFW/+5ayZ/vwkEqto1LRfe+8V76sdMyNzCfWzb4N2Mmh5FBN DQAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761735806; x=1762340606; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=qLST4nbO7XQJlo58ilzjIY0Qa7R+94Om/iHFzEEXBS4=; b=d90zkkJGSdHbqYtqKYIwj0qQSLv8RfqQ81Wka7rjhaTcpbmMh5RQLbQIlWrgcokJ1F A/m5tb+Mb2aeHDq2tK09ta16kBa9vS3xZkmdJBbkY3XkKsgWQGTIL1D2g37oR5S7m9AB HpgfvOtTT70sOlcrrRoNZIL7JsffhW0J1kp2PvFSAjx94SHUeQCB00kKgqKrbpVK2Og2 X3+vFoZsdtUZ1k987DU1V/w+dOUSycv2ok90Pv9kf1pwzJNWPYYZtHPs2/UPP/7wg+G3 7YFt/loHeUoekgnXFmPaySsDrU4LcTNXPHCjOg6hlRwjT1MlBbaX42dymEwFHVrJD0qu WHMQ== X-Forwarded-Encrypted: i=1; AJvYcCXHsFga2THHqq42GSVKRz0AoWmjVC4mK9EUtEg/e01l05TmPvHt1pPCK5r6rTPejgBipvHzVZ5kwRGj06Eq@lists.postgresql.org X-Gm-Message-State: AOJu0YwyIRw/8Oq4E3EMqy3fbtYkuG5A0aLXxQajgNTKUjsh+oPkc47X TrwjiyRNIiBe80Pdxavnt0FFUsQRR4XMbqkXGV8Jt+SG4VM11b0o+ZyerwPIKhc9T6iyaXEQIFz 6KGmui0hHj3Ljh89eS/bAzE45iG6iDAw= X-Gm-Gg: ASbGnctcakJSdD57oOXhmTNbuL/4hA5LygKNTTNBIij3SdrjJchWKJirCp/0PoFYNtF UWhU6GE41gWkBCHRigoq/U+K3Yvb0I5EBf/id5otBEa6HRVyioPGckESfDcLdorYb53ma0URtg6 79D5s609N6hjCku6cpSxWz9FnXEaQu1w16//r+xhipu7CYqSs3YlPWtp9DCAsgQljSkVANADkqr ihgC1bzUCMm3dpGujMUQ5NV4cMUt8vL9GTAP3VoaYbpTll5yD4OawBQma/K4wDde7HM9w3Z+iaX 10DxPG8LhJgl3BWnTAE36gn/IUM= X-Google-Smtp-Source: AGHT+IEsydnqUO6NL/Vn/Ot2TImZNCd4V6m4QhsT2GK2qOhD3F/HtYlbeRto17p+uojah+UCbtaVsAnMl218WrDxQ18= X-Received: by 2002:a05:622a:a6cb:b0:4ed:18dd:fd4 with SMTP id d75a77b69052e-4ed18dd11cfmr8855961cf.19.1761735805797; Wed, 29 Oct 2025 04:03:25 -0700 (PDT) MIME-Version: 1.0 References: <2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl@mdyyqpqzxjek> In-Reply-To: From: Kirill Reshke Date: Wed, 29 Oct 2025 16:03:14 +0500 X-Gm-Features: AWmQ_blFv-nUCe5odfRhIq59sZoLUNtLpaHz9kqktDhoQS9E35E3fgq64MhgoF8 Message-ID: Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) To: Melanie Plageman Cc: Andres Freund , Robert Haas , Andrey Borodin , PostgreSQL Hackers , Heikki Linnakangas Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, 15 Oct 2025 at 04:27, Melanie Plageman wrote: > > Thanks so much for the review! I've addressed all your feedback except > what is commented on inline below. > I've gone ahead and committed the preliminary patches that you thought > were ready to commit. > > Attached v18 is what remains. > > 0001 - 0003: refactoring > 0004 - 0006: finish eliminating XLOG_HEAP2_VISIBLE > 0007 - 0009: refactoring > 0010: Set VM on-access > 0011: Set prune xid on insert > 0012: Some refactoring for discussion > > For 0001, I got feedback heap_page_prune_and_freeze() has too many > arguments, so I tried to address that. I'm interested to know if folks > like this more. > > 0011 still needs a bit of investigation to understand fully if > anything else in the index-killtuples test needs to be changed to make > sure we have the same coverage. > > 0012 is sort of WIP. I got feedback heap_page_prune_and_freeze() was > too long and should be split up into helpers. I want to know if this > split makes sense. I can pull it down the patch stack if so. > > Only 0001 and 0012 are optional amongst the refactoring patches. The > others are required to make on-access VM-setting possible or viable. > > On Thu, Oct 9, 2025 at 2:18=E2=80=AFPM Andres Freund = wrote: > > > > > @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record) > > > } > > > - action =3D XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, > > > - = (xlrec.flags & XLHP_CLEANUP_LOCK) !=3D 0, > > > - = &buffer); > > > - if (action =3D=3D BLK_NEEDS_REDO) > > > + if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, > > > + = (xlrec.flags & XLHP_CLEANUP_LOCK) !=3D 0, > > > + = &buffer) =3D=3D BLK_NEEDS_REDO) > > > { > > > Page page =3D BufferGetPage(buffer); > > > OffsetNumber *redirected; > > > > Why move it around this way? > > Because there will be an action for the visibility map > XLogReadBufferForRedoExtended(). I could have renamed it heap_action, > but it is being used only in one place, so I preferred to just cut it > to avoid any confusion. > > > > Advance the page LSN > > > + * only if the record could include an FPI, since recov= ery skips > > > + * records <=3D the stamped LSN. Otherwise it might ski= p an earlier FPI > > > + * needed to repair a torn page. > > > + */ > > > > This is confusing, should probably just reference the stuff we did in t= he > > !recovery case. > > I fixed this and addressed all your feedback related to this before commi= tting. > > > > + if (do_prune || nplans > 0 || > > > + ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHi= ntBitIsNeeded())) > > > + PageSetLSN(page, lsn); > > > + > > > /* > > > * Note: we don't worry about updating the page's pruna= bility hints. > > > * At worst this will cause an extra prune cycle to occ= ur soon. > > > */ > > > > Not your fault, but that seems odd? Why aren't we just doing the right = thing? > > The comment dates back to 6f10eb2. I imagine no one ever bothered to > fuss with extracting the XID. You could change > heap_page_prune_execute() to return the right value -- though that's a > bit ugly since it is used in normal operation as well as recovery. > > > I wonder if the VM specific redo portion should be in a common helper? = Might > > not be enough code to worry though... > > I think it might be more code as a helper at this point. > > > > @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, Bloc= kNumber blkno, Buffer buffer, > > > VACUUM_ERRCB_P= HASE_VACUUM_HEAP, blkno, > > > InvalidOffsetN= umber); > > > > > > + /* > > > + * Before marking dead items unused, check whether the page wil= l become > > > + * all-visible once that change is applied. > > > > So the function is named _would_ but here you say will :) > > I thought about it more and still feel that this function name should > contain "would". From vacuum's perspective it is "will" -- because it > knows it will remove those dead items, but from the function's > perspective it is hypothetical. I changed the comment though. > > > > + if (heap_page_would_be_all_visible(vacrel, buffer, > > > + = deadoffsets, num_offsets, > > > + = &all_frozen, &visibility_cutoff_xid)) > > > + { > > > + vmflags |=3D VISIBILITYMAP_ALL_VISIBLE; > > > + if (all_frozen) > > > + { > > > + vmflags |=3D VISIBILITYMAP_ALL_FROZEN; > > > + Assert(!TransactionIdIsValid(visibility_cutoff_= xid)); > > > + } > > > + > > > + /* Take the lock on the vmbuffer before entering a crit= ical section */ > > > + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE); > > > > It sure would be nice if we had documented the lock order between the h= eap > > page and the corresponding VM page anywhere. This is just doing what w= e did > > before, so it's not this patch's fault, but I did get worried about it = for a > > moment. > > Well, the comment above the visibilitymap_set* functions says what > expectations they have for the heap page being locked. > > > > +static bool > > > +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf, > > > + OffsetNumber= *deadoffsets, > > > + int ndeadoff= sets, > > > + bool *all_fr= ozen, > > > + TransactionI= d *visibility_cutoff_xid) > > > +{ > > > Page page =3D BufferGetPage(buf); > > > Hm, what about an assert checking that matched_dead_count =3D=3D ndeado= ffsets at > > the end? > > I was going to put an Assert(ndeadoffsets <=3D matched_dead_count), but > then I started wondering if there is a way we could end up with fewer > dead items than we collected during phase I. > > I had thought about if we dropped an index and then did on-access > pruning -- but we don't allow setting LP_DEAD items LP_UNUSED in > on-access pruning. So, maybe this is safe... I can do a follow-on > commit to add the assert. But I'm just not 100% sure I've thought of > all the cases where we might end up with fewer dead items. > > > > During vacuum's first and third phases, we examine tuples' visibility > > > to determine if we can set the page all-visible in the visibility map= . > > > > > > Previously, this check compared tuple xmins against a single XID chos= en at > > > the start of vacuum (OldestXmin). We now use GlobalVisState, which al= so > > > enables future work to set the VM during on-access pruning, since ord= inary > > > queries have access to GlobalVisState but not OldestXmin. > > > > > > This also benefits vacuum directly: GlobalVisState may advance > > > during a vacuum, allowing more pages to become considered all-visible= . > > > In the rare case that it moves backward, VACUUM falls back to OldestX= min > > > to ensure we don=E2=80=99t attempt to freeze a dead tuple that wasn= =E2=80=99t yet > > > prunable according to the GlobalVisState. > > > > It could, but it currently won't advance in vacuum, right? > > I thought it was possible for it to advance when calling > heap_prune_satisfies_vacuum() -> > GlobalVisTestIsRemovableXid()->...GlobalVisUpdate(). This case isn't > going to be common, but some things can cause us to update it. > > We have talked about explicitly updating GlobalVisState more often > during vacuums of large tables. But I was under the impression that it > was at least possible for it to advance during vacuum now. > > - Melanie Hi! First of all, I rechecked v18 patches, they still cause WAL bytes reduction. In a no-index vacuum case my result is a 39% reduction in WAL bytes. Almost like in your first message. Here are my comments about code, I may be very nitpicky in minor details, sorry for that In 0003: get_conflict_xid function logic is bit strange for me, it assigns conflict_xid to some value, but in the very end we have > + /* >+ * We can omit the snapshot conflict horizon if we are not pruning or >+ * freezing any tuples and are setting an already all-visible page >+ * all-frozen in the VM. In this case, all of the tuples on the page must >+ * already be visible to all MVCC snapshots on the standby. >+ */ >+ if (!do_prune && !do_freeze && >+ do_set_vm && blk_already_av && set_blk_all_frozen) > + conflict_xid =3D InvalidTransactionId; I feel like we should move this check to the beginning of the function, and just return InvalidTransactionId in that if cond. in 0004: > + if (old_vmbits =3D=3D new_vmbits) > + { > + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK); > + /* Unset so we don't emit WAL since no change occurred */ > + do_set_vm =3D false; > + } and then > END_CRIT_SECTION(); > + if (do_set_vm) > + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK); > + So, in the heap_page_prune_and_freeze function we release buffer lock both inside and outside the crit section. As I understand, this is actually safe. I also looked in other xlog coding practices for other access methods (GiST, GIN, ....), and I can see that some of them release buffers before leaving crit sections and some of them after. But I still suggest to be in sync with 'Write-Ahead Log Coding' section of src/backend/access/transam/README, which says: 6. END_CRIT_SECTION() 7. Unlock and unpin the buffer(s). Let's be consistent in this at least in this single function context. In 0010: I'm not terribly convenient that adding SO_ALLOW_VM_SET to TAM ScanOptions is the right thing to do. Looks like VM bits are something that make sense for HEAP AM for not for any TAM. So, don't we break some layer of abstraction here? Would it be better for HEAP AM to set some flags in heap_beginscan? Overall 0001-0003 are mostly fine for me, 0004-0006 are the right thing to do IMHO, but maybe they need some more review from hackers. Other patches i did not review in a great detail, will return to this later --=20 Best regards, Kirill Reshke