Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vyL8J-0002GI-2I for pgsql-hackers@arkaria.postgresql.org; Fri, 06 Mar 2026 02:41:43 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vyL8G-002K8i-2m for pgsql-hackers@arkaria.postgresql.org; Fri, 06 Mar 2026 02:41:41 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vyL8G-002K8a-1S for pgsql-hackers@lists.postgresql.org; Fri, 06 Mar 2026 02:41:41 +0000 Received: from mail-pl1-x630.google.com ([2607:f8b0:4864:20::630]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vyL8E-00000000hhy-1ShC for pgsql-hackers@lists.postgresql.org; Fri, 06 Mar 2026 02:41:39 +0000 Received: by mail-pl1-x630.google.com with SMTP id d9443c01a7336-2ae4988e039so36236745ad.1 for ; Thu, 05 Mar 2026 18:41:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1772764897; x=1773369697; darn=lists.postgresql.org; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Hz+CpFjVimr2+52suHqdp3Y5/0dmVFSPudPTf59SOwE=; b=C2LVqEkx3enHvi1epK5CLI7AYRS2dK4S3f13H9RY5DgEgUXbS0qDXwbXIL+5Vcig4K gyL0E/LktLpPbRogv+HCA/DuVwVNx7EN4A9q0w9dQO+AexDxj7efhrjOVxtborXT2fDB KbZN19z1h0OAfpAjhBdN25TpQ1+lp1elJeZ3lH44/fARuE7K75FF8tiYem7EiYGIE2C4 2AV+SPMRbqdi3c0CHgVKOnkstegsX3xuWD/0c8jB6rxhrlP5tpAiBE1+XHVIyIXF9/X6 CG1Je+Devw1OQI1G1ownBEtMxhi32dsDANAE1u/yC2LjaV18JPLhnSHQUZ53u8bnhldI lYRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1772764897; x=1773369697; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:x-gm-gg:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=Hz+CpFjVimr2+52suHqdp3Y5/0dmVFSPudPTf59SOwE=; b=pqzfAJrHCveN1JaRFS5bSh0+ivuymHDeNjZv8NjwZ1u8i91G9Olm2FIcsYqGNjfjoV qut8SnJbh0H5Wavn9s96Ky7Yr+PHsKJbJ2ih+HIwo+QmuhTstC18hHS/B5bwQ3jiKoDp QXl/fB+jQeDwkAvYV9kKgVCqhrKU8QbtrD3tj7yXzG3Uee+wI1g0edAkZbDfbPobQMEF ECtHKQOJZCkT31lwkkAGUIXwP4YXUGaxQX4fvmuUgdz40Xw289Lv6NvXEvf5oNDlXUmU b0Bc8XXc7DpUVAgxRHMcydCH7oR0cISKdZXfPZPny6/5VaEas1GMDZJxAgF2c3VVuWXQ QScg== X-Forwarded-Encrypted: i=1; AJvYcCVNfLSFDjpX3Tm8xsFmEvNB437ZPGHhJcGC3NOOJniZ4cZSDCJkvX4tozTQ/Pq4RKhaeXVBYdi+7stAhbJV@lists.postgresql.org X-Gm-Message-State: AOJu0YymlMhmRx8wd5y0N/NNfqOs0sxd0Pm/3KoLoGq9skB8XKyPOcNR lOMnC+nEGue7qqub3NBfXdMHy00hk7pdsw5pk9EIuDW2QbLMYC9zqMNx X-Gm-Gg: ATEYQzyhFgpA4Pb9re0zB/zo4FN6bPirNrXZIOIA/Jm7KhxoxW3+xlUsFLUAs+ySbUY hY3VstOc4bOdRMkMaqAHlKi7tzZMEC5O+zA3e9uY05k6yQZX3/2ue2ddC1lAZafAHgYhNmS4Cn7 BiNwarfcyJ+TyeUVUIuinOMdszsUTPK/uqNqyVBnchujIXgIDFG+Pl09Iaw4rQ+AiwKxo0XgP2g dBbfmrsa+98i152qvfX6P46cWQO2xskTmo0RAM7Y++mudwZP+e91KVsxUX/IBeW91Vea/hlKBJp kuXNNUFm5NhhWAOwuway/Q+oJSkcO2mVbNzVu+9cqtoizkd5E7FEhpZo4MRx31vN3rwMn9YdFgj kcwQt/vYkFmonPNdIiDZ1m/QvPgdD0z3oOzJzufTM0jwu8Bjyff3C9QLMMrh3Ufj/JtLs+ghVig Ou5JWbl029/ANTLoTVb88edzxT9SnwygE= X-Received: by 2002:a17:902:dacb:b0:2ae:5a38:96bb with SMTP id d9443c01a7336-2ae75b0a0demr46236495ad.2.1772764896868; Thu, 05 Mar 2026 18:41:36 -0800 (PST) Received: from smtpclient.apple ([45.32.121.103]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2ae83e97924sm1439955ad.27.2026.03.05.18.41.31 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 05 Mar 2026 18:41:36 -0800 (PST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.400.21\)) Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) From: Chao Li In-Reply-To: <81DAFACF-7D55-4A84-ACB0-0425D1669DB4@gmail.com> Date: Fri, 6 Mar 2026 10:40:53 +0800 Cc: Andres Freund , Andrey Borodin , Kirill Reshke , Xuneng Zhou , Robert Haas , PostgreSQL Hackers , Heikki Linnakangas Content-Transfer-Encoding: quoted-printable Message-Id: References: <6BC5DBAB-6084-4BB8-8450-52E9648AB021@gmail.com> <7F5BCD7A-764D-4D8D-8E27-6F2CAAEA1CEE@gmail.com> <4379FDA3-9446-4E2C-9C15-32EFE8D4F31B@yandex-team.ru> <7ib3sa55sapwjlaz4sijbiq7iezna27kjvvvar4dpgkmadml6t@gfpkkwmdnepx> <81DAFACF-7D55-4A84-ACB0-0425D1669DB4@gmail.com> To: Melanie Plageman X-Mailer: Apple Mail (2.3864.400.21) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk > On Mar 5, 2026, at 16:52, Chao Li wrote: >=20 >=20 >=20 >> On Mar 4, 2026, at 16:59, Chao Li wrote: >>=20 >>=20 >>=20 >>> On Mar 3, 2026, at 23:52, Melanie Plageman = wrote: >>>=20 >>>=20 >>>> Otherwise, if prstate->pagefrz.FreezePageConflictXid is still = possibly be InvalidTransactionId, then the Assert should be changed to = something like: >>>>=20 >>>> Assert(prstate->pagefrz.FreezePageConflictXid =3D=3D = InvalidTransactionId || >>>> = TransactionIdPrecedesOrEquals(prstate->pagefrz.FreezePageConflictXid, = prstate->cutoffs->OldestXmin) >>>=20 >>> This is covered by TransactionIdPrecedesOrEquals because >>> InvalidTransactionId is 0. We assume that in many places throughout >>> the code. >>>=20 >>=20 >> I understood that TransactionIdPrecedesOrEquals(InvalidTransactionId, = prstate->cutoffs->OldestXmin) is true, but that would leave an = impression to code readers that prstate->pagefrz.FreezePageConflictXid = could not be InvalidTransactionId. Thus I think my version explicitly = tells that prstate->pagefrz.FreezePageConflictXid could be = InvalidTransactionId at the point. >>=20 >>=20 >>>> I will continue with 0005 tomorrow. >>>=20 >>=20 >> 4 - 0005 >> ``` >> * Caller must have pin on the buffer, and must *not* have a lock on = it. >> */ >> void >> -heap_page_prune_opt(Relation relation, Buffer buffer) >> +heap_page_prune_opt(Relation relation, Buffer buffer, Buffer = *vmbuffer) >> ``` >>=20 >> I don=E2=80=99t see why vmbuffer has to be of pointer type. Buffer = type is underlying int, I checked the last commit, vmbuffer only passes = in data into the function without passing out anything. >>=20 >> As we add the new parameter vmbuffer, though it=E2=80=99s not used in = this commit, I think it=E2=80=99d be better to update the header commit = to explain what this parameter will do. >>=20 >> 5 - 0006 >> ``` >> + * >> + * heap_fix_vm_corruption() makes changes to the VM and, = potentially, the heap >> + * page, but it does not need to be done in a critical section = because >> + * clearing the VM is not WAL-logged. >> + */ >> +static void >> +heap_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum) >> ``` >>=20 >> Nit: why the last paragraph of the header comments uses the function = name instead of =E2=80=9Cthis function=E2=80=9D? Looks like a = copy-pasto. >>=20 >> 6 - 0006 >> ``` >> + if (prstate->lpdead_items > 0) >> + { >> + ereport(WARNING, >> + (errcode(ERRCODE_DATA_CORRUPTED), >> + errmsg("LP_DEAD item found on page marked as all-visible"), >> + errdetail("relation \"%s\", page %u, tuple %u", >> + RelationGetRelationName(prstate->relation), >> + prstate->block, offnum))); >> + } >> + else >> + { >> + ereport(WARNING, >> + (errcode(ERRCODE_DATA_CORRUPTED), >> + errmsg("tuple not visible to all found on page marked as = all-visible"), >> + errdetail("relation \"%s\", page %u, tuple %u", >> + RelationGetRelationName(prstate->relation), >> + prstate->block, offnum))); >> + } >> ``` >>=20 >> I recently just learned that a detail message should use complete = sentences, and end each with a period, and capitalize the first word of = sentences. See = https://www.postgresql.org/docs/current/error-style-guide.html. >>=20 >> 7 - 0006 >> ``` >> + else if (prstate->vmbits & VISIBILITYMAP_VALID_BITS) >> + { >> + /* >> + * As of PostgreSQL 9.2, the visibility map bit should never be set = if >> + * the page-level bit is clear. However, it's possible that the bit >> + * got cleared after heap_vac_scan_next_block() was called, so we = must >> + * recheck with buffer lock before concluding that the VM is = corrupt. >> + */ >> + ereport(WARNING, >> + (errcode(ERRCODE_DATA_CORRUPTED), >> + errmsg("page %u in \"%s\" is not marked all-visible but visibility = map bit is set", >> + prstate->block, >> + RelationGetRelationName(prstate->relation)))); >> + } >> ``` >>=20 >> The comment says =E2=80=9Cwe must recheck with buffer lock = before=E2=80=A6=E2=80=9D, but it only log a warning message. Is the = comment stale? >>=20 >> 8 - 0007 >> ``` >> +static void >> +heap_page_bypass_prune_freeze(PruneState *prstate, PruneFreezeResult = *presult) >> +{ >> + OffsetNumber maxoff =3D PageGetMaxOffsetNumber(prstate->page); >> + Page page =3D prstate->page; >> + >> + Assert(prstate->vmbits & VISIBILITYMAP_ALL_FROZEN || >> + (prstate->vmbits & VISIBILITYMAP_ALL_VISIBLE && >> + !prstate->attempt_freeze)); >> + >> + /* We'll fill in presult for the caller */ >> + memset(presult, 0, sizeof(PruneFreezeResult)); >> + >> + /* >> + * Since the page is all-visible, a count of the normal ItemIds on = the >> + * page should be sufficient for vacuum's live tuple count. >> + */ >> + for (OffsetNumber off =3D FirstOffsetNumber; >> + off <=3D maxoff; >> + off =3D OffsetNumberNext(off)) >> + { >> + if (ItemIdIsNormal(PageGetItemId(page, off))) >> + prstate->live_tuples++; >> + } >> + >> + presult->live_tuples =3D prstate->live_tuples; >> + >> + /* Clear any stale prune hint */ >> + if (TransactionIdIsValid(PageGetPruneXid(page))) >> + { >> + PageClearPrunable(page); >> + MarkBufferDirtyHint(prstate->buffer, true); >> + } >> + >> + presult->vmbits =3D prstate->vmbits; >> + >> + if (!PageIsEmpty(page)) >> + presult->hastup =3D true; >> +} >> ``` >>=20 >> * Given this function has done PageIsEmpty(page), that that is true, = we don=E2=80=99t need to count live_tuples, right? That could be a tiny = optimization. >> * I see heap_page_bypass_prune_freeze() is only called in one place = and immediately after prune_freeze_setup() and heap_fix_vm_corruption(), = so prstate->vmbits must be 0, so do we need to do presult->vmbits =3D = prstate->vmbits;? >> * Do we need to set all_visible and all_frozen to presult? >>=20 >> 0008 LGTM >>=20 >> I will continue with 0009 tomorrow. >>=20 >=20 > 9 - 0009 > =C2=B7=C2=B7=C2=B7 > + * Currently, only VACUUM performs freezing, but other callers may = in the > + * future. Other callers must initialize prstate.all_frozen to = false, > =C2=B7=C2=B7=C2=B7 >=20 > Nit: prstate.all_frozen -> prstate.set_all_frozen >=20 > I saw you have fixed this in 0010, but I think it=E2=80=99s better = also fix it here. >=20 > 10 - 0010 > ``` > + * Whether or not the page was newly set all-visible and all-frozen = during > + * phase I of vacuuming. > */ > - uint8 vmbits; > + BlockNumber new_all_visible_pages; > + BlockNumber new_all_visible_frozen_pages; > + BlockNumber new_all_frozen_pages; > ``` >=20 > These 3 fields are actually counts rather than pointers to blocks, = using type BlockNumber are quite confusing, though underlying = BlockNumber is uint32. I think they can be just int type. >=20 > 11 - 0010 > ``` > + BlockNumber new_all_visible_pages; > + BlockNumber new_all_visible_frozen_pages; > + BlockNumber new_all_frozen_pages; > ``` >=20 > I don=E2=80=99t see where these 3 fields are initialized. In = lazy_scan_prune(), presult is defined as: > ``` > PruneFreezeResult presult; > ``` > So, those fields will hold random values. >=20 > 12 - 0010 > ``` > + * conflict would ahve been handled in reaction to the WAL record = freezing > ``` >=20 > Nit: ahve -> have >=20 > 0011 LGTM >=20 > 13 - 0012 - bufmask.c > ``` > + * we don't mark the page all-visible. See = heap_xlog_prune_and_freeze() > + * for more details. > ``` >=20 > I don=E2=80=99t find a function named heap_xlog_prune_and_freeze(). >=20 > 14 - 0012 - heapam_xlog.c > ``` > + * same approach is taken when replaying XLOG_HEAP2_PRUNE* records = (see > + * heap_xlog_prune_and_freeze()). > ``` >=20 > Same as 13. >=20 > 0013 LGTM >=20 > I will try to finish the rest 5 commits tomorrow. >=20 15 - 0014 - execMain.c ``` @@ -3027,6 +3035,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan = *planTree) rcestate->es_range_table_size =3D = parentestate->es_range_table_size; rcestate->es_relations =3D parentestate->es_relations; rcestate->es_rowmarks =3D parentestate->es_rowmarks; + rcestate->es_modified_relids =3D = parentestate->es_modified_relids; ``` Here it just assigns the BMS pointer to rcestate->es_modified_relids. I = am not sure if further bms_add_member() will still happen, if yes, it = might be safer to do bms_copy(parentestate->es_modified_relids), because = a further bms_add_member() may cause a new memory allocated and the old = pointer stale. 16 - 0014 - execUtils.c ``` for (rti =3D 1; rti <=3D estate->es_range_table_size; rti++) ``` Nit: I have seen several recent commits that performed cleanups to = switch to use for loop var like: ``` for (Index rti =3D 1; rti <=3D estate->es_range_table_size; rti++) ``` 17 - 0015 The commit message subject line says =E2=80=9CMake begin_scan() = functions take a flags argument=E2=80=9D, where begin_scan() seems = inaccurate, for example, table_index_fetch_begin() is not =E2=80=9Cbegin = scan=E2=80=9D. Otherwise 0015 LGTM. 18 - 0016 - tableam.h ``` /* unregister snapshot at scan end? */ SO_TEMP_SNAPSHOT =3D 1 << 9, + /* set if the query doesn't modify the relation */ + SO_HINT_REL_READ_ONLY =3D 1 << 10, } ScanOptions; ``` Nit: maybe add an empty line before the new flag. 19 - 0017 - heapam_handler.c ``` @@ -147,7 +147,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData = *scan, */ if (prev_buf !=3D hscan->xs_cbuf) heap_page_prune_opt(hscan->xs_base.rel, = hscan->xs_cbuf, - = &hscan->xs_vmbuffer); + = &hscan->xs_vmbuffer, + = hscan->modifies_base_rel); ``` This feels like a bug. heap_page_prune_opt takes the first parameter = rel_read_only, but hscan->modifies_base_rel means not read-only, so here = we should use =E2=80=9C!hscan->modifies_base_rel=E2=80=9D. Oh, when I read back your previous email, you have found this bug. 20 - 0018 In heap_insert(), you do: ``` + if (TransactionIdIsNormal(xid) && !(options & = HEAP_INSERT_FROZEN)) + PageSetPrunable(page, xid); ``` But in heap_multi_insert(), you do: ``` + if (!all_frozen_set && TransactionIdIsNormal(xid)) + PageSetPrunable(page, xid); ``` Is the option check " !(options & HEAP_INSERT_FROZEN))=E2=80=9D also = needed by heap_multi_insert? ~~ Done of this round review ~~ Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/