Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uwR0i-0026yH-9b for pgsql-hackers@arkaria.postgresql.org; Wed, 10 Sep 2025 20:01:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uwR0g-003FDw-DB for pgsql-hackers@arkaria.postgresql.org; Wed, 10 Sep 2025 20:01:42 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uwR0g-003FDm-2N for pgsql-hackers@lists.postgresql.org; Wed, 10 Sep 2025 20:01:42 +0000 Received: from mail-ej1-x633.google.com ([2a00:1450:4864:20::633]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uwR0c-0003yM-0T for pgsql-hackers@lists.postgresql.org; Wed, 10 Sep 2025 20:01:42 +0000 Received: by mail-ej1-x633.google.com with SMTP id a640c23a62f3a-b042eb09948so1403514866b.3 for ; Wed, 10 Sep 2025 13:01:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1757534499; x=1758139299; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=DMcnb/j+chvtwtDPU7TdBDRV7fppT5bg9TyrwgIg1n0=; b=Gy/r01OzUJNsYI9Q0MH/1pZyCq1JCDxgawm0uXHnCQaIfVa15ELNi5jzaEvB6CH0nX POzKrepG2jCopQv1fsjpLVpwUupG6gKk9lSgRljC458QIRZg8KgdAFcXX9pqT/r4Nn9G Av5Y8iuZFSCpBpzzUOxpzbxZ05h9qKP8j75bLOC2KlkLK9BX7olCEoMqFs8hjvj9fOJs 8St9pxGzqztS3iAXGRPSE/9B0esOFx4/9+VMyJJDVaWPpfeTQzdgDiBVSxEXbS6mgpln 6zDTteu5OyI4N3SjaLJVOkZ4nWjZpvcEYDx98CAb/mQY/x4OLa2O8/A9ne0d5qn/Q9pS jGGw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1757534499; x=1758139299; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=DMcnb/j+chvtwtDPU7TdBDRV7fppT5bg9TyrwgIg1n0=; b=Ej1061MQdNOjGf4YUEFVBcG5hTWd3blRlhYr6XTbZf313k9RTRbcVKUAco9C2sADXz ng71qmy2JHjx41amjocmswuQRrA93GDEVl+HEExibEIKWFRy8qJdBeLoR0ZiCCv58W37 IGqFF/TawC4B8R/U6mC9GJ3s1JQsjlrjFz2DZGH6MN7lDX1RJML2P/xg+HjF4OVC7gXw pL+7iH/qZKewoWtkMwZrRKIMDLd4GaGZFw0tsZtqcLBT1C8S6rXz8jMLZgzDwaGUFyML cWh8ICFoZ4HHDlGgfkCUmfY37a3r2ARei9VoyoaaGAqWYvWv5rsvVfLG8zdfL8/KlUaC /T6Q== X-Forwarded-Encrypted: i=1; AJvYcCXOLVOo/fHpuclWJ0qZ9FUcM/RdiyY73TxmAz71sz7ksKbswjkhzLyKgs1ccUEcrAHqcOR6sHk7s2zTdgiN@lists.postgresql.org X-Gm-Message-State: AOJu0YxCLHVoFL9AlNT+rmygnHn1JjwziWzSftMnb80pdft/t3163Tiv CIL4H9O4b74xkBwM0nPJ9n2UyOTKX2j+3RWBiGf75gcR0d6VdeB8Uv1L0vD6UiwRY3UlvlOii5s vt2Eww+k91Zz8Oh/CzPhdEZx7a+6RVys= X-Gm-Gg: ASbGncsCa3HRwgaxU0QHTG+aiovBQRoN/ptM5ujmPsTEZvirdwsG8OnBuyaK19HHVEa STAfOZ3MDo3OjbdIn7bIDCObeLH+wcWP/p5Gf2D36zRP+UweGBDSRRgAYywn3z+mrZ8mmCY2FhV KOFyJRw7wCW0nB9RQLA7tM0xMqDFxsrC0334hGhgc0dkyXdaO1KURn5bBpMoxq6epOdAloHnJhn PSeVJu9elMdNQ09EInD X-Google-Smtp-Source: AGHT+IHjC/6RETLHYm5K+pVIKaOo9EzbtdZoVa0Ct+YvIJRVjwzm83NtarP1vdnHiPsXaUEgQIqu/4Rm0+6LXunjROo= X-Received: by 2002:a17:907:6e92:b0:b04:54dc:3048 with SMTP id a640c23a62f3a-b04b148f4damr1620834866b.21.1757534498418; Wed, 10 Sep 2025 13:01:38 -0700 (PDT) MIME-Version: 1.0 References: <87DD95AA-274F-4F4F-BAD9-7738E5B1F905@yandex-team.ru> In-Reply-To: From: Robert Haas Date: Wed, 10 Sep 2025 16:01:25 -0400 X-Gm-Features: AS18NWApfXxnt0z8yPoiRZ0Gd_-tFSy9yF3Vbh1mOlYKDLzDHnatgTfzCngDsJw Message-ID: Subject: Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access) To: Melanie Plageman Cc: Andres Freund , Kirill Reshke , Andrey Borodin , PostgreSQL Hackers , Heikki Linnakangas Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Sep 9, 2025 at 7:08=E2=80=AFPM Melanie Plageman wrote: > Fair. I've introduced new XLHP flags in attached v13. Hopefully it > puts an end to the horror. I suggest not renumbering all of the existing flags and just adding these new ones at the end. Less code churn and more likely to break in an obvious way if you mix up the two sets of flags. More on 0002: + set_heap_lsn =3D XLogHintBitIsNeeded() ? true : set_heap_lsn; Maybe just if (XLogHintBitIsNeeded) set_heap_lsn =3D true? I don't feel super-strongly that what you've done is bad but it looks weird to my eyes. + * If we released any space or line pointers or will be setting a page in + * the visibility map, measure the page's freespace to later update the "setting a page in the visibility map" seems a little muddled to me. You set bits, not pages. + * all-visible (or all-frozen, depending on the vacuum mode,) which is This comma placement is questionable. /* + * Note that the heap relation may have been dropped or truncated, leading + * us to skip updating the heap block due to the LSN interlock. However, + * even in that case, it's still safe to update the visibility map. Any + * WAL record that clears the visibility map bit does so before checking + * the page LSN, so any bits that need to be cleared will still be + * cleared. + * + * Note that the lock on the heap page was dropped above. In normal + * operation this would never be safe because a concurrent query could + * modify the heap page and clear PD_ALL_VISIBLE -- violating the + * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in + * the VM is set. + * + * In recovery, we expect no other writers, so writing to the VM page + * without holding a lock on the heap page is considered safe enough. It + * is done this way when replaying xl_heap_visible records (see */ How many copies of this comment do you plan to end up with? The comment for log_heap_prune_and_freeze seems to be anticipating future w= ork. > > 0004. It is not clear to me why you need to get > > log_heap_prune_and_freeze to do the work here. Why can't > > log_newpage_buffer get the job done already? > > Well, I need something to emit the changes to the VM. I'm eliminating > all users of xl_heap_visible. Empty pages are the ones that benefit > the least from switching from xl_heap_visible -> xl_heap_prune. But, > if I don't transition them, we have to maintain all the > xl_heap_visible code (including visibilitymap_set() in its long form). > > As for log_newpage_buffer(), I could keep it if you think it is too > confusing to change log_heap_prune_and_freeze()'s API (by passing > force_heap_fpi) to handle this case, I can leave log_newpage_buffer() > there and then call log_heap_prune_and_freeze(). > > I just thought it seemed simple to avoid emitting the new page record > and the VM update record, so why not -- but I don't have strong > feelings. Yeah, I'm not sure what the right thing to do here is. I think I was again experiencing brain fade by forgetting that there is a heap page and a VM page and, of course, log_heap_newpage() probably isn't going to touch the latter. So that makes sense. On the other hand, we could only have one type of WAL record for every single operation in the system if we gave it enough flags, and force_heap_fpi seems suspiciously like a flag that turns this into a whole different kind of WAL record. > > 0005. It looks a little curious that you delete the > > identify-corruption logic from the end of the if-nest and add it to > > the beginning. Ceteris paribus, you'd expect that to be worse, since > > corruption is a rare case. > > On master, the two corruption cases are sandwiched between the normal > VM set cases. And I actually think doing it this way is brittle. If > you put the cases which set the VM first, you have to have completely > bulletproof the if statements guarding them to foreclose any possible > corruption case from entering because otherwise you will overwrite the > corruption you then try to detect. Hmm. In the current code, we first test (!all_visible_according_to_vm && presult.all_visible), then (all_visible_according_to_vm && !PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) !=3D 0), and then (presult.lpdead_items > 0 && PageIsAllVisible(page)). The first and second can never coexist, because they require opposite values of all_visible_according_to_vm. The second and third cannot coexist because they require opposite values of PageIsAllVisible(page). It is not entirely obvious that the first and third tests couldn't both pass, but you'd have to have presult.all_visible and presult.lpdead_items > 0, and it's a bit hard to see how heap_page_prune_and_freeze() could ever allow that. Consider: if (prstate.all_visible && prstate.lpdead_items =3D=3D 0) { presult->all_visible =3D prstate.all_visible; presult->all_frozen =3D prstate.all_frozen; } else { presult->all_visible =3D false; presult->all_frozen =3D false; } ... presult->lpdead_items =3D prstate.lpdead_items; So I don't really think I'm persuaded that the current way is brittle. But that having been said, I agree with you that the order of the checks is kind of random, and I don't think it really matters that much for performance. What does matter is clarity. I feel like what I'd ideally like this logic to do is say: do we want the VM bit for the page to be set to all-frozen, just all-visible, or neither? Then push the VM bit to the correct state, dragging the page-level bit along behind. And the current logic sort of does that. It's roughly: 1. Should we go from not-all-visible to either all-visible or all-frozen? If yes, do so. 2. Should we go from either all-visible or all-frozen to not-all-visible? If yes, do so. 3. Should we go from either all-visible or all-frozen to not-all-visible for a different reason? If yes, do so. 4. Should we go from all-visible to all-frozen? If yes, do so. But what's weird is that all the tests are written differently, and we have two different reasons for going to not-all-visible, namely PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only one test for each of the other state-transitions, because the decision-making for those cases is fully completed at an earlier stage. I would kind of like to see this expressed in a way that first decides which state transition to make (forward-to-all-frozen, forward-to-all-visible, backward-to-all-visible, backward-to-not-all-visible, nothing) and then does the corresponding work. What you're doing instead is splitting half of those functions off into a helper function while keeping the other half where they are without cleaning up any of the logic. Now, maybe that's OK: I'm far from having grokked the whole patch set. But it is not any more clear than what we have now, IMHO, and perhaps even a bit less so. --=20 Robert Haas EDB: http://www.enterprisedb.com