Feedback-ID: id4a34324:Fastmail
Date: Mon, 2 Feb 2026 19:33:27 -0500
From: Andres Freund <andres@anarazel.de>
To: David Rowley <dgrowleyml@gmail.com>
Cc: Chao Li <li.evan.chao@gmail.com>, 
	PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
Subject: Re: More speedups for tuple deformation
Message-ID: <v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us>
References: <CAApHDvqhbJU_-yF3Hbf4VhX33qXtpeYv3MsvMPDMcDwGGLr9ZQ@mail.gmail.com>
 <rbskhk7scqbxqnaw4o6nh6na2ffcclg3cxn4d4cn5jfr2z7vv3@kadtz65meesb>
 <CAApHDvpDxDFatUskuOfuM7A3VESrx8U7MtYnU_HiB0QLAg94zg@mail.gmail.com>
 <pmik622adey6fnddivkt4uvkulvnc6rasmq3tcbrzeglx4hsn7@f3x6e2eph3w5>
 <rvlc7pb6zn4kydqovcqh72lf2qfcgs3qkj2seq7tcpvxyqwtqt@nrvv6lpehwwa>
 <CAApHDvo1i-ycAcWnK3L7ZASTuM8mW46kvRqMaUHD46HSuJmx7A@mail.gmail.com>
 <p2oo6g2fg5nxbsspbyeauxhbyyi3zsx3tt4prkgju3j5qjo7lf@rjlhxusxzam2>
 <CAApHDvpbntG7V3_EsZ+w-V=jU-y8rFmv9RB1EDJm4sxKno-4UA@mail.gmail.com>
 <e7sto7tk5dk5hfyvoocaddnxcngemcmfvbuh23l32w5cssaizy@znuphjqug7qe>
 <CAApHDvpuEbhvH1ViCZRz5vks+_bGbEnPoEdZYAZXK76_isb_+Q@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAApHDvpuEbhvH1ViCZRz5vks+_bGbEnPoEdZYAZXK76_isb_+Q@mail.gmail.com>
Archived-At: <https://www.postgresql.org/message-id/v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl%404rry7ixwr4us>
Precedence: bulk

Hi,

On 2026-02-01 00:27:02 +1300, David Rowley wrote:
> On Sat, 31 Jan 2026 at 06:11, Andres Freund <andres@anarazel.de> wrote:
> > This is why I like the idea of keeping track of whether we can rely on NOT
> > NULL columns to be present (I think that means we're evaluating expressions
> > other than constraint checks for new rows). It allows the leading NOT NULL
> > fixed-width columns to be decoded without having to wait for a good chunk of
> > the computations above. That's a performance boon even if we later have
> > nullable or varlength columns.
> 
> I can look into this. As we both know, we can't apply this
> optimisation in every case as there are places in the code which form
> then deform tuples before NOT NULL constraints are checked.

Right.


> Perhaps the slot can store a flag to mention if the optimisation is valid to
> apply or not. It doesn't look like the flag can be part of the TupleDesc
> since we cache those in relcache.

I wonder if we should do it the other way round - use a special flag (and
perhaps tuple descriptor) iff we are evaluating "unsanitizes" tuples,
i.e. ones where the NOT NULLness might not yet be correct.


> I'm imagining that TupleDescFinalize() calculates another field which could
> be the max cached offset that's got a NOT NULL constraint and isn't
> attmissing. I think this will need another dedicated loop in
> slot_deform_heap_tuple() to loop up to that attribute before doing the
> firstNonCacheOffsetAttr loop.

I was imagining that we'd use the new value to enter the
firstNonCacheOffsetAttr loop without having to depend on
HeapTupleHeaderGetNatts() & HeapTupleHasNulls(). I.e. just use it to avoid the
dependency on having to have completed the memory fetch for the header.


> > > > Have you experimented setting isnull[] in a dedicated loop if there are nulls
> > > > and then in this loop just checking isnull[attnum]? Seems like that could
> > > > perhaps be combined with the work in first_null_attr() and be more efficient
> > > > than doing an att_isnull() separately for each column.
> > >
> > > Yes. I experiment with that quite a bit. I wasn't able to make it any
> > > faster than setting the isnull element in the same loop as the
> > > tts_values element. What I did try was having a dedicated tight loop
> > > like; for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
> > > but the compiler would always try to optimise that into an inlined
> > > memset which would result in poorly performing code in cases with a
> > > small number of columns due to the size and alignment prechecks.
> >
> > Yea, that kind of transformation is pretty annoying and makes little sense
> > here :(.
> >
> > I was thinking of actually computing the value of isnull[] based on the null
> > bitmap (as you also try below).
> 
> I've taken the code you posted in [1] to do this. Thanks for that. It
> works very well.

Nice!


> I made it so the tts_isnull array size is rounded up to the next multiple of
> 8.

Right, that's what I assumed we'd need.


> I've attached 3 graphs, which are now looking a bit better. The gcc
> results are not quite as good. There's still a small regression with 0
> extra column test, and overall, the results are not as impressive as
> clang's. I've not yet studied why.

I suspect it's due to gcc thinking it'd be a good idea to vectorize the
loop. I saw that happening on godbolt.

Are your results better if you use

#if defined(__clang__)
#define pg_nounroll _Pragma("clang loop unroll(disable)")
#define pg_novector _Pragma("clang loop vectorize(disable)")
#elif defined(__GNUC__)
#define pg_nounroll _Pragma("GCC unroll 0")
#define pg_novector _Pragma("GCC novector")
#else
#define pg_nounroll
#define pg_novector _Pragma("loop( no_vector )")
#endif

and put "pg_nounroll pg_novector" before the loop in populate_isnull_array()?
That improves both gcc and clang code generation substantially for me, but
with a considerably bigger improvement for gcc.

Compiler   Opt      Isns
gcc        -O2      165
clang      -O2      135
gcc        -O3      532
clang      -O3      135

Preventing vectorization & unrolling:
gcc        -O2      26
clang      -O2      25
gcc        -O3      26
clang      -O3      25


It's somewhat scary to see this level of code size increase in a case where
the compiler really has no information to think vectorizing really is
beneficial...


I'd expect it makes sense to combine the loop for first_null_attr() with the
one for populate_isnull_array().  It might also prevent gcc from trying to
vectorize the loop...


Greetings,

Andres Freund