Feedback-ID: id4a34324:Fastmail
Date: Fri, 30 Jan 2026 15:03:27 -0500
From: Andres Freund <andres@anarazel.de>
To: David Rowley <dgrowleyml@gmail.com>
Cc: Chao Li <li.evan.chao@gmail.com>, 
	PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
Subject: Re: More speedups for tuple deformation
Message-ID: <teu6emoapawtl4ovxb52lsrdanri7yjem7kgxpkldwl57w3s2a@rjmry6wne5xy>
References: <9A17C43D-7A28-4885-8974-555A40C9523E@gmail.com>
 <CAApHDvqhbJU_-yF3Hbf4VhX33qXtpeYv3MsvMPDMcDwGGLr9ZQ@mail.gmail.com>
 <rbskhk7scqbxqnaw4o6nh6na2ffcclg3cxn4d4cn5jfr2z7vv3@kadtz65meesb>
 <CAApHDvpDxDFatUskuOfuM7A3VESrx8U7MtYnU_HiB0QLAg94zg@mail.gmail.com>
 <pmik622adey6fnddivkt4uvkulvnc6rasmq3tcbrzeglx4hsn7@f3x6e2eph3w5>
 <rvlc7pb6zn4kydqovcqh72lf2qfcgs3qkj2seq7tcpvxyqwtqt@nrvv6lpehwwa>
 <CAApHDvo1i-ycAcWnK3L7ZASTuM8mW46kvRqMaUHD46HSuJmx7A@mail.gmail.com>
 <p2oo6g2fg5nxbsspbyeauxhbyyi3zsx3tt4prkgju3j5qjo7lf@rjlhxusxzam2>
 <CAApHDvpbntG7V3_EsZ+w-V=jU-y8rFmv9RB1EDJm4sxKno-4UA@mail.gmail.com>
 <e7sto7tk5dk5hfyvoocaddnxcngemcmfvbuh23l32w5cssaizy@znuphjqug7qe>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <e7sto7tk5dk5hfyvoocaddnxcngemcmfvbuh23l32w5cssaizy@znuphjqug7qe>
Archived-At: <https://www.postgresql.org/message-id/teu6emoapawtl4ovxb52lsrdanri7yjem7kgxpkldwl57w3s2a%40rjmry6wne5xy>
Precedence: bulk

Hi,

On 2026-01-30 12:11:44 -0500, Andres Freund wrote:
> > In the attached 0004 patch I've experimented with this again. This
> > time, I wrote a function that converts the null bitmap into the isnull
> > array using a lookup table.
>
> Oh. I was just thinking of something roughly like
>
>   int i nullbyte_i = attnum >> 3;
>   for (int nullcol = attnum; nullcol < natts; nullcol += 8)
>   {
>       bits8 nullbyte = bp[nullbyte_i++];
>
>       for (int onebyte = 0; onebyte < 8; onebyte++)
>       {
>           if (nullcol < natts)
>              tts_isnull[nullcol] = nullbyte & 0x01;
>           nullbyte >>= 1;
>           nullcol++;
>       }
>   }
>
> This isn't quite right, as we'd need to deal with starting to deform at a
> attribute that's not % 8 = 0 (I'd probably just do the whole byte even if we'd
> redo a few column)). And probably lots of other stuff.
>
> With a bit of care the inner loop should be unrollable with all the moves as
> conditional moves depending on nullcol < natts. Or such.
>
> Or we could just make sure tts_isnull is always sized to be divisible by 8,
> that'd presumably allow considerably better code to be generated.
>
>
> I'd hope this would be more efficient than doing
>
> 	static inline bool
> 	att_isnull(int ATT, const bits8 *BITS)
> 	{
> 		return !(BITS[ATT >> 3] & (1 << (ATT & 0x07)));
> 	}
>
> for each column.

I couldn't quite let go of this, thinking there had to be a non-simd way to do
this efficiently (by basically doing SWAR).  And I think there is:

We can multiply one byte of the null bitmap with a carefully chosen value that
spreads each bit into the higher bytes. E.g.

0b111 * (1 <<  0)           = 0b                111
0b111 * (1 <<  7)           = 0b        11_10000000
0b111 * ((1 << 0) | (1 << 7))
                            = 0b        11_10000111
0b111 * (1 << 14)           = 0b1_11000000_00000000
0b111 * ((1 << 0) | (1 << 7) | ( << 14))
                            = 0b1_11000011_10000111
...

Then, we can fairly easily mask out the unnecessary bits.


However, as maybe apparent above, this won't work for the input 0xff, as the
carry from each byte's multiplications will "overflow its byte" and flip the
next bit to 0. But that's not hard to handle:

a) A single branch would probably be fine?

b) Compute the low 4 bits and high 4 bits of the bitmap in parallel and or
   the result together. By just looking at 4 null bits, no carry overflow is
   possible. Due to being branchless, that's propbably better.


Something like

    /*
     * The bits array has 1s for values and 0s for NULLs. Bit-flip that to
     * get 1s for NULLs.
     */
    uint8_t nullbyte = ~bp[nullbyte_i++];
    /* 8 bytes where each byte is 0 or 1 depending on whether null bitmap is set */
    uint64_t isnull_8;

    /*
     * Multiplier ensuring that input bit 0 is reflected in output bit 0, input bit 1 at output bit 8, etc.
     * Other bits also will often be set and need to be masked away.
     */
    uint32_t spread_bits_32 = (1U << 0) | (1U << 7) | (1U << 14) | (1U << 21);
    uint64_t mask_bits_64 = 0x0101010101010101ULL;

    /* convert the lower 4 bits of null bitmap word into 32 bit int */
    isnull_8 = (nullbyte & 0xf) * spread_bits_32;
    /* convert the upper 4 bits of null bitmap word into 32 bit int, shift into the upper 32 bit */
    isnull_8 |= ((uint64_t)((nullbyte >> 4) * spread_bits_32)) << 32;

    /* mask out all the bogus bits (could also be done as a 32bit op?)*/
    isnull_8 &= mask_bits_64;

    memcpy(&tts_isnull[nullcol], &isnull_8, sizeof(isnull_8));


should, I think, be better than a 2kB array.  Perhaps not quite as fast as the
AVX512 method, but it should be decent on most hardware...

Greetings,

Andres Freund