Feedback-ID: id4a34324:Fastmail
Date: Wed, 25 Feb 2026 13:05:14 -0500
From: Andres Freund <andres@anarazel.de>
To: David Rowley <dgrowleyml@gmail.com>
Cc: John Naylor <johncnaylorls@gmail.com>, 
	Chao Li <li.evan.chao@gmail.com>, PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
Subject: Re: More speedups for tuple deformation
Message-ID: <mq6ddpgctt42srolsvo5kph2s6shfg62meb7i5fbg6n3s73zju@2n7gviiyga3h>
References: <p2oo6g2fg5nxbsspbyeauxhbyyi3zsx3tt4prkgju3j5qjo7lf@rjlhxusxzam2>
 <CAApHDvpbntG7V3_EsZ+w-V=jU-y8rFmv9RB1EDJm4sxKno-4UA@mail.gmail.com>
 <e7sto7tk5dk5hfyvoocaddnxcngemcmfvbuh23l32w5cssaizy@znuphjqug7qe>
 <CAApHDvpuEbhvH1ViCZRz5vks+_bGbEnPoEdZYAZXK76_isb_+Q@mail.gmail.com>
 <v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us>
 <CANWCAZabO1oj+khF+YNVpmkTQwRRyNJesbsBhRFL5emZJh3tow@mail.gmail.com>
 <lzgoxzbh2gel5w362revuwaecrsbjr44kjdzrewuejugcodkeq@ixymojwnylsy>
 <CAApHDvodSVBj3ypOYbYUCJX+NWL=VZs63RNBQ_FxB_F+6QXF-A@mail.gmail.com>
 <rbxc2qqhsvzxpukgd36caoa4ydgn5r22fxktxanrkn6nobg7j6@27b4vogohgu2>
 <CAApHDvpWQn8sXDYpSNNpieJW-UTG4Nf4TVjT8ew64L073hz-Fw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAApHDvpWQn8sXDYpSNNpieJW-UTG4Nf4TVjT8ew64L073hz-Fw@mail.gmail.com>
Archived-At: <https://www.postgresql.org/message-id/mq6ddpgctt42srolsvo5kph2s6shfg62meb7i5fbg6n3s73zju%402n7gviiyga3h>
Precedence: bulk

Hi,

On 2026-02-25 13:59:53 +1300, David Rowley wrote:
> On Wed, 25 Feb 2026 at 03:39, Andres Freund <andres@anarazel.de> wrote:
> > ISTM we should just merge 0004. In my testing it's a very clear win, without,
> > afaict, any downsides.
>
> I'd like to get them in in sequence as I believe 0004 buys back some
> extra overheads such as the Min()s in slot_deform_heap_tuple(). If I
> were to do 0004 first, then wait a while, it might look more like I'm
> introducing a small regression.

I'd rather get more stuff merged out of the way. We can handle a small
regression by comparing to 18 instead.  I think we tend to be too worried
about intra-master regressions, and not worried enough about intra branch
regressions.

But FWIW, I don't actually see any regressions due to the Min() on my hardware
as the patches stand.


I do actually see a small regression in some cases due to 0004, but that's
easy to fix:

	/* Fetch any missing attrs and raise an error if reqnatts is invalid */
	if (unlikely(attnum < reqnatts))
		slot_getmissingattrs(slot, attnum, reqnatts);
done:

	/* Save current offset for next execution */
	*offp = off;


This forces pushing a bunch of stack state before the call to
slot_getmissingattrs, doing the call, returning, restoring state from the
stack, just to then jump to the end of the function to set *offp which then
again does a bunch of stack restoring and returns.


If you instead make it something like:
	/* Fetch any missing attrs and raise an error if reqnatts is invalid */
	if (unlikely(attnum < reqnatts))
	{
		*offp = off;
		slot_getmissingattrs(slot, attnum, reqnatts);
		return;
	}
(or just duplicate the *offp = off in the goto done; case)

the overhead seems to be removed.


At least gcc is doing some truly weird shit in the
firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before
the first entrance to the loop) , which leads to the register pressure being
high, which leads to spilling on the stack, making the few-tuples case slower:


r9 is tts_nvalid, r11 is firstNonGuaranteedAttr, r15 is
slot->tts_tupleDescriptor, rbx is tts_values, -0x10(%rsp) is tts_isnull, rax
is tup.


            if (attnum < firstNonGuaranteedAttr)
            cmp    %r9d,%r11d
            jle    <tts_buffer_heap_getsomeattrs+0x3c0>
            lea    0x0(,%r9,8),%r12

    multiply r9/tts_nvalid by 8, store in r12

            sub    %r9d,%r11d

    r11=firstNonGuaranteedAttr-tts_nvalid

            xor    %ecx,%ecx
    Set ecx to 0

            lea    0x20(%r15,%r12,1),%rdx

    compute (char *) tupleDesc->compact_attrs + r12, i.e. the address of the first
    needed compact attribute

            add    %rbx,%r12

    compute (char*) tts_values + tts_nvalid * sizeof(Datum)

            mov    -0x10(%rsp),%rbx

    restore tts_isnull from stack

            lea    (%rbx,%r9,1),%rsi

    compute tts_isnull[tts_nvalid]

            jmp    <tts_buffer_heap_getsomeattrs+0xee>


Then the loop body:
    ...
            movb          $0x0,(%rsi,%rcx,1)
    set (tts_isnull + tts_nvalid)[i] = 0

            mov           %rdx,%r13
            movswl        (%rdx),%edi

    load cattr->attcacheoff into rdi/edi

            movzwl        0x2(%rdx),%r10d

    load attlen into r10

            mov           %rdi,%rbx
            add           %rax,%rdi

    rdi = tup + attcacheoff

    [bunch of attlen related branches omitted]

            movslq        (%rdi),%rdi
    store *(tup + off) in rdi

            mov           %rdi,(%r12,%rcx,8)
    store rdi in tts_values[i]

            lea           0x1(%rcx),%rdi

    increment i by one, in a weird way, store in rdi

            add           $0x8,%rdx

    increment cattr by 8

            cmp           %rdi,%r11

    check if i < (firstNonGuaranteedAttr-tts_nvalid)


I.e. the compiler creates an offset version of tts_values[tts_nvalid],
tts_isnull[tts_nvalid], which then creates register allocation pressure,
because later the original tts_values/tts_isnulll etc are accessed again and
thus the underlying registers are preserved.  And this is all for zero gain,
from what I can tell, because the acceses are still done with indexed
addressing  (like  mov           %rdi,(%r12,%rcx,8)), which would work just as
well if rcx were indexed based on attnum, not zero indexed within the loop.

I see about a 10% improvement if I dissuade the compiler from doing that by
adding
  __asm__ volatile ("" : "+r"(attnum) : :);

In the loop body.


I'm getting to the point where I'd like to just hand write the assembler for
this stupid function. Gah.


> > > I'm not getting great results from benchmarking the 0005 patch. I
> > > verified that gcc does access the array without calculating the
> > > element address from scratch each time and calculates it once, then
> > > increments the pointer by sizeof(CompactAttribute). See the attached
> > > .csv for the results on the 3 machines I tested on.
> >
> > FWIW, where I had seen that be rather beneficial is the TupleDescCompactAttr()
> > at the start of the various loops, where the compiler has little choice to
> > compute the address of the tupdesc->compact_attrs[firstNeededCol].  That
> > matters only when only deforming a small number of columns, of course.
>
> oh ok. I wasn't aware that LEA's scaling factor can only be 1,2 4 or
> 8. With the 8-byte struct, the compiler should be able to do the shift
> and add as one operation, whereas with the 16-byte struct would
> require a separate shift and add.
>
> Looking at the generated code, with 0004, I see:
>
>     1c79: 48 c1 e2 04          shl    rdx,0x4
>     1c7d: 48 8d 4c 15 20        lea    rcx,[rbp+rdx*1+0x20]
>
> whereas with 0005 I see:
>
>     1c6b: 4a 8d 1c dd 00 00 00 lea    rbx,[r11*8+0x0]
>
> Is that what you meant?

Yep.


> > > I've also resequenced the patches to make the deform_bench test module
> > > part of the 0001 patch. This makes it easier to test the performance
> > > of master.
> >
> > What are your thoughts about merging the deform_bench tooling?  I wonder if we
> > should have src/test/modules/benchmark_tools or such, so we can add a few more
> > micro-benchmarky tools over time?
>
> I'd like to see us give these tools a proper home. It helps lower the
> bar for anyone else who'd like to experiment at some future date, and
> also allows people to more easily test for performance regressions if
> they're forced to change related code. I've also got a tool that
> benchmarks the MemoryContext code which I keep in some local repo that
> I dig out from time to time. Given that, it's probably unlikely
> deform_bench would be the only extension in there if we did make a
> directory for these.

I'd probably lean towards one extensions with different functions for
different purposes, but that's boring details.


> On the otherhand, it does add some maintenance overhead, but IMO,
> helping to ensure various key routines are optimal is a worthy enough
> cause to make the maintenance overhead worthwhile.

Agreed.

Greetings,

Andres Freund