Re: More speedups for tuple deformation

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Andres Freund <[email protected]>
To: David Rowley <[email protected]>
Cc: John Naylor <[email protected]>
Cc: Chao Li <[email protected]>
Cc: PostgreSQL Developers <[email protected]>
Subject: Re: More speedups for tuple deformation
Date: Wed, 25 Feb 2026 15:29:01 -0500
Message-ID: <uhqul2ryci4tyg5ylddjrmf4kybzwb7m5z7rmurhhjp37vrn5f@zgxil7egr62n> (raw)
In-Reply-To: <mq6ddpgctt42srolsvo5kph2s6shfg62meb7i5fbg6n3s73zju@2n7gviiyga3h>
References: <CAApHDvpbntG7V3_EsZ+w-V=jU-y8rFmv9RB1EDJm4sxKno-4UA@mail.gmail.com>
	<e7sto7tk5dk5hfyvoocaddnxcngemcmfvbuh23l32w5cssaizy@znuphjqug7qe>
	<CAApHDvpuEbhvH1ViCZRz5vks+_bGbEnPoEdZYAZXK76_isb_+Q@mail.gmail.com>
	<v6z545yozjtywghn5glujemu72z4i4ynadsc2xks4ejotdg7yl@4rry7ixwr4us>
	<CANWCAZabO1oj+khF+YNVpmkTQwRRyNJesbsBhRFL5emZJh3tow@mail.gmail.com>
	<lzgoxzbh2gel5w362revuwaecrsbjr44kjdzrewuejugcodkeq@ixymojwnylsy>
	<CAApHDvodSVBj3ypOYbYUCJX+NWL=VZs63RNBQ_FxB_F+6QXF-A@mail.gmail.com>
	<rbxc2qqhsvzxpukgd36caoa4ydgn5r22fxktxanrkn6nobg7j6@27b4vogohgu2>
	<CAApHDvpWQn8sXDYpSNNpieJW-UTG4Nf4TVjT8ew64L073hz-Fw@mail.gmail.com>
	<mq6ddpgctt42srolsvo5kph2s6shfg62meb7i5fbg6n3s73zju@2n7gviiyga3h>

Hi,

On 2026-02-25 13:05:14 -0500, Andres Freund wrote:
> At least gcc is doing some truly weird shit in the
> firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before
> the first entrance to the loop) , which leads to the register pressure being
> high, which leads to spilling on the stack, making the few-tuples case slower:
>
> [ lots of stuff trimmed ]
> 
> I.e. the compiler creates an offset version of tts_values[tts_nvalid],
> tts_isnull[tts_nvalid], which then creates register allocation pressure,
> because later the original tts_values/tts_isnulll etc are accessed again and
> thus the underlying registers are preserved.  And this is all for zero gain,
> from what I can tell, because the acceses are still done with indexed
> addressing  (like  mov           %rdi,(%r12,%rcx,8)), which would work just as
> well if rcx were indexed based on attnum, not zero indexed within the loop.
> 
> I see about a 10% improvement if I dissuade the compiler from doing that by
> adding
>   __asm__ volatile ("" : "+r"(attnum) : :);
> 
> In the loop body.
> 
> 
> I'm getting to the point where I'd like to just hand write the assembler for
> this stupid function. Gah.

Huh.  It, at least partially, seems to be related to using an integer for
attnum et al. Due to us using -fwrapv, the compiler can't actually assume that
an attnum++ won't overflow. An overflow would make the loop trip counts a lot
more complicated.   Even with that I don't understand how it ends up
generating such crappy code, but since using size_t fixes it...

Greetings,

Andres Freund

view thread (30+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: More speedups for tuple deformation
  In-Reply-To: <uhqul2ryci4tyg5ylddjrmf4kybzwb7m5z7rmurhhjp37vrn5f@zgxil7egr62n>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox