Date: Fri, 30 Jan 2026 15:01:52 +0000
To: Andres Freund <andres@anarazel.de>, Matheus Alcantara <matheusssilv97@gmail.com>, "pgsql-hackers@lists.postgresql.org" <pgsql-hackers@lists.postgresql.org>
From: Pierre Ducroquet <p.psql@pinaraf.info>
Subject: Re: [PATCH] llvmjit: always add the simplifycfg pass
Message-ID: <pJBA_YlJwSojFSBFctsdfSOfoSv2cPS9u68eH1niIUFzYj8eImTRvNCx1jaKGbBsHMM2o6plKbQZlBcoLqG7GjK0scAeuior6SkmggWrmLs=@pinaraf.info>
In-Reply-To: <porx6mjfalwefma2f2d76hagxjin3xdgefjsklkzxsyit736ly@34ubivd7buw2>
References: <VS3dpR1Sf5jGnWwoFFJ-_x3GbW7fdmV0arzWPIDfrmbVzewifgu6DsQ7oDa-TAwRz9N2p817j3jGstHwfPOJJxOipbcp-nHdNj3zyxKvC4Q=@pinaraf.info> <DFVDQRXJX7QW.KLYVOJSQW08Y@gmail.com> <H9LI9Enj4-NPP6t2g1RB9KMGkkBwzWjQwfiSLHLOTnT7YUwVPYSu_pMHwQLwwzGQGp54DQcER-eLngTa1GzVjH5Q0addrvfalukYnszTjMY=@pinaraf.info> <DFVG0AV650GW.2CNS5CZ4OG788@gmail.com> <TyPuJ3RPE7iMOji1DSq1IIHlq_RtGBgG5YrJJyeocYIxWMnlx8EsW93R_qdb9uYYofmelyF4nrP1rao5RrWKYSUyi3BlSO9ZpzgHzUvsJH4=@pinaraf.info> <porx6mjfalwefma2f2d76hagxjin3xdgefjsklkzxsyit736ly@34ubivd7buw2>
Archived-At: <https://www.postgresql.org/message-id/pJBA_YlJwSojFSBFctsdfSOfoSv2cPS9u68eH1niIUFzYj8eImTRvNCx1jaKGbBsHMM2o6plKbQZlBcoLqG7GjK0scAeuior6SkmggWrmLs%3D%40pinaraf.info>
Precedence: bulk

On Thursday, 29 January 2026 at 12:19 AM, Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
>
> > Here is a rebased version of the patch with a rewrite of the comment. Thank
> > you again for your previous review. FYI, I've tried adding other passes but
> > none had a similar benefit-to-cost ratio. Further gains would more likely
> > come from replacing O3 with an explicit, extensive list of passes.
>
> I agree that we should have a better list of passes. I'm a bit worried that
> having an explicit list of passes that we manage ourselves is going to be
> somewhat of a pain to maintain across llvm versions, but ...
>
> WRT passes that might be worth having even with -O0 - running duplicate
> function merging early on could be quite useful, particularly because we won't
> inline the deform routines anyway.
>
> > > I did some benchmarks on some TPC-H queries (1 and 4) and I got these
> > > results. Note that for these tests I set jit_optimize_above_cost=1000000
> > > so that it is forced to use the default<O0> pass with simplifycfg.
>
> FYI, you can use -1 to just disable it, instead of having to rely on a specific
> cost.
>
> > > Master Q1:
> > > Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> > > Execution Time: 38221.318 ms
> > >
> > > Patch Q1:
> > > Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> > > Execution Time: 38257.797 ms
> > >
> > > Master Q4:
> > > Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> > > Execution Time: 19512.134 ms
> > >
> > > Patch Q4:
> > > Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> > > Execution Time: 16051.483 ms
> > >
> > > For Q4 I see a small increase in the Optimization phase but we have a good
> > > performance improvement in execution time. For Q1 the results are almost
> > > the same.
>
> These queries are all simple enough that I'm not sure this is a particularly
> good benchmark for optimization speed. In particular, the deform routines
> don't have to deal with a lot of columns and there aren't a lot of functions
> (although I guess that shouldn't really matter WRT simplifycfg).
>
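
For reference, the relative changes implied by the timings quoted above can be checked with a short Python sketch. The numbers are copied from the quoted EXPLAIN output; the labels and the helper name relative_change are illustrative and not part of the patch.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# (master, patch) timings in milliseconds, copied from the quoted results.
timings = {
    "Q1 JIT total": (156.116, 154.927),
    "Q1 execution": (38221.318, 38257.797),
    "Q4 JIT optimization": (5.098, 5.234),
    "Q4 JIT total": (12.983, 12.648),
    "Q4 execution": (19512.134, 16051.483),
}

# Negative values mean the patched build was faster than master.
changes = {name: relative_change(m, p) for name, (m, p) in timings.items()}

for name, pct in changes.items():
    print(f"{name:<20} {pct:+7.2f} %")
```

On these numbers the Q4 execution time drops by a bit under 18 %, while the Q1 execution time moves by about a tenth of a percent, which is well within what a single run can vary by.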

simplifycfg seems to do more on the deform functions than I initially anticipated, which explains the performance benefits. I've written patches to our C code to generate better IR, but I ran into quite a puzzle.
The biggest gain I see in the generated amd64 code for a very simple query (SELECT * FROM demo WHERE a = 42) with simplifycfg is that it prevents spilling to the stack, doing what mem2reg was supposed to do.


Running opt -debug-pass-manager on a deform function, I get:
- with default<O0>,mem2reg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (56 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running analysis: TargetIRAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    %rax, -48(%rsp)                 # 8-byte Spill
        movq    32(%rdi), %rax
        movq    %rax, -40(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $4, %rax
        movq    %rax, -32(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $6, %rax
        movq    %rax, -24(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $72, %rax
        movq    %rax, -16(%rsp)                 # 8-byte Spill
...


- with default<O0>,simplifycfg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movslq  %edx, %rdx
        addq    %rdx, %rcx
        movl    72(%rdi), %edx
...

- with default<O0>,simplifycfg,mem2reg

Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (46 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movb    $0, (%rsi)
...


So even when running only simplifycfg, the stack allocation goes away.
I am still trying to figure that one out, but I suspect that mem2reg alone no longer performs the optimizations we thought it did, hence the (surprising) speed gains with simplifycfg.
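
To make the comparison above easier to repeat, here is a minimal Python sketch that prints one opt command line per pipeline and counts spill annotations in assembly text. It assumes the deform function has been dumped to a file (deform.ll is a hypothetical name) and that a line counts as a spill when its trailing comment contains the word "Spill", as in the first listing above. The helper names are illustrative, and how the assembly is produced from the optimized IR is left out.

```python
import re
import shlex

# The three pipelines compared in this message.
PIPELINES = [
    "default<O0>,mem2reg",
    "default<O0>,simplifycfg",
    "default<O0>,simplifycfg,mem2reg",
]

# Matches assembly comments such as "# 8-byte Spill".
SPILL_COMMENT = re.compile(r"#.*\bSpill\b")


def opt_command(pipeline: str, ir_path: str) -> str:
    """Build a shell-quoted opt invocation that logs the passes it runs."""
    args = ["opt", f"-passes={pipeline}", "-debug-pass-manager",
            "-S", ir_path, "-o", "-"]
    return shlex.join(args)


def count_spills(asm: str) -> int:
    """Count assembly lines whose comment marks a register spill."""
    return sum(1 for line in asm.splitlines() if SPILL_COMMENT.search(line))


# Short excerpts copied from the listings above.
MEM2REG_ASM = """\
        movq    24(%rdi), %rax
        movq    %rax, -48(%rsp)                 # 8-byte Spill
        movq    32(%rdi), %rax
        movq    %rax, -40(%rsp)                 # 8-byte Spill
"""
SIMPLIFYCFG_ASM = """\
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
"""

for pipeline in PIPELINES:
    # deform.ll is a hypothetical file name for the dumped IR.
    print(opt_command(pipeline, "deform.ll"))
print("spills with mem2reg only:", count_spills(MEM2REG_ASM))
print("spills with simplifycfg only:", count_spills(SIMPLIFYCFG_ASM))
```

Counting spill comments is only a rough proxy for stack traffic, but it is enough to tell the mem2reg-only output apart from the other two.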


Note:
Ubuntu LLVM version 19.1.7
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver5