Re: [PATCH] llvmjit: always add the simplifycfg pass

public inbox for [email protected]  

Re: [PATCH] llvmjit: always add the simplifycfg pass
8+ messages / 3 participants

* Re: [PATCH] llvmjit: always add the simplifycfg pass
@ 2026-01-22 19:54 Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Matheus Alcantara @ 2026-01-22 19:54 UTC (permalink / raw)
  To: Pierre Ducroquet <[email protected]>; [email protected] <[email protected]>

Hi,

On 07/01/26 12:08, Pierre Ducroquet wrote:
> Hi
> 
> While reading the code generated by llvmjit, I realized the number of LLVM basic blocks used in tuple deforming was directly visible in the generated assembly code with the following code:
> 0x723382b781c1: jmp 0x723382b781c3
> 0x723382b781c3: jmp 0x723382b781eb
> 0x723382b781c5: mov -0x20(%rsp),%rax
> 0x723382b781..: ... .....
> 0x723382b781e7: mov %cx,(%rax)
> 0x723382b781ea: ret
> 0x723382b781eb: jmp 0x723382b781ed
> 0x723382b781ed: jmp 0x723382b781ef
> 0x723382b781ef: jmp 0x723382b781f1
> 0x723382b781f1: jmp 0x723382b781f3
> 0x723382b781f3: mov -0x30(%rsp),%rax
> 0x723382b781..: ... ......
> 0x723382b78208: mov %rcx,(%rax)
> 0x723382b7820b: jmp 0x723382b781c5
> That's a lot of useless jumps, and LLVM has a specific pass to get rid of these. The attached patch modifies the llvmjit code to always call this pass, even below jit_optimize_above_cost.
> 
> On a basic benchmark (a simple select * from table where f = 42), this optimization saved 7ms of runtime while using only 0.1 ms of extra optimization time.
> 
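
To make the quoted disassembly concrete, here is a toy model of what removing jump-only blocks amounts to. It is not LLVM's implementation: the block numbering, the jump_only array and sketch_final_target() are hypothetical, and simplifycfg does considerably more than this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a function's basic blocks (hypothetical, not LLVM code).
 * jump_only[i] >= 0 means block i contains nothing but an unconditional
 * jump to block jump_only[i]; -1 means the block does real work.
 * All targets are assumed to be valid indexes into the array.
 */
int
sketch_final_target(const int *jump_only, size_t nblocks, int start)
{
    int     cur = start;
    size_t  hops = 0;

    assert(start >= 0 && (size_t) start < nblocks);

    /* Follow jump-only blocks; the hop limit guards against cycles. */
    while (jump_only[cur] >= 0 && hops < nblocks)
    {
        cur = jump_only[cur];
        hops++;
    }
    return cur;
}
```

Numbering the eight blocks of the disassembly above from 0, the array would be {1, 3, -1, 4, 5, 6, 7, -1}. Starting from block 0 the function returns 7, the block at 0x723382b781f3 where real work begins, so the six jumps in between are pure overhead.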

The patch needs a rebase due to e5d99b4d9ef.

You've added "simplifycfg" only when "jit_optimize_above_cost" is not
triggered, in which case the default<O0> and mem2reg passes are used.
Does the default<O3> pipeline already include "simplifycfg"?

With e5d99b4d9ef committed, should we also add "simplifycfg" when the
PGJIT_INLINE bit is set, since that path also uses the default<O0> and
mem2reg passes?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
@ 2026-01-22 20:27 ` Pierre Ducroquet <[email protected]>
  2026-01-22 21:40   ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Pierre Ducroquet @ 2026-01-22 20:27 UTC (permalink / raw)
  To: Matheus Alcantara <[email protected]>; +Cc: [email protected] <[email protected]>

On Thursday, January 22, 2026 at 8:54 PM, Matheus Alcantara <[email protected]> wrote:

> Hi,
> 
> On 07/01/26 12:08, Pierre Ducroquet wrote:
> 
> > Hi
> > 
> > While reading the code generated by llvmjit, I realized the number of LLVM basic blocks used in tuple deforming was directly visible in the generated assembly code with the following code:
> > 0x723382b781c1: jmp 0x723382b781c3
> > 0x723382b781c3: jmp 0x723382b781eb
> > 0x723382b781c5: mov -0x20(%rsp),%rax
> > 0x723382b781..: ... .....
> > 0x723382b781e7: mov %cx,(%rax)
> > 0x723382b781ea: ret
> > 0x723382b781eb: jmp 0x723382b781ed
> > 0x723382b781ed: jmp 0x723382b781ef
> > 0x723382b781ef: jmp 0x723382b781f1
> > 0x723382b781f1: jmp 0x723382b781f3
> > 0x723382b781f3: mov -0x30(%rsp),%rax
> > 0x723382b781..: ... ......
> > 0x723382b78208: mov %rcx,(%rax)
> > 0x723382b7820b: jmp 0x723382b781c5
> > That's a lot of useless jumps, and LLVM has a specific pass to get rid of these. The attached patch modifies the llvmjit code to always call this pass, even below jit_optimize_above_cost.
> > 
> > On a basic benchmark (a simple select * from table where f = 42), this optimization saved 7ms of runtime while using only 0.1 ms of extra optimization time.
> 
> 
> The patch needs a rebase due to e5d99b4d9ef.
> 
> You've added "simplifycfg" only when "jit_optimize_above_cost" is not
> triggered, in which case the default<O0> and mem2reg passes are used.
> Does the default<O3> pipeline already include "simplifycfg"?
> 
> With e5d99b4d9ef committed, should we also add "simplifycfg" when the
> PGJIT_INLINE bit is set, since that path also uses the default<O0> and
> mem2reg passes?

Hi

Thank you, here is a rebased version of the patch.
To answer your questions:
- O3 already includes simplifycfg, so no need to modify O3
- any code generated by our llvmjit provider, esp. tuple deforming, is heavily dependent on simplifycfg, so when O0 is the basis we should always add this pass
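
The pass selection described in these two points can be sketched as follows. This is a minimal, standalone illustration: the SKETCH_ flag macros and sketch_choose_passes() are hypothetical stand-ins for the PGJIT_OPT3 / PGJIT_INLINE checks in llvm_optimize_module(), and the flag values are illustrative only.

```c
#include <stdint.h>

/*
 * Illustrative stand-ins for the JIT flag bits. The real definitions
 * live in the PostgreSQL headers and may use different values.
 */
#define SKETCH_PGJIT_OPT3   (1u << 1)
#define SKETCH_PGJIT_INLINE (1u << 2)

/* Return the pass pipeline string for a given set of JIT flags. */
const char *
sketch_choose_passes(uint32_t flags)
{
    /* O3 already runs simplifycfg, so nothing is added there. */
    if (flags & SKETCH_PGJIT_OPT3)
        return "default<O3>";

    /* Inlining without expensive optimization: O0 basis plus inline. */
    if (flags & SKETCH_PGJIT_INLINE)
        return "default<O0>,mem2reg,simplifycfg,inline";

    /* Plain O0 basis: always add mem2reg and simplifycfg. */
    return "default<O0>,mem2reg,simplifycfg";
}
```

The point of the sketch is only that every branch built on default<O0> carries simplifycfg, while the O3 branch is left untouched.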



Attachments:

  [text/x-patch] 0001-llvmjit-always-use-the-simplifycfg-pass.patch (2.6K, 2-0001-llvmjit-always-use-the-simplifycfg-pass.patch)
  download | inline diff:
From cb5cb74461ac9407c16903bfa9d2855f4e76918e Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet <[email protected]>
Date: Wed, 7 Jan 2026 15:43:19 +0100
Subject: [PATCH] llvmjit: always use the simplifycfg pass

The simplifycfg pass will remove empty or unreachable LLVM basic blocks,
and merge blocks together when possible.
This is important because the tuple deforming code will generate a lot of
basic blocks, and previously with O0 we did not run this pass, thus creating
this kind of (amd64) machine code:
   0x723382b781c1:      jmp    0x723382b781c3
   0x723382b781c3:      jmp    0x723382b781eb
   0x723382b781c5:      mov    -0x20(%rsp),%rax
   0x723382b781..:      ...    .....
   0x723382b781e7:      mov    %cx,(%rax)
   0x723382b781ea:      ret
   0x723382b781eb:      jmp    0x723382b781ed
   0x723382b781ed:      jmp    0x723382b781ef
   0x723382b781ef:      jmp    0x723382b781f1
   0x723382b781f1:      jmp    0x723382b781f3
   0x723382b781f3:      mov    -0x30(%rsp),%rax
   0x723382b781..:      ...    ......
   0x723382b78208:      mov    %rcx,(%rax)
   0x723382b7820b:      jmp    0x723382b781c5

This is not efficient at all, and triggering the simplifycfg pass ends up
taking a few hundred microseconds while possibly saving much more time
during execution. On a basic benchmark, I saved 7ms on query runtime while
spending 0.2ms on extra JIT compilation overhead
---
 src/backend/jit/llvm/llvmjit.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 2e8aa4749db..c22f83e97cf 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -633,6 +633,11 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	{
 		/* we rely on mem2reg heavily, so emit even in the O0 case */
 		LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
+		/*
+		 * the tuple deforming generates a lot of basic blocks,
+		 * simplify them even with O0
+		 */
+		LLVMAddCFGSimplificationPass(llvm_fpm);
 	}
 
 	LLVMPassManagerBuilderPopulateFunctionPassManager(llvm_pmb, llvm_fpm);
@@ -676,10 +681,10 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 		passes = "default<O3>";
 	else if (context->base.flags & PGJIT_INLINE)
 		/* if doing inlining, but no expensive optimization, add inline pass */
-		passes = "default<O0>,mem2reg,inline";
+		passes = "default<O0>,mem2reg,simplifycfg,inline";
 	else
 		/* default<O0> includes always-inline pass */
-		passes = "default<O0>,mem2reg";
+		passes = "default<O0>,mem2reg,simplifycfg";
 
 	options = LLVMCreatePassBuilderOptions();
 
-- 
2.43.0

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
@ 2026-01-22 21:40   ` Matheus Alcantara <[email protected]>
  2026-01-28 07:56     ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Matheus Alcantara @ 2026-01-22 21:40 UTC (permalink / raw)
  To: Pierre Ducroquet <[email protected]>; +Cc: [email protected] <[email protected]>

On Thu Jan 22, 2026 at 5:27 PM -03, Pierre Ducroquet wrote:
>> The patch needs a rebase due to e5d99b4d9ef.
>> 
>> You've added "simplifycfg" only when "jit_optimize_above_cost" is not
>> triggered, in which case the default<O0> and mem2reg passes are used.
>> Does the default<O3> pipeline already include "simplifycfg"?
>> 
>> With e5d99b4d9ef committed, should we also add "simplifycfg" when the
>> PGJIT_INLINE bit is set, since that path also uses the default<O0> and
>> mem2reg passes?
>
> Hi
>
> Thank you, here is a rebased version of the patch.
> To answer your questions:
> - O3 already includes simplifycfg, so no need to modify O3
> - any code generated by our llvmjit provider, esp. tuple deforming, is heavily dependent on simplifycfg, so when O0 is the basis we should always add this pass

Thanks for confirming.

I ran some benchmarks on TPC-H queries 1 and 4 and got the results
below. Note that for these tests I set jit_optimize_above_cost=1000000
to force the default<O0> pipeline with simplifycfg.

Master Q1:
    Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
    Execution Time: 38221.318 ms

Patch Q1:
    Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
    Execution Time: 38257.797 ms

Master Q4:
    Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
    Execution Time: 19512.134 ms

Patch Q4:
    Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
    Execution Time: 16051.483 ms


For Q4 I see a small increase in the Optimization phase, but a good
improvement in execution time. For Q1 the results are almost the same.

I did not find any major regression with the simplifycfg pass, and I
think it makes sense to enable it, since it produces better IR for LLVM
to compile without much extra cost. +1 for this patch.
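
The relative changes behind these statements can be worked out from the Execution Time lines above. A small helper, with a hypothetical name, just to make the arithmetic explicit:

```c
#include <assert.h>

/*
 * Relative change from before_ms to after_ms, in percent.
 * A negative result means the "after" run was faster.
 */
double
sketch_percent_change(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (after_ms - before_ms) / before_ms * 100.0;
}
```

With the numbers above, Q4 goes from 19512.134 ms to 16051.483 ms, a change of about -17.7%, while Q1 goes from 38221.318 ms to 38257.797 ms, about +0.1%, which is small enough that it may well be run-to-run noise.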

Perhaps we could merge the comments on the if/else block into a single
comment that also covers simplifycfg. What do you think?

+       /*
+        * Determine the LLVM pass pipeline to use. For OPT3 we use the standard
+        * suite. For lower optimization levels, we explicitly include mem2reg to
+        * promote stack variables, simplifycfg to clean up the control flow, and
+        * optionally the inliner if the flag is set. Note that default<O0> already
+        * includes the always-inline pass.
+        */
        if (context->base.flags & PGJIT_OPT3)
                passes = "default<O3>";
        else if (context->base.flags & PGJIT_INLINE)
-               /* if doing inlining, but no expensive optimization, add inline pass */
                passes = "default<O0>,mem2reg,simplifycfg,inline";
        else
-               /* default<O0> includes always-inline pass */
                passes = "default<O0>,mem2reg,simplifycfg";

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-22 21:40   ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
@ 2026-01-28 07:56     ` Pierre Ducroquet <[email protected]>
  2026-01-28 12:37       ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-28 23:19       ` Re: [PATCH] llvmjit: always add the simplifycfg pass Andres Freund <[email protected]>
  0 siblings, 2 replies; 8+ messages in thread

From: Pierre Ducroquet @ 2026-01-28 07:56 UTC (permalink / raw)
  To: Matheus Alcantara <[email protected]>; +Cc: [email protected] <[email protected]>

Hi

Here is a rebased version of the patch with a rewrite of the comment.
Thank you again for your previous review.
FYI, I've tried adding other passes, but none had a similar benefit-to-cost ratio. Further gains are more likely to come from replacing O3 with an extensive list of passes.


On Thursday, January 22, 2026 at 10:41 PM, Matheus Alcantara <[email protected]> wrote:

> On Thu Jan 22, 2026 at 5:27 PM -03, Pierre Ducroquet wrote:
> 
> > > The patch needs a rebase due to e5d99b4d9ef.
> > > 
> > > You've added "simplifycfg" only when "jit_optimize_above_cost" is not
> > > triggered, in which case the default<O0> and mem2reg passes are used.
> > > Does the default<O3> pipeline already include "simplifycfg"?
> > > 
> > > With e5d99b4d9ef committed, should we also add "simplifycfg" when the
> > > PGJIT_INLINE bit is set, since that path also uses the default<O0> and
> > > mem2reg passes?
> > 
> > Hi
> > 
> > Thank you, here is a rebased version of the patch.
> > To answer your questions:
> > - O3 already includes simplifycfg, so no need to modify O3
> > - any code generated by our llvmjit provider, esp. tuple deforming, is heavily dependent on simplifycfg, so when O0 is the basis we should always add this pass
> 
> 
> Thanks for confirming.
> 
> I ran some benchmarks on TPC-H queries 1 and 4 and got the results
> below. Note that for these tests I set jit_optimize_above_cost=1000000
> to force the default<O0> pipeline with simplifycfg.
> 
> 
> Master Q1:
> Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> Execution Time: 38221.318 ms
> 
> Patch Q1:
> Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> Execution Time: 38257.797 ms
> 
> Master Q4:
> Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> Execution Time: 19512.134 ms
> 
> Patch Q4:
> Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> Execution Time: 16051.483 ms
> 
> 
> For Q4 I see a small increase in the Optimization phase, but a good
> improvement in execution time. For Q1 the results are almost the same.
> 
> I did not find any major regression with the simplifycfg pass, and I
> think it makes sense to enable it, since it produces better IR for LLVM
> to compile without much extra cost. +1 for this patch.
> 
> Perhaps we could merge the comments on the if/else block into a single
> comment that also covers simplifycfg. What do you think?
> 
> +       /*
> +        * Determine the LLVM pass pipeline to use. For OPT3 we use the standard
> +        * suite. For lower optimization levels, we explicitly include mem2reg to
> +        * promote stack variables, simplifycfg to clean up the control flow, and
> +        * optionally the inliner if the flag is set. Note that default<O0> already
> +        * includes the always-inline pass.
> +        */
>         if (context->base.flags & PGJIT_OPT3)
>                 passes = "default<O3>";
>         else if (context->base.flags & PGJIT_INLINE)
> -               /* if doing inlining, but no expensive optimization, add inline pass */
>                 passes = "default<O0>,mem2reg,simplifycfg,inline";
>         else
> -               /* default<O0> includes always-inline pass */
>                 passes = "default<O0>,mem2reg,simplifycfg";
> 
> --
> Matheus Alcantara
> EDB: https://www.enterprisedb.com
>

Attachments:

  [text/x-patch] 0001-llvmjit-always-use-the-simplifycfg-pass.patch (3.1K, 2-0001-llvmjit-always-use-the-simplifycfg-pass.patch)
  download | inline diff:
From 4f75fcc65137a757afac980dd9fb9718bc8dc6eb Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet <[email protected]>
Date: Wed, 7 Jan 2026 15:43:19 +0100
Subject: [PATCH 1/2] llvmjit: always use the simplifycfg pass

The simplifycfg pass will remove empty or unreachable LLVM basic blocks,
and merge blocks together when possible.
This is important because the tuple deforming code will generate a lot of
basic blocks, and previously with O0 we did not run this pass, thus creating
this kind of (amd64) machine code:
   0x723382b781c1:      jmp    0x723382b781c3
   0x723382b781c3:      jmp    0x723382b781eb
   0x723382b781c5:      mov    -0x20(%rsp),%rax
   0x723382b781..:      ...    .....
   0x723382b781e7:      mov    %cx,(%rax)
   0x723382b781ea:      ret
   0x723382b781eb:      jmp    0x723382b781ed
   0x723382b781ed:      jmp    0x723382b781ef
   0x723382b781ef:      jmp    0x723382b781f1
   0x723382b781f1:      jmp    0x723382b781f3
   0x723382b781f3:      mov    -0x30(%rsp),%rax
   0x723382b781..:      ...    ......
   0x723382b78208:      mov    %rcx,(%rax)
   0x723382b7820b:      jmp    0x723382b781c5

This is not efficient at all, and triggering the simplifycfg pass ends up
taking a few hundred microseconds while possibly saving much more time
during execution. On a basic benchmark, I saved 7ms on query runtime while
spending 0.2ms on extra JIT compilation overhead
---
 src/backend/jit/llvm/llvmjit.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 2e8aa4749db..491968d8b12 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -633,6 +633,11 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	{
 		/* we rely on mem2reg heavily, so emit even in the O0 case */
 		LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
+		/*
+		 * the tuple deforming generates a lot of basic blocks,
+		 * simplify them even with O0
+		 */
+		LLVMAddCFGSimplificationPass(llvm_fpm);
 	}
 
 	LLVMPassManagerBuilderPopulateFunctionPassManager(llvm_pmb, llvm_fpm);
@@ -672,14 +677,21 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
 	LLVMErrorRef err;
 	const char *passes;
 
+	/*
+	 * Determine the LLVM pass pipeline to use.
+	 * For OPT3 we use the standard suite.
+	 * For lower optimization levels, we explicitly include:
+	 * - mem2reg to promote stack variables,
+	 * - simplifycfg to clean up the control flow
+	 * When the inliner flag is set, the inline pass is added. Note that
+	 * default<O0> already includes the always-inline pass.
+	 */
 	if (context->base.flags & PGJIT_OPT3)
 		passes = "default<O3>";
 	else if (context->base.flags & PGJIT_INLINE)
-		/* if doing inlining, but no expensive optimization, add inline pass */
-		passes = "default<O0>,mem2reg,inline";
+		passes = "default<O0>,mem2reg,simplifycfg,inline";
 	else
-		/* default<O0> includes always-inline pass */
-		passes = "default<O0>,mem2reg";
+		passes = "default<O0>,mem2reg,simplifycfg";
 
 	options = LLVMCreatePassBuilderOptions();
 
-- 
2.43.0

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-22 21:40   ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-28 07:56     ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
@ 2026-01-28 12:37       ` Matheus Alcantara <[email protected]>
  1 sibling, 0 replies; 8+ messages in thread

From: Matheus Alcantara @ 2026-01-28 12:37 UTC (permalink / raw)
  To: Pierre Ducroquet <[email protected]>; +Cc: [email protected] <[email protected]>

On 28/01/26 04:56, Pierre Ducroquet wrote:
> Hi
> 
> Here is a rebased version of the patch with a rewrite of the comment.
> Thank you again for your previous review.
> FYI, I've tried adding other passes, but none had a similar benefit-to-cost ratio. Further gains are more likely to come from replacing O3 with an extensive list of passes.
> 

Thanks, the patch looks good.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-22 21:40   ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-28 07:56     ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
@ 2026-01-28 23:19       ` Andres Freund <[email protected]>
  2026-01-30 15:01         ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  1 sibling, 1 reply; 8+ messages in thread

From: Andres Freund @ 2026-01-28 23:19 UTC (permalink / raw)
  To: Pierre Ducroquet <[email protected]>; +Cc: Matheus Alcantara <[email protected]>; [email protected] <[email protected]>

Hi,

On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
> Here is a rebased version of the patch with a rewrite of the comment.  Thank
> you again for your previous review.  FYI, I've tried adding other passes, but
> none had a similar benefit-to-cost ratio. Further gains are more likely to
> come from replacing O3 with an extensive list of passes.

I agree that we should have a better list of passes. I'm a bit worried that
having an explicit list of passes that we manage ourselves is going to be
somewhat of a pain to maintain across LLVM versions, but ...

WRT passes that might be worth having even with -O0 - running duplicate
function merging early on could be quite useful, particularly because we won't
inline the deform routines anyway.


> > I ran some benchmarks on TPC-H queries 1 and 4 and got the results
> > below. Note that for these tests I set jit_optimize_above_cost=1000000
> > to force the default<O0> pipeline with simplifycfg.

FYI, you can use -1 to just disable it, instead of having to rely on a specific
cost.
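
For reference, a minimal sketch of why -1 works, assuming the planner only compares the plan cost against the threshold when the threshold is non-negative. The function and parameter names are hypothetical; this is not the planner's actual code.

```c
#include <stdbool.h>

/*
 * Decide whether a cost-based JIT step should be enabled for a plan.
 * A negative threshold disables the step regardless of plan cost,
 * which is what setting jit_optimize_above_cost to -1 relies on.
 */
bool
sketch_cost_exceeds(double plan_total_cost, double above_cost)
{
    return above_cost >= 0 && plan_total_cost > above_cost;
}
```

With a threshold of 1000000, a sufficiently expensive plan would still get the expensive optimizations; with -1 no plan does, so the benchmark always exercises the default<O0> path.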

> > 
> > Master Q1:
> > Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> > Execution Time: 38221.318 ms
> > 
> > Patch Q1:
> > Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> > Execution Time: 38257.797 ms
> > 
> > Master Q4:
> > Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> > Execution Time: 19512.134 ms
> > 
> > Patch Q4:
> > Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> > Execution Time: 16051.483 ms
> > 
> > 
> > For Q4 I see a small increase in the Optimization phase, but a good
> > improvement in execution time. For Q1 the results are almost the same.

These queries are all simple enough that I'm not sure this is a particularly
good benchmark for optimization speed. In particular, the deform routines
don't have to deal with a lot of columns and there aren't a lot of functions
(although I guess that shouldn't really matter WRT simplifycfg).


Greetings,

Andres Freund

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-22 21:40   ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-28 07:56     ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-28 23:19       ` Re: [PATCH] llvmjit: always add the simplifycfg pass Andres Freund <[email protected]>
@ 2026-01-30 15:01         ` Pierre Ducroquet <[email protected]>
  2026-03-11 22:01           ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  0 siblings, 1 reply; 8+ messages in thread

From: Pierre Ducroquet @ 2026-01-30 15:01 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; Matheus Alcantara <[email protected]>; [email protected] <[email protected]>

On Thursday, January 29, 2026 at 12:19 AM, Andres Freund <[email protected]> wrote:

> Hi,
> 
> On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
> 
> > Here is a rebased version of the patch with a rewrite of the comment. Thank
> > you again for your previous review. FYI, I've tried adding other passes, but
> > none had a similar benefit-to-cost ratio. Further gains are more likely to
> > come from replacing O3 with an extensive list of passes.
> 
> 
> I agree that we should have a better list of passes. I'm a bit worried that
> having an explicit list of passes that we manage ourselves is going to be
> somewhat of a pain to maintain across LLVM versions, but ...
> 
> WRT passes that might be worth having even with -O0 - running duplicate
> function merging early on could be quite useful, particularly because we won't
> inline the deform routines anyway.
> 
> > > I ran some benchmarks on TPC-H queries 1 and 4 and got the results
> > > below. Note that for these tests I set jit_optimize_above_cost=1000000
> > > to force the default<O0> pipeline with simplifycfg.
> 
> 
> FYI, you can use -1 to just disable it, instead of having to rely on a specific
> cost.
> 
> > > Master Q1:
> > > Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> > > Execution Time: 38221.318 ms
> > > 
> > > Patch Q1:
> > > Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> > > Execution Time: 38257.797 ms
> > > 
> > > Master Q4:
> > > Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> > > Execution Time: 19512.134 ms
> > > 
> > > Patch Q4:
> > > Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> > > Execution Time: 16051.483 ms
> > > 
> > > For Q4 I see a small increase in the Optimization phase, but a good
> > > improvement in execution time. For Q1 the results are almost the same.
> 
> 
> These queries are all simple enough that I'm not sure this is a particularly
> good benchmark for optimization speed. In particular, the deform routines
> don't have to deal with a lot of columns and there aren't a lot of functions
> (although I guess that shouldn't really matter WRT simplifycfg).
> 

simplifycfg seems to do more on the deforming functions than I initially anticipated, which explains the performance benefits. I've written patches to our C code to generate better IR, but I ran into quite a puzzle.
The biggest gain I see in the generated amd64 code for a very simple query (SELECT * FROM demo WHERE a = 42) with simplifycfg is that it prevents spilling to the stack, doing what mem2reg was supposed to do.


Running opt -debug-pass-manager on a deform function, I get:
- with default<O0>,mem2reg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (56 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running analysis: TargetIRAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    %rax, -48(%rsp)                 # 8-byte Spill
        movq    32(%rdi), %rax
        movq    %rax, -40(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $4, %rax
        movq    %rax, -32(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $6, %rax
        movq    %rax, -24(%rsp)                 # 8-byte Spill
        movq    %rdi, %rax
        addq    $72, %rax
        movq    %rax, -16(%rsp)                 # 8-byte Spill
...



- with default<O0>,simplifycfg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movslq  %edx, %rdx
        addq    %rdx, %rcx
        movl    72(%rdi), %edx
...

- with default<O0>,simplifycfg,mem2reg

Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (46 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movb    $0, (%rsi)
...


So even when running only simplifycfg, the stack allocation goes away.
I am trying to figure that one out, but I suspect we are no longer doing the optimizations we thought we were doing with mem2reg only, hence the (surprising) speed gains with simplifycfg.


Note: 
Ubuntu LLVM version 19.1.7
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver5

* Re: [PATCH] llvmjit: always add the simplifycfg pass
  2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-22 20:27 ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-22 21:40   ` Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
  2026-01-28 07:56     ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
  2026-01-28 23:19       ` Re: [PATCH] llvmjit: always add the simplifycfg pass Andres Freund <[email protected]>
  2026-01-30 15:01         ` Re: [PATCH] llvmjit: always add the simplifycfg pass Pierre Ducroquet <[email protected]>
@ 2026-03-11 22:01           ` Matheus Alcantara <[email protected]>
  0 siblings, 0 replies; 8+ messages in thread

From: Matheus Alcantara @ 2026-03-11 22:01 UTC (permalink / raw)
  To: Pierre Ducroquet <[email protected]>; Andres Freund <[email protected]>; [email protected] <[email protected]>

On 30/01/26 12:01, Pierre Ducroquet wrote:
> On Thursday, January 29, 2026 at 12:19 AM, Andres Freund <[email protected]> wrote:
> 
>> Hi,
>>
>> On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
>>
>>> Here is a rebased version of the patch with a rewrite of the comment. Thank
>>> you again for your previous review. FYI, I've tried adding other passes, but
>>> none had a similar benefit-to-cost ratio. Further gains are more likely to
>>> come from replacing O3 with an extensive list of passes.
>>
>>
>> I agree that we should have a better list of passes. I'm a bit worried that
>> having an explicit list of passes that we manage ourselves is going to be
>> somewhat of a pain to maintain across llvm versions, but ...
>>
>> WRT passes that might be worth having even with -O0 - running duplicate
>> function merging early on could be quite useful, particularly because we won't
>> inline the deform routines anyway.
>>
>>>> I did some benchmarks on some TPCH queries (1 and 4) and I got these
>>>> results. Note that for these tests I set jit_optimize_above_cost=1000000
>>>> so that it is forced to use the default<O0> pipeline with simplifycfg.
>>
>> ...
>>
>> These queries are all simple enough that I'm not sure this is a particularly
>> good benchmark for optimization speed. In particular, the deform routines
>> don't have to deal with a lot of columns and there aren't a lot of functions
>> (although I guess that shouldn't really matter WRT simplifycfg).
>>
> 
> simplifycfg seems to do more things on the deforming functions than I anticipated initially, explaining the performance benefits. I've written patches to our C code to generate better IR, but I discovered quite a puzzle.
> The biggest gain I see on the generated amd64 code for a very simple query (SELECT * FROM demo WHERE a = 42) with simplifycfg is that it prevents spilling on the stack and it does what mem2reg was supposed to be doing.
> 
> 
> Running opt -debug-pass-manager on a deform function, I get:
> - with default<O0>,mem2reg
> 
> Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
> Running analysis: TargetLibraryAnalysis on deform_0_1
> Running pass: PromotePass on deform_0_1 (56 instructions)
> Running analysis: DominatorTreeAnalysis on deform_0_1
> Running analysis: AssumptionAnalysis on deform_0_1
> Running analysis: TargetIRAnalysis on deform_0_1
> 
> deform_0_1:                             # @deform_0_1
>          .cfi_startproc
> # %bb.0:                                # %entry
>          movq    24(%rdi), %rax
>          movq    %rax, -48(%rsp)                 # 8-byte Spill
>          movq    32(%rdi), %rax
>          movq    %rax, -40(%rsp)                 # 8-byte Spill
>          movq    %rdi, %rax
>          addq    $4, %rax
>          movq    %rax, -32(%rsp)                 # 8-byte Spill
>          movq    %rdi, %rax
>          addq    $6, %rax
>          movq    %rax, -24(%rsp)                 # 8-byte Spill
>          movq    %rdi, %rax
>          addq    $72, %rax
>          movq    %rax, -16(%rsp)                 # 8-byte Spill
> ...
> 
> 
> 
> - with default<O0>,simplifycfg
> 
> Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
> Running analysis: TargetLibraryAnalysis on deform_0_1
> Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
> Running analysis: TargetIRAnalysis on deform_0_1
> Running analysis: AssumptionAnalysis on deform_0_1
> 
> deform_0_1:                             # @deform_0_1
>          .cfi_startproc
> # %bb.0:                                # %entry
>          movq    24(%rdi), %rax
>          movq    32(%rdi), %rsi
>          movq    64(%rdi), %rcx
>          movq    16(%rcx), %rcx
>          movzbl  22(%rcx), %edx
>          movslq  %edx, %rdx
>          addq    %rdx, %rcx
>          movl    72(%rdi), %edx
> ...
> 
> - with default<O0>,simplifycfg,mem2reg
> 
> Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
> Running analysis: TargetIRAnalysis on deform_0_1
> Running analysis: AssumptionAnalysis on deform_0_1
> Running pass: PromotePass on deform_0_1 (46 instructions)
> Running analysis: DominatorTreeAnalysis on deform_0_1
> 
> deform_0_1:                             # @deform_0_1
>          .cfi_startproc
> # %bb.0:                                # %entry
>          movq    24(%rdi), %rax
>          movq    32(%rdi), %rsi
>          movq    64(%rdi), %rcx
>          movq    16(%rcx), %rcx
>          movzbl  22(%rcx), %edx
>          movb    $0, (%rsi)
> ...
> 
> 
> So even when running only simplifycfg, the stack allocation goes away.
> I am trying to figure that one out, but I suspect we are no longer doing the optimizations we thought we were doing with mem2reg only, hence the (surprising) speed gains with simplifycfg.
> 
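
To reproduce the pass-manager traces quoted above, the three pipelines can
be fed to opt one at a time. Below is a minimal Python sketch that only
builds the command lines. It assumes an opt binary from the LLVM
installation is on PATH; the file names deform.ll and out.ll are
hypothetical and do not come from this thread.

```python
import shlex

# Pass pipelines compared in the quoted message.
PIPELINES = [
    "default<O0>,mem2reg",
    "default<O0>,simplifycfg",
    "default<O0>,simplifycfg,mem2reg",
]


def opt_command(pipeline, ir_path="deform.ll", out_path="out.ll"):
    """Build an `opt` invocation (new pass manager syntax) as an argv list.

    ir_path and out_path are hypothetical file names.
    """
    return [
        "opt",
        f"-passes={pipeline}",
        "-debug-pass-manager",  # log each pass and analysis as it runs
        "-S",                   # emit textual IR instead of bitcode
        ir_path,
        "-o",
        out_path,
    ]


def shell_lines(pipelines=PIPELINES):
    """Render the commands with shell quoting (the pipelines contain < and >)."""
    return [shlex.join(opt_command(p)) for p in pipelines]


if __name__ == "__main__":
    for line in shell_lines():
        print(line)
```

The printed lines can be pasted into a shell. Producing the assembly
listings would additionally need a code generation step (llc or similar);
the options used for that step are not stated in the thread.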

I ran some tests to compare the IR output of different pass
combinations. For a query that deforms 6 columns, the raw IR contains
trivial empty blocks like this:

block.attr.0.attcheckalign:           ; preds = %block.attr.0.start
  br label %block.attr.0.align

block.attr.0.align:                   ; preds = %block.attr.0.attcheckalign
  br label %block.attr.0.store

block.attr.0.store:                   ; preds = %block.attr.0.align
  %26 = load i64, ptr %v_offp, align 8
  %27 = getelementptr i8, ptr %v_tupdata_base, i64 %26
  ...

With mem2reg only, the alloca is promoted but these empty blocks remain:

block.attr.0.attcheckalign:           ; preds = %block.attr.0.start
  br label %block.attr.0.align

block.attr.0.align:                   ; preds = %block.attr.0.attcheckalign
  br label %block.attr.0.store

block.attr.0.store:                   ; preds = %block.attr.0.align
  %25 = getelementptr i8, ptr %v_tupdata_base, i64 0
  ...
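
The effect on the load in block.attr.0.store can be mimicked with a toy
model. This is only a Python sketch of store-to-load forwarding in
straight-line code, not the algorithm LLVM's mem2reg implements (which
also handles control flow by inserting PHI nodes). The initial store of 0
to %v_offp is an assumption made for illustration, and the tuple encoding
of instructions is hypothetical.

```python
def promote_slot(insts, slot):
    """Forward values stored to `slot` into later loads (straight-line code only).

    Instructions are tuples: ("store", value, ptr), ("load", dest, ptr),
    or (opcode, dest, operand, ...) for anything else.
    """
    current = None  # value most recently stored to the slot (None = nothing stored yet)
    renamed = {}    # load result name -> forwarded value
    out = []
    for op, *args in insts:
        if op == "store" and args[1] == slot:
            current = renamed.get(args[0], args[0])
            continue  # the store itself disappears
        if op == "load" and args[1] == slot:
            renamed[args[0]] = current
            continue  # the load disappears, its uses get the stored value
        out.append((op, *[renamed.get(a, a) for a in args]))
    return out


example = [
    ("store", "0", "%v_offp"),  # assumed initial store, not shown in the thread
    ("load", "%26", "%v_offp"),
    ("gep", "%27", "%v_tupdata_base", "%26"),
]
promoted = promote_slot(example, "%v_offp")
```

In this sketch only the getelementptr survives, with the constant 0 as its
index. Unlike LLVM, the toy does not renumber the remaining values.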

With simplifycfg only, the trivial blocks are merged but the alloca is
not promoted:

block.attr.0.start:                               ; preds = %block.attr.0.attcheckattno
  %21 = getelementptr i8, ptr %8, i32 0
  %attnullbyte = load i8, ptr %21, align 1
  %22 = and i8 %attnullbyte, 1
  %attisnull = icmp eq i8 %22, 0
  %23 = and i1 %hasnulls, %attisnull
  br i1 %23, label %block.attr.0.attisnull, label %block.attr.0.store

block.attr.0.store:                               ; preds = %block.attr.0.start
  %26 = load i64, ptr %v_offp, align 8
  %27 = getelementptr i8, ptr %v_tupdata_base, i64 %26
  ...

With mem2reg,simplifycfg the alloca is promoted, the trivial blocks are
merged, and block.attr.0.start branches directly to block.attr.0.store:

block.attr.0.start:                   ; preds = %block.attr.0.attcheckattno
  %20 = getelementptr i8, ptr %8, i32 0
  %attnullbyte = load i8, ptr %20, align 1
  %21 = and i8 %attnullbyte, 1
  %attisnull = icmp eq i8 %21, 0
  %22 = and i1 %hasnulls, %attisnull
  br i1 %22, label %block.attr.0.attisnull, label %block.attr.0.store

block.attr.0.store:                   ; preds = %block.attr.0.start
  %25 = getelementptr i8, ptr %v_tupdata_base, i64 0
  ...

Since simplifycfg [1] can remove basic blocks and eliminate PHI nodes,
perhaps this lets more values stay in registers and avoids the stack
spills? It seems to me that the stack allocation going away in your
example may be a side effect of the simpler CFG allowing better register
allocation. However, I think mem2reg is still needed, since simplifycfg
alone doesn't promote allocas; the two passes complement each other.
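
The block merging described above can be illustrated with a toy model.
This is only a Python sketch under simplifying assumptions (no PHI nodes,
no cycle made up solely of forwarding blocks, and the entry block is not
a forwarding block). It is not what LLVM's SimplifyCFGPass implements,
which does considerably more. The block names are taken from the IR
above; the instruction strings are placeholders.

```python
def merge_trivial_blocks(cfg):
    """Remove blocks that only contain an unconditional branch.

    cfg maps block name -> (instructions, successors). A block is treated
    as trivial when it has no instructions and exactly one successor,
    i.e. it consists solely of `br label %next`.
    """
    def is_trivial(name):
        insts, succs = cfg[name]
        return not insts and len(succs) == 1

    def resolve(name):
        # Follow a chain of trivial blocks to the first non-trivial target.
        seen = set()
        while name in cfg and name not in seen and is_trivial(name):
            seen.add(name)
            name = cfg[name][1][0]
        return name

    simplified = {}
    for name, (insts, succs) in cfg.items():
        if is_trivial(name):
            continue  # drop the forwarding-only block
        simplified[name] = (insts, [resolve(s) for s in succs])
    return simplified


example_cfg = {
    "block.attr.0.start": (
        ["<null check>"],
        ["block.attr.0.attisnull", "block.attr.0.attcheckalign"],
    ),
    "block.attr.0.attcheckalign": ([], ["block.attr.0.align"]),
    "block.attr.0.align": ([], ["block.attr.0.store"]),
    "block.attr.0.store": (["<load and store>"], []),
    "block.attr.0.attisnull": (["<mark null>"], []),
}
simplified = merge_trivial_blocks(example_cfg)
```

In the result, block.attr.0.start lists block.attr.0.attisnull and
block.attr.0.store as successors, and the two forwarding blocks are gone,
which mirrors the shape of the simplifycfg output shown earlier.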

What do you think?

[1] https://llvm.org/docs/Passes.html#simplifycfg-simplify-the-cfg

--
Matheus Alcantara
EDB: https://www.enterprisedb.com






end of thread, other threads:[~2026-03-11 22:01 UTC | newest]

Thread overview: 8+ messages
2026-01-22 19:54 Re: [PATCH] llvmjit: always add the simplifycfg pass Matheus Alcantara <[email protected]>
2026-01-22 20:27 ` Pierre Ducroquet <[email protected]>
2026-01-22 21:40   ` Matheus Alcantara <[email protected]>
2026-01-28 07:56     ` Pierre Ducroquet <[email protected]>
2026-01-28 12:37       ` Matheus Alcantara <[email protected]>
2026-01-28 23:19       ` Andres Freund <[email protected]>
2026-01-30 15:01         ` Pierre Ducroquet <[email protected]>
2026-03-11 22:01           ` Matheus Alcantara <[email protected]>
