Date: Fri, 30 Jan 2026 15:01:52 +0000
To: Andres Freund, Matheus Alcantara, "pgsql-hackers@lists.postgresql.org"
From: Pierre Ducroquet
Subject: Re: [PATCH] llvmjit: always add the simplifycfg pass

On Thursday, January 29, 2026 at 12:19 AM, Andres Freund wrote:
> Hi,
>
> On 2026-01-28 07:56:46 +0000, Pierre Ducroquet wrote:
>
> > Here is a rebased version of the patch with a rewrite of the comment. Thank
> > you again for your previous review. FYI, I've tried adding other passes but
> > none had a similar benefit-to-cost ratio. The benefits could rather be in
> > changing from O3 to an extensive list of passes.
>
> I agree that we should have a better list of passes. I'm a bit worried that
> having an explicit list of passes that we manage ourselves is going to be
> somewhat of a pain to maintain across llvm versions, but ...
>
> WRT passes that might be worth having even with -O0 - running duplicate
> function merging early on could be quite useful, particularly because we won't
> inline the deform routines anyway.
>
> > > I did some benchmarks on some TPCH queries (1 and 4) and I got these
> > > results. Note that for these tests I set jit_optimize_above_cost=1000000
> > > so that it is forced to use the default pass with simplifycfg.
>
> FYI, you can use -1 to just disable it, instead of having to rely on a specific
> cost.
>
> > > Master Q1:
> > > Timing: Generation 1.553 ms (Deform 0.573 ms), Inlining 0.052 ms, Optimization 95.571 ms, Emission 58.941 ms, Total 156.116 ms
> > > Execution Time: 38221.318 ms
> > >
> > > Patch Q1:
> > > Timing: Generation 1.477 ms (Deform 0.534 ms), Inlining 0.040 ms, Optimization 95.364 ms, Emission 58.046 ms, Total 154.927 ms
> > > Execution Time: 38257.797 ms
> > >
> > > Master Q4:
> > > Timing: Generation 0.836 ms (Deform 0.309 ms), Inlining 0.086 ms, Optimization 5.098 ms, Emission 6.963 ms, Total 12.983 ms
> > > Execution Time: 19512.134 ms
> > >
> > > Patch Q4:
> > > Timing: Generation 0.802 ms (Deform 0.294 ms), Inlining 0.090 ms, Optimization 5.234 ms, Emission 6.521 ms, Total 12.648 ms
> > > Execution Time: 16051.483 ms
> > >
> > > For Q4 I see a small increase on Optimization phase but we have a good
> > > performance improvement on execution time. For Q1 the results are almost
> > > the same.
>
> These queries are all simple enough that I'm not sure this is a particularly
> good benchmark for optimization speed. In particular, the deform routines
> don't have to deal with a lot of columns and there aren't a lot of functions
> (although I guess that shouldn't really matter WRT simplifycfg).
>

simplifycfg seems to do more things on the deforming functions than I anticipated initially, explaining the performance benefits. I've written patches to our C code to generate better IR, but I discovered quite a puzzle.
The biggest gain I see on the generated amd64 code for a very simple query (SELECT * FROM demo WHERE a = 42) with simplifycfg is that it prevents spilling on the stack and it does what mem2reg was supposed to be doing.
Running opt -debug-pass-manager on a deform function, I get:

- with default,mem2reg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (56 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running analysis: TargetIRAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    %rax, -48(%rsp)         # 8-byte Spill
        movq    32(%rdi), %rax
        movq    %rax, -40(%rsp)         # 8-byte Spill
        movq    %rdi, %rax
        addq    $4, %rax
        movq    %rax, -32(%rsp)         # 8-byte Spill
        movq    %rdi, %rax
        addq    $6, %rax
        movq    %rax, -24(%rsp)         # 8-byte Spill
        movq    %rdi, %rax
        addq    $72, %rax
        movq    %rax, -16(%rsp)         # 8-byte Spill
        ...

- with default,simplifycfg

Running pass: AnnotationRemarksPass on deform_0_1 (56 instructions)
Running analysis: TargetLibraryAnalysis on deform_0_1
Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movslq  %edx, %rdx
        addq    %rdx, %rcx
        movl    72(%rdi), %edx
        ...

- with default,simplifycfg,mem2reg

Running pass: SimplifyCFGPass on deform_0_1 (56 instructions)
Running analysis: TargetIRAnalysis on deform_0_1
Running analysis: AssumptionAnalysis on deform_0_1
Running pass: PromotePass on deform_0_1 (46 instructions)
Running analysis: DominatorTreeAnalysis on deform_0_1

deform_0_1:                             # @deform_0_1
        .cfi_startproc
# %bb.0:                                # %entry
        movq    24(%rdi), %rax
        movq    32(%rdi), %rsi
        movq    64(%rdi), %rcx
        movq    16(%rcx), %rcx
        movzbl  22(%rcx), %edx
        movb    $0, (%rsi)
        ...

So even when running only simplifycfg, the stack allocation goes away.
I am trying to figure that one out, but I suspect we are no longer doing the optimizations we thought we were doing with mem2reg only, hence the (surprising) speed gains with simplifycfg.

Note:
Ubuntu LLVM version 19.1.7
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver5
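
As a side note on the benchmark figures quoted earlier in this mail, here is a minimal Python sketch that computes the relative change between master and the patch. The numbers are copied from the quoted timing output; the names pct_change, summarize and RESULTS are only illustrative and are not part of the patch.

```python
# Relative change between master and patch for the quoted TPC-H timings.
# All figures (in ms) are copied from the benchmark quoted in this mail.
# pct_change, summarize and RESULTS are illustrative names, not part of the patch.


def pct_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# (master, patch) pairs for the JIT optimization phase and total execution time.
RESULTS = {
    "Q1": {"optimization": (95.571, 95.364), "execution": (38221.318, 38257.797)},
    "Q4": {"optimization": (5.098, 5.234), "execution": (19512.134, 16051.483)},
}


def summarize(results):
    """Map each query and phase to its percentage change (negative means faster)."""
    summary = {}
    for query, phases in results.items():
        summary[query] = {
            phase: pct_change(before, after)
            for phase, (before, after) in phases.items()
        }
    return summary


if __name__ == "__main__":
    for query, phases in summarize(RESULTS).items():
        for phase, change in phases.items():
            print(f"{query} {phase}: {change:+.2f}%")
```

With these inputs, Q4 execution time drops by roughly 17.7% while its optimization time grows by under 3%, and both Q1 figures move by well under 1%.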