Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vvJGF-00D6Hx-2y for pgsql-hackers@arkaria.postgresql.org; Wed, 25 Feb 2026 18:05:24 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vvJGC-0087Qt-1g for pgsql-hackers@arkaria.postgresql.org; Wed, 25 Feb 2026 18:05:20 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vvJGB-0087Qk-1C for pgsql-hackers@lists.postgresql.org; Wed, 25 Feb 2026 18:05:20 +0000 Received: from fout-b5-smtp.messagingengine.com ([202.12.124.148]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vvJG7-000000019pq-437F for pgsql-hackers@lists.postgresql.org; Wed, 25 Feb 2026 18:05:18 +0000 Received: from phl-compute-02.internal (phl-compute-02.internal [10.202.2.42]) by mailfout.stl.internal (Postfix) with ESMTP id A5A8C1D001BD; Wed, 25 Feb 2026 13:05:15 -0500 (EST) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-02.internal (MEProxy); Wed, 25 Feb 2026 13:05:15 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1772042715; x=1772129115; bh=57GUAWZpn0 vdvCiVEIeCHvOEru8PEnhzgyMgFB7Q+3E=; b=YICB3W4TSz4C2u2twPORz47+HJ 02zpxnJGkNneEXkRkhQBnFMyfvhjA2XubU+XcBM8rVgbj1CTnEp7NrHpgFbh5EW5 eB3q/ft5xpjxlGpa+JeiM2u41K+KEB4CLDLeix79Z2iEgHvxGdeJWF3IVVA/qjz3 AmMHPQhSx7E7rdeXtr6qOLuuE5Dbky+1LMvHiLQz4ig5xxbSDNlOKNqlrR906l0C KvS2p3mlRSpmUXaPRCHzmCZgVMFTNOVB/iO9TjKZLNOfB86u2WbB3LsFjT9cyrLw FMZ45EoHrcWHlx3alLktr97ZHwaa/sn7KcL3nO8osgf9SF+pvMGI+d5OdVgQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t= 1772042715; x=1772129115; bh=57GUAWZpn0vdvCiVEIeCHvOEru8PEnhzgyM gFB7Q+3E=; b=W2dYvK7MV5b47yIW6Nw5ypylblCMjjtVhePOcP/kD/LJrnMYBuW I3rmwo7wUwh7m/p7PW4VDfkswkzPzMfzLFxbyg9yNhIk5lAQUhmjM4fSbu92T19H /Xor8VqDzjhJlkpZXuCP+S+JO5P1xVmENHYJOhrOFF+RdEr8veZC8jbRDmk5y3ps W84LW5eqTXnF6YcAWk4jgDjhHRGOM1rleCyF3/0r/VFBaBJtYJPzoXmFleE+6jAK v8vkxs9LvXOQQWCivWYhb3RVp9ksRa20OK2XSFZ99D/sfdlrh8xkh8Xe+Hoa+goD ld+dsFOUm+vcLtm/odY3xfOFwK9LNFxGzDg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvgeefjeejucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucenucfjughrpeffhffvvefukfhfgggtuggjsehttdfstd dttddvnecuhfhrohhmpeetnhgurhgvshcuhfhrvghunhguuceorghnughrvghssegrnhgr rhgriigvlhdruggvqeenucggtffrrghtthgvrhhnpeeffffgledvffegtdevlefgtdeggf fhvdekgfegteeiveejkeetudelveejhfeugeenucevlhhushhtvghrufhiiigvpedtnecu rfgrrhgrmhepmhgrihhlfhhrohhmpegrnhgurhgvshesrghnrghrrgiivghlrdguvgdpnh gspghrtghpthhtohepgedpmhhouggvpehsmhhtphhouhhtpdhrtghpthhtohepughgrhho fihlvgihmhhlsehgmhgrihhlrdgtohhmpdhrtghpthhtohepjhhohhhntghnrgihlhhorh hlshesghhmrghilhdrtghomhdprhgtphhtthhopehlihdrvghvrghnrdgthhgrohesghhm rghilhdrtghomhdprhgtphhtthhopehpghhsqhhlqdhhrggtkhgvrhhssehlihhsthhsrd hpohhsthhgrhgvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 25 Feb 2026 13:05:15 -0500 (EST) Date: Wed, 25 Feb 2026 13:05:14 -0500 From: Andres Freund To: David Rowley Cc: John Naylor , Chao Li , PostgreSQL Developers Subject: Re: More speedups for tuple deformation Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-02-25 13:59:53 +1300, David Rowley wrote: > On Wed, 25 Feb 2026 at 03:39, Andres Freund wrote: > > ISTM we should just merge 0004. In my testing it's a very clear win, without, > > afaict, any downsides. > > I'd like to get them in in sequence as I believe 0004 buys back some > extra overheads such as the Min()s in slot_deform_heap_tuple(). If I > were to do 0004 first, then wait a while, it might look more like I'm > introducing a small regression. I'd rather get more stuff merged out of the way. We can handle a small regression by comparing to 18 instead. I think we tend to be too worried about intra-master regressions, and not worried enough about intra branch regressions. But FWIW, I don't actually see any regressions due to the Min() on my hardware as the patches stand. I do actually see a small regression in some cases due to 0004, but that's easy to fix: /* Fetch any missing attrs and raise an error if reqnatts is invalid */ if (unlikely(attnum < reqnatts)) slot_getmissingattrs(slot, attnum, reqnatts); done: /* Save current offset for next execution */ *offp = off; This forces pushing a bunch of stack state before the call to slot_getmissingattrs, doing the call, returning, restoring state from the stack, just to then jump to the end of the function to set *offp which then again does a bunch of stack restoring and returns. If you instead make it something like: /* Fetch any missing attrs and raise an error if reqnatts is invalid */ if (unlikely(attnum < reqnatts)) { *offp = off; slot_getmissingattrs(slot, attnum, reqnatts); return; } (or just duplicate the *offp = off in the goto done; case) the overhead seems to be removed. At least gcc is doing some truly weird shit in the firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before the first entrance to the loop) , which leads to the register pressure being high, which leads to spilling on the stack, making the few-tuples case slower: r9 is tts_nvalid, r11 is firstNonGuaranteedAttr, r15 is slot->tts_tupleDescriptor, rbx is tts_values, -0x10(%rsp) is tts_isnull, rax is tup. if (attnum < firstNonGuaranteedAttr) cmp %r9d,%r11d jle lea 0x0(,%r9,8),%r12 multiply r9/tts_nvalid by 8, store in r12 sub %r9d,%r11d r11=firstNonGuaranteedAttr-tts_nvalid xor %ecx,%ecx Set ecx to 0 lea 0x20(%r15,%r12,1),%rdx compute (char *) tupleDesc->compact_attrs + r12, i.e. the address of the first needed compact attribute add %rbx,%r12 compute (char*) tts_values + tts_nvalid * sizeof(Datum) mov -0x10(%rsp),%rbx restore tts_isnull from stack lea (%rbx,%r9,1),%rsi compute tts_isnull[tts_nvalid] jmp Then the loop body: ... movb $0x0,(%rsi,%rcx,1) set (tts_isnull + tts_nvalid)[i] = 0 mov %rdx,%r13 movswl (%rdx),%edi load cattr->attcacheoff into rdi/edi movzwl 0x2(%rdx),%r10d load attlen into r10 mov %rdi,%rbx add %rax,%rdi rdi = tup + attcacheoff [bunch of attlen related branches omitted] movslq (%rdi),%rdi store *(tup + off) in rdi mov %rdi,(%r12,%rcx,8) store rdi in tts_values[i] lea 0x1(%rcx),%rdi increment i by one, in a weird way, store in rdi add $0x8,%rdx increment cattr by 8 cmp %rdi,%r11 check if i < (firstNonGuaranteedAttr-tts_nvalid) I.e. the compiler creates an offset version of tts_values[tts_nvalid], tts_isnull[tts_nvalid], which then creates register allocation pressure, because later the original tts_values/tts_isnulll etc are accessed again and thus the underlying registers are preserved. And this is all for zero gain, from what I can tell, because the acceses are still done with indexed addressing (like mov %rdi,(%r12,%rcx,8)), which would work just as well if rcx were indexed based on attnum, not zero indexed within the loop. I see about a 10% improvement if I dissuade the compiler from doing that by adding __asm__ volatile ("" : "+r"(attnum) : :); In the loop body. I'm getting to the point where I'd like to just hand write the assembler for this stupid function. Gah. > > > I'm not getting great results from benchmarking the 0005 patch. I > > > verified that gcc does access the array without calculating the > > > element address from scratch each time and calculates it once, then > > > increments the pointer by sizeof(CompactAttribute). See the attached > > > .csv for the results on the 3 machines I tested on. > > > > FWIW, where I had seen that be rather beneficial is the TupleDescCompactAttr() > > at the start of the various loops, where the compiler has little choice to > > compute the address of the tupdesc->compact_attrs[firstNeededCol]. That > > matters only when only deforming a small number of columns, of course. > > oh ok. I wasn't aware that LEA's scaling factor can only be 1,2 4 or > 8. With the 8-byte struct, the compiler should be able to do the shift > and add as one operation, whereas with the 16-byte struct would > require a separate shift and add. > > Looking at the generated code, with 0004, I see: > > 1c79: 48 c1 e2 04 shl rdx,0x4 > 1c7d: 48 8d 4c 15 20 lea rcx,[rbp+rdx*1+0x20] > > whereas with 0005 I see: > > 1c6b: 4a 8d 1c dd 00 00 00 lea rbx,[r11*8+0x0] > > Is that what you meant? Yep. > > > I've also resequenced the patches to make the deform_bench test module > > > part of the 0001 patch. This makes it easier to test the performance > > > of master. > > > > What are your thoughts about merging the deform_bench tooling? I wonder if we > > should have src/test/modules/benchmark_tools or such, so we can add a few more > > micro-benchmarky tools over time? > > I'd like to see us give these tools a proper home. It helps lower the > bar for anyone else who'd like to experiment at some future date, and > also allows people to more easily test for performance regressions if > they're forced to change related code. I've also got a tool that > benchmarks the MemoryContext code which I keep in some local repo that > I dig out from time to time. Given that, it's probably unlikely > deform_bench would be the only extension in there if we did make a > directory for these. I'd probably lean towards one extensions with different functions for different purposes, but that's boring details. > On the otherhand, it does add some maintenance overhead, but IMO, > helping to ensure various key routines are optimal is a worthy enough > cause to make the maintenance overhead worthwhile. Agreed. Greetings, Andres Freund