Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vluiO-005Exx-0N for pgsql-hackers@arkaria.postgresql.org; Fri, 30 Jan 2026 20:03:36 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vluiK-0063Ay-2H for pgsql-hackers@arkaria.postgresql.org; Fri, 30 Jan 2026 20:03:33 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vluiJ-0063Aj-26 for pgsql-hackers@lists.postgresql.org; Fri, 30 Jan 2026 20:03:33 +0000 Received: from fout-b7-smtp.messagingengine.com ([202.12.124.150]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vluiH-000DNF-18 for pgsql-hackers@lists.postgresql.org; Fri, 30 Jan 2026 20:03:31 +0000 Received: from phl-compute-10.internal (phl-compute-10.internal [10.202.2.50]) by mailfout.stl.internal (Postfix) with ESMTP id 5843A1D00067; Fri, 30 Jan 2026 15:03:28 -0500 (EST) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-10.internal (MEProxy); Fri, 30 Jan 2026 15:03:28 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1769803408; x=1769889808; bh=TCVzYoFuEy DjuhTMv8tohPqquJxzglfgbC9TBn7EdK0=; b=eaWvZonHRKaj8dnBD6Tg24TeqA aGFDSS6rWyc7kp9BgWwd9l9ltRlgke3/kMSMoBzi/DQ0LEZCIx/X2Zm5m+wXgx4l iXZ/X65bgjGtyZmHlXcrttLUucMNdQW4u8iA9Qhau8z2Rsf+m0Axw+wk1YGVwuN4 RPf3iUGEVTR+aPHWbK/9D7+GWqHm7h3CyVHJPHbg3SqW0CojVU+3v/gpYEGfSWSY CIChqj8vQ9iiJfGBbgR9KSd8/C2XUp+YvrO+yK7pgxDFzGdUAGfDWbkdhhDgoIo1 I1LoCZ94zTKgAcWNLONbt3hrEYiagGKrk1C8Xv4xb3Wp3WDfLHkaYlKD+pQg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t= 1769803408; x=1769889808; bh=TCVzYoFuEyDjuhTMv8tohPqquJxzglfgbC9 TBn7EdK0=; b=HCJ46p4fqgmqwUht0+4D5n1QK4ahEG9Y2XA+/1HPWADeIjYTK85 1wt8pnvhz/7TGJINIs2yOI4T0tHQ3NXAXCUbZlp8ivEoV4IkXpwSxNbPLTiFBp2a pEhNmSrGZNBEAEZzVcfVW/F239aT3fvgabaVEzQC16deiLiVoonAGRvEqWgOMRPy VqlYKfV9Z2ev8n8401dBDW9/q68aBXiirGtT+fmbp2DycrJ410lpr1/HC3mQHdnd RGIdRhnpkTdrvpe9fDBCzSATEBpbwhnstUaFS5mNFW3lP+GsBIoIMMPcnqfqdMfa dK1r30Qdyx0Ne8sax8C/7u/k3X49UpjQxSQ== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgdduieelleegucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucenucfjughrpeffhffvvefukfhfgggtuggjsehttdfstd dttddvnecuhfhrohhmpeetnhgurhgvshcuhfhrvghunhguuceorghnughrvghssegrnhgr rhgriigvlhdruggvqeenucggtffrrghtthgvrhhnpeeffffgledvffegtdevlefgtdeggf fhvdekgfegteeiveejkeetudelveejhfeugeenucevlhhushhtvghrufhiiigvpedtnecu rfgrrhgrmhepmhgrihhlfhhrohhmpegrnhgurhgvshesrghnrghrrgiivghlrdguvgdpnh gspghrtghpthhtohepfedpmhhouggvpehsmhhtphhouhhtpdhrtghpthhtohepughgrhho fihlvgihmhhlsehgmhgrihhlrdgtohhmpdhrtghpthhtoheplhhirdgvvhgrnhdrtghhrg hosehgmhgrihhlrdgtohhmpdhrtghpthhtohepphhgshhqlhdqhhgrtghkvghrsheslhhi shhtshdrphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Fri, 30 Jan 2026 15:03:27 -0500 (EST) Date: Fri, 30 Jan 2026 15:03:27 -0500 From: Andres Freund To: David Rowley Cc: Chao Li , PostgreSQL Developers Subject: Re: More speedups for tuple deformation Message-ID: References: <9A17C43D-7A28-4885-8974-555A40C9523E@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-01-30 12:11:44 -0500, Andres Freund wrote: > > In the attached 0004 patch I've experimented with this again. This > > time, I wrote a function that converts the null bitmap into the isnull > > array using a lookup table. > > Oh. I was just thinking of something roughly like > > int i nullbyte_i = attnum >> 3; > for (int nullcol = attnum; nullcol < natts; nullcol += 8) > { > bits8 nullbyte = bp[nullbyte_i++]; > > for (int onebyte = 0; onebyte < 8; onebyte++) > { > if (nullcol < natts) > tts_isnull[nullcol] = nullbyte & 0x01; > nullbyte >>= 1; > nullcol++; > } > } > > This isn't quite right, as we'd need to deal with starting to deform at a > attribute that's not % 8 = 0 (I'd probably just do the whole byte even if we'd > redo a few column)). And probably lots of other stuff. > > With a bit of care the inner loop should be unrollable with all the moves as > conditional moves depending on nullcol < natts. Or such. > > Or we could just make sure tts_isnull is always sized to be divisible by 8, > that'd presumably allow considerably better code to be generated. > > > I'd hope this would be more efficient than doing > > static inline bool > att_isnull(int ATT, const bits8 *BITS) > { > return !(BITS[ATT >> 3] & (1 << (ATT & 0x07))); > } > > for each column. I couldn't quite let go of this, thinking there had to be a non-simd way to do this efficiently (by basically doing SWAR). And I think there is: We can multiply one byte of the null bitmap with a carefully chosen value that spreads each bit into the higher bytes. E.g. 0b111 * (1 << 0) = 0b 111 0b111 * (1 << 7) = 0b 11_10000000 0b111 * ((1 << 0) | (1 << 7)) = 0b 11_10000111 0b111 * (1 << 14) = 0b1_11000000_00000000 0b111 * ((1 << 0) | (1 << 7) | ( << 14)) = 0b1_11000011_10000111 ... Then, we can fairly easily mask out the unnecessary bits. However, as maybe apparent above, this won't work for the input 0xff, as the carry from each byte's multiplications will "overflow its byte" and flip the next bit to 0. But that's not hard to handle: a) A single branch would probably be fine? b) Compute the low 4 bits and high 4 bits of the bitmap in parallel and or the result together. By just looking at 4 null bits, no carry overflow is possible. Due to being branchless, that's propbably better. Something like /* * The bits array has 1s for values and 0s for NULLs. Bit-flip that to * get 1s for NULLs. */ uint8_t nullbyte = ~bp[nullbyte_i++]; /* 8 bytes where each byte is 0 or 1 depending on whether null bitmap is set */ uint64_t isnull_8; /* * Multiplier ensuring that input bit 0 is reflected in output bit 0, input bit 1 at output bit 8, etc. * Other bits also will often be set and need to be masked away. */ uint32_t spread_bits_32 = (1U << 0) | (1U << 7) | (1U << 14) | (1U << 21); uint64_t mask_bits_64 = 0x0101010101010101ULL; /* convert the lower 4 bits of null bitmap word into 32 bit int */ isnull_8 = (nullbyte & 0xf) * spread_bits_32; /* convert the upper 4 bits of null bitmap word into 32 bit int, shift into the upper 32 bit */ isnull_8 |= ((uint64_t)((nullbyte >> 4) * spread_bits_32)) << 32; /* mask out all the bogus bits (could also be done as a 32bit op?)*/ isnull_8 &= mask_bits_64; memcpy(&tts_isnull[nullcol], &isnull_8, sizeof(isnull_8)); should, I think, be better than a 2kB array. Perhaps not quite as fast as the AVX512 method, but it should be decent on most hardware... Greetings, Andres Freund