From: KAZAR Ayoub
Date: Wed, 18 Mar 2026 03:29:32 +0100
Subject: Re: Speed up COPY TO text/CSV parsing using SIMD
To: Nathan Bossart
Cc: Andres Freund, Pg Hackers, Neil Conway, Manni Wood, Andrew Dunstan, Shinya Kato, Mark Wong, Nazir Bilal Yavuz

On Wed, Mar 18, 2026 at 12:02 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
> On Tue, Mar 17, 2026 at 7:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Sat, Mar 14, 2026 at 11:43:38PM +0100, KAZAR Ayoub wrote:
>> > Just a small concern about where some varlenas have a larger binary size
>> > than its text representation ex:
>> > SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
>> >  pg_column_size
>> > ----------------
>> >              32
>> >
>> > its text representation is less than sizeof(Vector8) so currently v3 would
>> > enter SIMD path and exit out just from the beginning (two extra branches)
>> > because it does this:
>> > + if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
>> > + VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))
>> >
>> > I thought maybe we could do * 2 or * 4 its binary size, depends on the type
>> > really but this is just a proposition if this case is something concerning.
>>
>> Can we measure the impact of this?  How likely is this case?
>
> I'll respond to this separately in a different email.
My example was already incorrect (the text representation is lexemes and positions, not the text as we read it; it's lossy), but the point still holds.
If we have some json(b) column like: {"key1":"val1","key2":"val2"}, then for CSV format this would immediately exit the SIMD path because of the quote character; for json(b) this is always going to be the case.
I measured the overhead of exiting the SIMD path a lot (8 million times for one COPY TO command), and I only found a 3% regression for this case, sometimes 2%.
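
To make the early-exit case concrete, here is a minimal standalone sketch of a chunked scan that gives up at the first chunk containing a special byte. It is not the patch's code: CHUNK_BYTES, chunk_has_special() and fast_path_prefix() are hypothetical names, the chunk size of 16 is an assumed stand-in for sizeof(Vector8), the set of special bytes (quote, delimiter, newline, carriage return) is an assumption about what the CSV path has to stop on, and the per-byte loop only stands in for a vector compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_BYTES 16			/* assumed stand-in for sizeof(Vector8) */

/*
 * Hypothetical helper: does this chunk contain a byte the CSV output path
 * would have to treat specially?  A real SIMD version would compare the
 * whole chunk at once; the loop here is only a portable stand-in.
 */
static bool
chunk_has_special(const char *p, char quote, char delim)
{
	for (size_t i = 0; i < CHUNK_BYTES; i++)
	{
		char		c = p[i];

		if (c == quote || c == delim || c == '\n' || c == '\r')
			return true;
	}
	return false;
}

/*
 * Returns how many leading bytes the chunked path can consume before it
 * has to hand over to the byte-at-a-time scalar loop.
 */
static size_t
fast_path_prefix(const char *s, size_t len, char quote, char delim)
{
	size_t		off = 0;

	assert(s != NULL);

	/* Only full chunks are examined; the tail is left to the scalar loop. */
	while (len - off >= CHUNK_BYTES &&
		   !chunk_has_special(s + off, quote, delim))
		off += CHUNK_BYTES;

	return off;
}
```

With the 29-byte JSON value above, fast_path_prefix(json, 29, '"', ',') returns 0 because the first chunk already contains a quote, so every such value pays for one failed chunk check and then runs entirely in the scalar loop. A 36-byte value with no special bytes returns 32.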

For cases where we falsely commit to the SIMD path because we read a binary size >= sizeof(Vector8), which I found very niche too, the short circuit to scalar each time is even more negligible (the above CSV JSON case is the absolute worst case).
So I don't think any of this should be a concern.
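
For reference, the false-commitment case can be sketched in isolation as below. This is only an illustration under stated assumptions: VECTOR_BYTES (16) is an assumed stand-in for sizeof(Vector8), gate_allows_simd() and is_false_commitment() are hypothetical names, and the gate merely mirrors the shape of the quoted v3 condition (varlena, payload larger than one vector).

```c
#include <assert.h>				/* assert() is used by the usage checks */
#include <stdbool.h>
#include <stddef.h>

#define VECTOR_BYTES 16			/* assumed stand-in for sizeof(Vector8) */

/*
 * Mirrors the shape of the quoted condition: take the SIMD path only for
 * varlena attributes whose binary payload is larger than one vector.
 */
static bool
gate_allows_simd(bool is_varlena, size_t binary_payload_len)
{
	return is_varlena && binary_payload_len > VECTOR_BYTES;
}

/*
 * The gate only sees the binary size.  If the text produced for output is
 * shorter than one vector, the SIMD loop has no full chunk to process and
 * control falls through to the scalar loop after a few extra branches.
 */
static bool
is_false_commitment(bool is_varlena, size_t binary_payload_len,
					size_t text_len)
{
	return gate_allows_simd(is_varlena, binary_payload_len) &&
		text_len < VECTOR_BYTES;
}
```

With made-up sizes, is_false_commitment(true, 28, 12) is true (the gate passes on 28 binary bytes but only 12 text bytes come out), while is_false_commitment(true, 28, 40) is false. The extra cost in the true case is one length comparison before the scalar loop takes over.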


Regards,
Ayoub