MIME-Version: 1.0
References: 
 <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
 <CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
 <CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com>
 <CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
 <CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
 <CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com>
 <CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
 <CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
 <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
 <CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
 <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net>
 <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
 <CA+K2RunFNDMxCWMX3PFSBa_r6REVwfEekaKHwg1C8KYYGePsnA@mail.gmail.com>
 <CAN55FZ3e31ddFyf7XHW5G3ytuQwcXpetsb3wkx6q9oSp_zekhQ@mail.gmail.com>
In-Reply-To: 
 <CAN55FZ3e31ddFyf7XHW5G3ytuQwcXpetsb3wkx6q9oSp_zekhQ@mail.gmail.com>
From: KAZAR Ayoub <ma_kazar@esi.dz>
Date: Tue, 21 Oct 2025 08:17:01 +0200
Message-ID: 
 <CA+K2RumH-b=3-v0rfQ-oAbuQFxY8JLSSpVhmaJn+gRnX3t1_vg@mail.gmail.com>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nazir Bilal Yavuz <byavuz81@gmail.com>,
	"nathandbossart@gmail.com" <nathandbossart@gmail.com>,
	"ants.aasma@cybertec.at" <ants.aasma@cybertec.at>
Cc: Andrew Dunstan <andrew@dunslane.net>,
 Shinya Kato <shinya11.kato@gmail.com>,
	pgsql-hackers@postgresql.org
Content-Type: multipart/alternative; boundary="00000000000092db450641a52696"
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BK2RumH-b%3D3-v0rfQ-oAbuQFxY8JLSSpVhmaJn%2BgRnX3t1_vg%40mail.gmail.com>
Precedence: bulk

--00000000000092db450641a52696
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Oct 18, 2025 at 10:01 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> Thank you so much for doing this! The results look nice, do you think
> there are any other benchmarks that might be interesting to try?
>

> > I'm also trying the idea of doing SIMD inside quotes with prefix XOR
> > using carry less multiplication avoiding the slow path in all cases
> > even with weird looking input, but it needs to take into consideration
> > the availability of PCLMULQDQ instruction set with <wmmintrin.h> and
> > here we go, it quickly starts to become dirty OR we can wait for the
> > decision to start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
>
> I can not quite picture this, would you mind sharing a few examples or
> patches?
>
The idea is to avoid stopping at characters that are not actually special
in their position (inside quotes, escaped, etc.).
This is done by building several bitmasks from the original chunk, such as
quote_mask, escape_mask and a mask of odd-length escape sequences; from
these we can deduce which quotes are escaped and therefore not worth
stopping at.
Then we need to know which characters in the chunk are inside quotes
(while also carrying over the quote state from the previous chunk), and
there is a clever and fast way to do that [1].
After this you match LF, CR, etc., all while maintaining the state of what
you have seen so far (the annoying part).
In the end you only reach the scalar path after advancing to the position
of the first real special character that requires special treatment.
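
To make the steps above a bit more concrete, here is a minimal portable
sketch, not the code from [2]. It assumes CSV mode where QUOTE and ESCAPE
are both '"', so a doubled quote simply toggles the state twice and no
separate escape mask is needed; it only looks for LF and CR; and it builds
the masks with a scalar loop as a stand-in for an SSE2 compare + movemask.
All names (match_mask, prefix_xor16, first_real_special) are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16

/* Bit i is set when chunk[i] == c (scalar stand-in for compare + movemask). */
static uint16_t
match_mask(const char *chunk, char c)
{
    uint16_t m = 0;

    for (int i = 0; i < CHUNK; i++)
        if (chunk[i] == c)
            m |= (uint16_t) (1u << i);
    return m;
}

/* Prefix XOR: bit i of the result is the XOR of bits 0..i of m. */
static uint16_t
prefix_xor16(uint16_t m)
{
    m ^= (uint16_t) (m << 1);
    m ^= (uint16_t) (m << 2);
    m ^= (uint16_t) (m << 4);
    m ^= (uint16_t) (m << 8);
    return m;
}

/*
 * Return the offset of the first LF or CR that is outside quotes in a
 * 16-byte chunk, or CHUNK if there is none.  *in_quote (0 or 1) carries
 * the quote state from the previous chunk and is updated for the next one.
 */
static int
first_real_special(const char *chunk, int *in_quote)
{
    uint16_t quotes;
    uint16_t eol;
    uint16_t inside;
    uint16_t real;

    assert(in_quote != NULL && (*in_quote == 0 || *in_quote == 1));

    quotes = match_mask(chunk, '"');
    eol = (uint16_t) (match_mask(chunk, '\n') | match_mask(chunk, '\r'));

    /* Bits between an opening and a closing quote become 1. */
    inside = prefix_xor16(quotes);

    /* If the previous chunk ended inside a quote, the meaning is flipped. */
    if (*in_quote)
        inside = (uint16_t) ~inside;

    /* The state after the last byte is the carry for the next chunk. */
    *in_quote = (inside >> 15) & 1;

    /* Keep only the line endings that are not inside quotes. */
    real = (uint16_t) (eol & ~inside);
    for (int i = 0; i < CHUNK; i++)
        if (real & (1u << i))
            return i;
    return CHUNK;
}
```

For the chunk a,"b<LF>c",d<LF>efghij this skips the LF at offset 4 (inside
quotes) and reports offset 9, which is where the scalar path would take
over. Handling a separate ESCAPE character is exactly where the extra
masks and the messy state tracking come in.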

However, after trying to implement this on top of the existing COPY
pipeline [2] (a broken, hopeless attempt, but it shows the idea), it
turned out to be unreasonable for several reasons:
- It is very challenging to correctly handle commas inside quoted fields
and to track quoted vs. unquoted state (especially across chunk boundaries
or with escaped quotes).
- Using carry-less multiplication (CLMUL) for a prefix XOR on a 16-byte
chunk is overkill on some architectures, where PCLMULQDQ latency is high
[3][4] to the point that it performs worse than unrolled shifts + XOR (5
cycles).
- It starts to feel that handling these cases is inherently scalar. Doing
all that work for a 16-byte chunk is not free, so it would be unreasonable
compared to the simpler approach of SIMD plus Nazir's heuristic, which is
way nicer in general.
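
For reference, the two ways of computing the prefix XOR mentioned in the
second point look roughly like this for a 16-bit mask. This is only a
sketch with made-up function names: the CLMUL variant is compiled only
when the compiler defines __PCLMUL__ (e.g. gcc/clang with -mpclmul), and
I'm not claiming anything about relative speed here, that depends on the
CPU [3][4].

```c
#include <assert.h>
#include <stdint.h>

#if defined(__PCLMUL__)
#include <emmintrin.h>
#include <wmmintrin.h>
#endif

/* Reference: bit i of the result is the XOR of bits 0..i of m. */
static uint16_t
prefix_xor_naive(uint16_t m)
{
    uint16_t out = 0;
    unsigned parity = 0;

    for (int i = 0; i < 16; i++)
    {
        parity ^= (m >> i) & 1u;
        out |= (uint16_t) (parity << i);
    }
    return out;
}

/* Four shift + XOR steps cover 16 bits (distances 1, 2, 4, 8). */
static uint16_t
prefix_xor_shift(uint16_t m)
{
    m ^= (uint16_t) (m << 1);
    m ^= (uint16_t) (m << 2);
    m ^= (uint16_t) (m << 4);
    m ^= (uint16_t) (m << 8);
    return m;
}

#if defined(__PCLMUL__)
/* Carry-less multiply by all-ones yields the prefix XOR in the low bits. */
static uint16_t
prefix_xor_clmul(uint16_t m)
{
    __m128i a = _mm_cvtsi32_si128(m);
    __m128i ones = _mm_cvtsi32_si128(0xFFFF);
    __m128i p = _mm_clmulepi64_si128(a, ones, 0x00);

    return (uint16_t) _mm_cvtsi128_si32(p);
}
#endif

/* Exhaustive self-check over all 16-bit masks. */
static void
check_prefix_xor(void)
{
    for (unsigned m = 0; m <= 0xFFFF; m++)
    {
        assert(prefix_xor_shift((uint16_t) m) == prefix_xor_naive((uint16_t) m));
#if defined(__PCLMUL__)
        assert(prefix_xor_clmul((uint16_t) m) == prefix_xor_naive((uint16_t) m));
#endif
    }
}
```

With only 16 mask bits per chunk the shift version needs just four steps,
which is why the CLMUL round trip through a vector register is hard to
justify at this chunk size.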

Currently we are at 200-400 Mbps, which isn't that terrible compared to
production-grade and non-production-grade parsers (of course, in our case
we do more than just parse). Also, we are using SSE2 only, so in theory
adding AVX support later on should give even better numbers.
Maybe more micro-optimizations to the current heuristic can squeeze out a
bit more.


[1] https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[2] https://github.com/AyoubKaz07/postgres/commit/73c6ecfedae4cce5c3f375fd6074b1ca9dfe1daf
[3] https://agner.org/optimize/instruction_tables.pdf
[4] https://www.uops.info/table.html

Regards,
Ayoub Kazar.

--00000000000092db450641a52696--