MIME-Version: 1.0
References: <aPkvi5P7kpA8oQKc@nathan>
 <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net>
 <CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
 <CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
 <CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com>
 <aRue0D4QQkUf2B_N@nathan>
 <CAOzEurTHCGL-Txqf5rxMsPgTF=dTCOsr=uhJdXebqjEJy-0L7g@mail.gmail.com>
 <CAN55FZ0+JZvKYVCnJqLhHaWF9eBGmTaF1BCEpttxw1aT3G_+Qw@mail.gmail.com>
 <8e226753-57af-489a-bfbe-caa23dd71286@dunslane.net>
 <CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
 <aR4wDwNdLc5TmcQq@nathan>
 <CA+K2Rump8NoMRZRZ2r4jHXUJwByasy_c3_b0oaO+TLkSbMD-jw@mail.gmail.com>
 <CAKWEB6rLxPVtN4ffZ3CMTL518zhk_BWzzBt6ZE2oUSaErdphxA@mail.gmail.com>
 <CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com>
 <CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com>
 <CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
In-Reply-To: 
 <CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Tue, 9 Dec 2025 16:13:02 -0600
Message-ID: 
 <CAKWEB6oZdQhhBV3ojHLBwjQgKzfDw0fkqncurt9oi7vNsq41ww@mail.gmail.com>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Bilal Yavuz <byavuz81@gmail.com>
Cc: KAZAR Ayoub <ma_kazar@esi.dz>, Nathan Bossart <nathandbossart@gmail.com>,
	Andrew Dunstan <andrew@dunslane.net>, Shinya Kato <shinya11.kato@gmail.com>,
	PostgreSQL-development <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000c22d6806458c37b5"
Archived-At: 
 <https://www.postgresql.org/message-id/CAKWEB6oZdQhhBV3ojHLBwjQgKzfDw0fkqncurt9oi7vNsq41ww%40mail.gmail.com>
Precedence: bulk

--000000000000c22d6806458c37b5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 9, 2025 at 7:40 AM Bilal Yavuz <byavuz81@gmail.com> wrote:

> Hi,
>
> On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Hi,
> >
> > On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
> > > Hello, all.
> > >
> > > Andrew, I tried your suggestion of just reading the first chunk of the
> > > copy file to determine if SIMD is worth using. Attached are v4 versions of
> > > the patches showing a first attempt at doing that.
> >
> > Thank you for doing this!
> >
> > > I attached test.sh.txt to show how I've been testing, with 5 million
> > > lines of the various copy file variations introduced by Ayoub Kazar.
> > >
> > > The text copy with no special chars is 30% faster. The CSV copy with
> > > no special chars is 48% faster. The text with 1/3rd escapes is 3% slower.
> > > The CSV with 1/3rd quotes is 0.27% slower.
> > >
> > > This set of patches follows the simplest suggestion of just testing
> > > the first N lines (actually first N bytes) of the file and then deciding
> > > whether or not to enable SIMD. This set of patches does not follow Andrew's
> > > later suggestion of maybe checking again every million lines or so.
> >
> > My input-generation script is not ready to share yet, but the inputs
> > follow this format: text_${n}.input, where n represents the number of
> > normal characters before the delimiter. For example:
> >
> > n = 0 -> "\n\n\n\n\n..." (no normal characters)
> > n = 1 -> "a\n..." (1 normal character before the delimiter)
> > ...
> > n = 5 -> "aaaaa\n..."
> > … continuing up to n = 32.
> >
> > Each line has 4096 chars and there are a total of 100000 lines in each
> > input file.
> >
> > I only benchmarked the text format. I compared the latest heuristic I
> > shared [1] with the current method. The benchmarks show a roughly 16%
> > regression in the worst case (n = 2), with regressions up to n = 5.
> > For the remaining values, performance was similar.
>
> I tried to improve the v4 patchset. My changes are:
>
> 1 - I made CopyReadLineText() an inline function and passed the
> use_simd variable as an argument so that inlining can optimize on it.
>
> 2 - The main for loop in the CopyReadLineText() function runs many
> times, so I moved the use_simd check to the CopyReadLine() function.
>
> 3 - Instead of 'bytes_processed', I used 'chars_processed' because
> cstate->bytes_processed is incremented before the bytes are actually
> processed, which can cause wrong results.
>
> 4 - Because of #2 and #3, instead of having
> 'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed /
> special_chars_encountered' to determine whether we want to use SIMD.
>
> 5 - cstate->special_chars_encountered is incremented incorrectly for
> the CSV case: it is not incremented for the quote and escape
> characters. I moved all increments of
> cstate->special_chars_encountered to one central place and tried to
> optimize it, but it still causes a regression because it adds one more
> branch.
>
> With these changes, I am able to decrease the regression from 16% to
> 10%. The regression decreases to 7% if I modify #5 for text input only,
> but I did not do that.
>
> My changes are in 0003.
>
> --
> Regards,
> Nazir Bilal Yavuz
> Microsoft
>

Bilal Yavuz (Nazir Bilal Yavuz?), I did not get a chance to do any work on
this today, but wanted to thank you for finding my logic errors in counting
special chars for CSV, and hacking on my naive solution to make it faster.
By attempting Andrew Dunstan's suggestion, I got a better feel for the
reality that the "housekeeping" code produces a significant amount of
overhead.
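
For readers following the thread, here is a minimal sketch of the
ratio-based check described in item 4 of the quoted message: decide
whether SIMD is worthwhile from the number of characters processed per
special character seen in the sample. The struct, the function name,
and the threshold value are hypothetical and are not taken from the
0003 patch; only the two counter names come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical threshold: prefer SIMD only when, on average, at least
 * this many characters are processed per special character. The value
 * is illustrative, not one proposed in the thread.
 */
#define MIN_CHARS_PER_SPECIAL 16

/* Hypothetical stand-in for the counters kept in the COPY state. */
typedef struct SampleCounters
{
	uint64_t	chars_processed;
	uint64_t	special_chars_encountered;
} SampleCounters;

/*
 * Return true if the sampled input looks sparse enough in special
 * characters for the SIMD path to pay off.
 */
static bool
simd_looks_worthwhile(const SampleCounters *c)
{
	assert(c != NULL);

	/* No special characters seen: nothing interrupts a SIMD scan. */
	if (c->special_chars_encountered == 0)
		return true;

	/*
	 * Equivalent to chars_processed / special_chars_encountered >=
	 * MIN_CHARS_PER_SPECIAL, written as a multiplication so the check
	 * avoids a division.
	 */
	return c->chars_processed >=
		c->special_chars_encountered * MIN_CHARS_PER_SPECIAL;
}
```

With these assumed numbers, a 4096-character sample containing 100
special characters would keep SIMD enabled, while the same sample with
about one third special characters would fall back to the scalar path.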
-- 
-- Manni Wood EDB: https://www.enterprisedb.com

--000000000000c22d6806458c37b5--