MIME-Version: 1.0
References: 
 <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
 <CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
 <CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com>
 <CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
 <CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
In-Reply-To: 
 <CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
From: KAZAR Ayoub <ma_kazar@esi.dz>
Date: Thu, 14 Aug 2025 03:24:50 +0100
Message-ID: 
 <CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Shinya Kato <shinya11.kato@gmail.com>
Cc: Nazir Bilal Yavuz <byavuz81@gmail.com>, pgsql-hackers@postgresql.org
Content-Type: multipart/alternative; boundary="000000000000056b3b063c49fbbb"
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BK2RumC79NwWxBdofHOYo8SCSs0YCJic05Du%3DxOszRmoPf9FA%40mail.gmail.com>
Precedence: bulk

--000000000000056b3b063c49fbbb
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Following Nazir's findings about 4096 bytes being the performant line
length, I ran more benchmarks on my side for both TEXT and CSV formats,
covering two cases: normal data (no special characters) and data with
many special characters.

Results for the normal-data case are good, as expected, and similar to
previous benchmarks:
 ~30.9% faster copy in TEXT format
 ~32.4% faster copy in CSV format
 20%-30% reduction in cycles per instruction

When lines contain a lot of special characters (e.g., tables with a
large number of columns, maybe), we obviously expect regressions because
of the overhead of many fallbacks to scalar processing.
Results when special characters make up about 1/3 of the line length:
~43.9% slower copy in TEXT format
~16.7% slower copy in CSV format
So even with fewer special characters, or with wider spacing between
them, there might still be some regressions. This may not be a
significant case, but it can be handled in other patches if we decide to
skip the SIMD path in some situations.
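
To make the fallback cost concrete, here is a rough, portable sketch (not
code from the patch). It uses an 8-byte SWAR word as a stand-in for a real
SIMD register, and the names chunk_has_byte and count_scalar_steps are
made up for illustration. It counts how many bytes end up being handled
one at a time:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8                         /* bytes examined per wide step */
#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* True if any byte of v equals c (the classic SWAR "has zero byte" test). */
static int
chunk_has_byte(uint64_t v, unsigned char c)
{
    uint64_t    x = v ^ (ONES * c);     /* matching bytes become 0x00 */

    return ((x - ONES) & ~x & HIGHS) != 0;
}

/*
 * Walk buf and return how many bytes were handled one at a time.
 * A chunk is skipped in a single step only when it contains no
 * delimiter, backslash, or newline; otherwise we fall back to a
 * scalar step and retry the wide check from the next byte.
 */
static size_t
count_scalar_steps(const char *buf, size_t len, char delim)
{
    size_t      i = 0;
    size_t      scalar_steps = 0;

    while (i < len)
    {
        if (len - i >= CHUNK)
        {
            uint64_t    v;

            memcpy(&v, buf + i, CHUNK);
            if (!chunk_has_byte(v, (unsigned char) delim) &&
                !chunk_has_byte(v, '\\') &&
                !chunk_has_byte(v, '\n'))
            {
                i += CHUNK;             /* clean chunk: skipped at once */
                continue;
            }
        }
        /* special byte in range (or short tail): process a single byte */
        scalar_steps++;
        i++;
    }
    return scalar_steps;
}
```

With one special byte per chunk, every byte pays for both the wide check
and the scalar step, which is the kind of overhead behind the numbers
above. The quoted patch hunk below jumps straight to the first special
byte using the mask position, which this sketch does not model.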

I hope this helps and further supports the patch.

Regards,
Ayoub Kazar

On Thu, 14 Aug 2025 at 01:55, Shinya Kato <shinya11.kato@gmail.com> wrote:

> On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com>
> wrote:
>
> > > +         * However, SIMD optimization cannot be applied in the following cases:
> > > +         * - Inside quoted fields, where escape sequences and closing quotes
> > > +         *   require sequential processing to handle correctly.
> > >
> > > I think you can continue SIMD inside quoted fields. Only important
> > > thing is you need to set last_was_esc to false when SIMD skipped the
> > > chunk.
> >
> > That's a clever point that last_was_esc should be reset to false when
> > a SIMD chunk is skipped. You're right about that specific case.
> >
> > However, the core challenge is not what happens when we skip a chunk,
> > but what happens when a chunk contains special characters like quotes
> > or escapes. The main reason we avoid SIMD inside quoted fields is that
> > the parsing logic becomes fundamentally sequential and
> > context-dependent.
> >
> > To correctly parse a "" as a single literal quote, we must perform a
> > lookahead to check the next character. This is an inherently
> > sequential operation that doesn't map well to SIMD's parallel nature.
> >
> > Trying to handle this stateful logic with SIMD would lead to
> > significant implementation complexity, especially with edge cases like
> > an escape character falling on the last byte of a chunk.
>
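
For readers following along, the context dependence described above can
be sketched as a small scalar state machine. This is a simplified
illustration modeled on the in_quote / last_was_esc toggling discussed in
this thread, not the patch itself; the function name csv_line_end is
hypothetical, and it assumes the input has no NUL bytes:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index of the first line terminator that is not inside a
 * quoted field, or len if there is none.
 */
static size_t
csv_line_end(const char *buf, size_t len, char quotec, char escapec)
{
    bool        in_quote = false;
    bool        last_was_esc = false;

    /*
     * Assumption: when quote and escape are the same character, the
     * quote is treated as a plain toggle, so "" leaves us in the quote.
     */
    if (quotec == escapec)
        escapec = '\0';

    for (size_t i = 0; i < len; i++)
    {
        char        c = buf[i];

        /* each decision depends on state left by the previous byte */
        if (in_quote && c == escapec)
            last_was_esc = !last_was_esc;
        if (c == quotec && !last_was_esc)
            in_quote = !in_quote;
        if (c != escapec)
            last_was_esc = false;

        /* newlines inside quotes are data, not line terminators */
        if (!in_quote && (c == '\n' || c == '\r'))
            return i;
    }
    return len;
}
```

Every byte's meaning depends on in_quote and last_was_esc as left by the
byte before it, which is why a chunk that contains a quote or escape
cannot simply be classified in parallel.
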
> Ah, you're right. My apologies, I misunderstood the implementation. It
> appears that SIMD can be used even within quoted strings.
>
> I think it would be better not to use the SIMD path when last_was_esc
> is true. The next character is likely to be a special character, and
> handling this case outside the SIMD loop would also improve
> readability by consolidating the last_was_esc toggle logic in one
> place.
>
> Furthermore, when inside a quote (in_quote) in CSV mode, the detection
> of \n and \r can be disabled.
>
> +               last_was_esc = false;
>
> Regarding the implementation, I believe we must set last_was_esc to
> false when advancing input_buf_ptr, as shown in the code below. For
> this reason, I think it's best to keep the current logic for toggling
> last_was_esc.
>
> +               int advance = pg_rightmost_one_pos32(mask);
> +               input_buf_ptr += advance;
>
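
A minimal, portable sketch of that advance step, under my own
assumptions: special_mask32 is a hypothetical scalar stand-in for the
vector compare plus movemask, rightmost_one_pos32 is a local stand-in
for pg_rightmost_one_pos32 (position of the lowest set bit of a nonzero
word), and skip_plain_prefix is a made-up wrapper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit i is set when chunk[i] is a quote, escape, '\n' or '\r'. */
static uint32_t
special_mask32(const char *chunk, size_t n, char quotec, char escapec)
{
    uint32_t    mask = 0;

    for (size_t i = 0; i < n && i < 32; i++)
    {
        char        c = chunk[i];

        if (c == quotec || c == escapec || c == '\n' || c == '\r')
            mask |= (uint32_t) 1 << i;
    }
    return mask;
}

/* Position of the lowest set bit; word must be nonzero. */
static int
rightmost_one_pos32(uint32_t word)
{
    int         pos = 0;

    while ((word & 1) == 0)
    {
        word >>= 1;
        pos++;
    }
    return pos;
}

/*
 * Return how many leading ordinary bytes can be skipped in this chunk
 * (at most 32). If any byte is skipped, no escape can still be pending,
 * so last_was_esc is cleared, matching what the scalar loop would do
 * for each ordinary byte.
 */
static size_t
skip_plain_prefix(const char *chunk, size_t n, char quotec, char escapec,
                  int *last_was_esc)
{
    uint32_t    mask = special_mask32(chunk, n, quotec, escapec);
    size_t      advance;

    if (mask == 0)
        advance = (n < 32) ? n : 32;    /* whole chunk is ordinary */
    else
        advance = (size_t) rightmost_one_pos32(mask);

    if (advance > 0)
        *last_was_esc = 0;
    return advance;
}
```

When the first byte of the chunk is itself special, advance is 0 and
last_was_esc is left alone, so the scalar code sees the same state it
would have seen without the skip.
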
> I've attached a new patch that includes these changes. Further
> modifications are still in progress.
>
> --
> Best regards,
> Shinya Kato
> NTT OSS Center
>


--000000000000056b3b063c49fbbb--