MIME-Version: 1.0
References: <CAKWEB6rLxPVtN4ffZ3CMTL518zhk_BWzzBt6ZE2oUSaErdphxA@mail.gmail.com>
 <CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com>
 <CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com>
 <CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
 <CAKWEB6oZdQhhBV3ojHLBwjQgKzfDw0fkqncurt9oi7vNsq41ww@mail.gmail.com>
 <CAN55FZ1p5UyUdTRO7iWR_ukjhJDOnpOR2rYNOq=+hcC45OuahQ@mail.gmail.com>
 <CAOW5sYZEx=fPw2wp7y2nK_-ifXFeYW4CTmFx_OQeoHFjG7rbHw@mail.gmail.com>
 <CA+K2Ru=C_woAnd-3-pGHoNSTR8FOf=7eeSWE1xaLt9ojVWndVg@mail.gmail.com>
 <CAN55FZ0FRB2OD6-oEESLvgUT4bLZQVD72pAqUqzdw7Rx5cN0ig@mail.gmail.com>
 <CA+K2Run1VdLnmp-5_Qv2Fax0KgT7LLJMH-uzjaaf-NZD1oU-=w@mail.gmail.com>
 <aYZdKSTw6N3khsVE@nathan> <CAN55FZ2DOeLjSXE2Jos99bgHG-Zeo3KjStrSgoA8Rf=2Mu+hFA@mail.gmail.com>
In-Reply-To: <CAN55FZ2DOeLjSXE2Jos99bgHG-Zeo3KjStrSgoA8Rf=2Mu+hFA@mail.gmail.com>
From: KAZAR Ayoub <ma_kazar@esi.dz>
Date: Fri, 6 Feb 2026 23:36:13 +0100
Message-ID: <CA+K2RunXPYZ+xz8OSkUa6LVjdbLYX=mEvkGR6mmqHXEQgMd1DA@mail.gmail.com>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nazir Bilal Yavuz <byavuz81@gmail.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>, Neil Conway <neil.conway@gmail.com>, 
	Manni Wood <manni.wood@enterprisedb.com>, Andrew Dunstan <andrew@dunslane.net>, 
	Shinya Kato <shinya11.kato@gmail.com>, 
	PostgreSQL-development <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="00000000000059faae064a2f6be3"
Archived-At: <https://www.postgresql.org/message-id/CA%2BK2RunXPYZ%2Bxz8OSkUa6LVjdbLYX%3DmEvkGR6mmqHXEQgMd1DA%40mail.gmail.com>
Precedence: bulk

--00000000000059faae064a2f6be3
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hello,

On Fri, Feb 6, 2026 at 11:19 PM Nazir Bilal Yavuz <byavuz81@gmail.com>
wrote:

> Hi,
>
> Thank you for sharing your thoughts!
>
> On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <nathandbossart@gmail.com>
> wrote:
> >
> > It looks like a lot of energy has been put into benchmarking and
> > refining the heuristic for deciding when to use the SIMD path so that
> > we avoid large regressions when there are special characters.  I think
> > this is all valuable work, but I'm a bit concerned that we are putting
> > the cart before the horse.  IMHO it would be better to first get the
> > SIMD code committed with the absolute simplest heuristic we can think
> > of (e.g., as soon as we see a special character, switch to the scalar
> > path for the remainder of COPY FROM).  My hope is that would be far
> > easier to reason about from a performance angle.  If we immediately
> > fall back to the existing code path, we don't need to worry about how
> > many special characters there are and whether they are sparse or
> > clustered or whatever.  We just need to measure the overhead of the
> > new branches and ensure they don't produce meaningful regressions.
> > Assuming that all looks good, we can then focus on the SIMD code
> > itself and make sure that is correct and optimal.  And once we get
> > that portion committed, we could then consider more sophisticated
> > heuristics.
>
I agree with this too, especially regarding the line_buf refilling idea:
finding a good threshold value for it will take more time than the work
on the heuristic.

>
> I have three possible approaches in my mind, they are actually similar
> to each other.
>
> 1- After encountering a special character, disable SIMD for the rest
> of the current line and also for the rest of the data.
>
> 2- It is a mixed version of the current heuristic and #1. After
> encountering a special character, skip SIMD for the current line (let's
> say line 1) and for the next line (line 2). Then try running SIMD for
> the next line (line 3), if there is no special character continue to
> run SIMD but if there is a special character then skip running SIMD
> for two lines this time. And it goes like that, every time a special
> character is encountered in the SIMD run, skipped SIMD lines are
> doubled.
>
> 3- This version is a bit different from #2. Instead of calculating the
> number of lines to skip dynamically, skip the constant N number of
> lines and then try to run SIMD again after these lines. N could be
> something like 100, 1000, or 10000, etc. Actually, you and Andrew
> suggested this approach before [1].
>
> I think what you suggested is closer to #1 or #3. I just wanted to
> hear your opinions, and whether you think any of these approaches are
> good to implement / work on.
>
For v19, #1 seems like wasted potential. #3 sounds more relaxed than
v4.2, so it has good potential; I can fully benchmark it against v3 as
soon as you send a patch for it.
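
To check that I read the three approaches the same way, here is a rough
sketch of the bookkeeping as I understand it. This is not code from any
patch version: every name below (SimdSkipState, simd_skip_begin_line, and
so on) is hypothetical, and for #2 I am assuming the skip window keeps
doubling and is never reset after a clean line, since that was not spelled
out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not taken from the patch. */
typedef enum
{
    SIMD_SKIP_FOREVER,      /* approach #1 */
    SIMD_SKIP_DOUBLING,     /* approach #2 */
    SIMD_SKIP_CONSTANT      /* approach #3 */
} SimdSkipPolicy;

typedef struct
{
    SimdSkipPolicy policy;
    bool        disabled;       /* #1: SIMD is off for the rest of the data */
    uint64_t    lines_to_skip;  /* upcoming lines that must use scalar code */
    uint64_t    next_skip;      /* #2: size of the next skip window */
    uint64_t    constant_n;     /* #3: fixed skip window N */
} SimdSkipState;

static void
simd_skip_init(SimdSkipState *s, SimdSkipPolicy policy, uint64_t constant_n)
{
    s->policy = policy;
    s->disabled = false;
    s->lines_to_skip = 0;
    s->next_skip = 1;           /* first hit skips one following line */
    s->constant_n = constant_n;
}

/* Call at the start of each line; returns true if SIMD should be tried. */
static bool
simd_skip_begin_line(SimdSkipState *s)
{
    if (s->disabled)
        return false;
    if (s->lines_to_skip > 0)
    {
        s->lines_to_skip--;
        return false;
    }
    return true;
}

/*
 * Call when the SIMD run on the current line hits a special character.
 * The caller is assumed to finish the current line with the scalar path;
 * this only decides what happens to the lines that follow.
 */
static void
simd_skip_on_special(SimdSkipState *s)
{
    switch (s->policy)
    {
        case SIMD_SKIP_FOREVER:
            s->disabled = true;
            break;
        case SIMD_SKIP_DOUBLING:
            s->lines_to_skip = s->next_skip;
            /* double the window, saturating instead of overflowing */
            if (s->next_skip <= UINT64_MAX / 2)
                s->next_skip *= 2;
            break;
        case SIMD_SKIP_CONSTANT:
            s->lines_to_skip = s->constant_n;
            break;
    }
}
```

With this shape, #1 and #3 differ only in whether SIMD ever gets another
chance, which is why I expect #3 to keep most of the benefit on data where
special characters show up only occasionally. That expectation still needs
to be confirmed by the benchmark.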


Regards,
Ayoub


--00000000000059faae064a2f6be3--