MIME-Version: 1.0
References: 
 <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
 <CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
 <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net>
 <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
 <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <aPZrg6lxb5bgy_px@nathan>
 <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net>
 <CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com>
 <aPfTiX0HwV42R6Od@nathan>
 <CAN55FZ0AYP4ZEczBJ5ur-=9QuEhMysH9Yfrq5srr0ZakK1M0FA@mail.gmail.com>
 <aPkvi5P7kpA8oQKc@nathan> <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net>
 <CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
 <CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
In-Reply-To: 
 <CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Wed, 12 Nov 2025 20:40:35 -0600
Message-ID: 
 <CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: KAZAR Ayoub <ma_kazar@esi.dz>
Cc: Andrew Dunstan <andrew@dunslane.net>,
 Nathan Bossart <nathandbossart@gmail.com>,
	Nazir Bilal Yavuz <byavuz81@gmail.com>,
 Shinya Kato <shinya11.kato@gmail.com>,
	pgsql-hackers@postgresql.org
Content-Type: multipart/alternative; boundary="000000000000fc9557064370ceed"
Archived-At: 
 <https://www.postgresql.org/message-id/CAKWEB6pev%3DpNVi4qDYWS50N%3DYFrKRbjH1h%3D5F1bXpnK7WR5CYg%40mail.gmail.com>
Precedence: bulk

--000000000000fc9557064370ceed
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:

> On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com>
> wrote:
>
>> Hello!
>>
>> I wanted to reproduce the results using files attached by Shinya Kato and
>> Ayoub Kazar. I installed a postgres compiled from master, and then I
>> installed a postgres built from master with Nazir Bilal Yavuz's v3 patches
>> applied.
>>
>> The master+v3patches postgres naturally performed better on copying into
>> the database: anywhere from 11% better for the t.csv file produced by
>> Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
>> created by Ayoub Kazar's simd-copy-from-bench.sql.
>>
>> But here's where it gets weird. The two files created by Ayoub Kazar's
>> simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt
>> and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
>> respectively.
>>
>> This seems impossible.
>>
>> A few things I should note:
>>
>> I timed the commands using the Unix time command, like so:
>>
>> time psql -X -U mwood -h localhost -d postgres -c '\copy t from
>> /tmp/t_4096_escape.txt'
>>
>> For each file, I timed the copy 6 times and took the average.
>>
>> This was done on my work Linux machine while also running Chrome and an
>> Open Office spreadsheet; not a dedicated machine only running postgres.
>>
> Hello,
> I think if you run a perf benchmark (if it still reproduces), it would
> probably be possible to explain why it's performing like that by looking at
> the CPI and other metrics, and to compare it to my findings.
> What I also suggest is to make the data even closer to the worst case,
> i.e. more special characters where it hurts the switching between SIMD
> and scalar processing (in the simd-copy-from-bench.sql file); if it still
> does a good job, then there's something to look at.
>
>>
>>
>
>> All of the copy results took between 4.5 seconds (Shinya's t.csv copied
>> into postgres compiled from master) to 2 seconds (Ayoub
>> Kazar's t_4096_none.csv copied into postgres compiled from master plus
>> Nazir's v3 patches).
>>
>> Perhaps I need to fiddle with the provided SQL to produce larger files to
>> get longer run times? Maybe sub-second differences won't tell as
>> interesting a story as minutes-long copy commands?
>>
> I did try it on some GBs (around 2-5GB only); the differences were not
> that large. If you can run this on more data (at least 10GB), it would be
> good to look at, although I don't expect anything interesting, since the
> shape of the data is the same for the whole COPY.
>
>>
>> Thanks for reading this.
>> --
>> -- Manni Wood EDB: https://www.enterprisedb.com
>>
> Thanks for the info.
>
>
> Regards,
> Ayoub Kazar.
>

Hello again!

It looks like using 10 times the data removed the apparent speedup in the
SIMD code when it has to deal with t_4096_escape.txt and t_4096_quote.csv.
With 1,000,000 lines in each file, postgres master+v3patch imports them
0.63% slower and 0.54% slower, respectively.
For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup.
For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.

I got these numbers via simple timing, though this time I used psql's
\timing feature. I left psql running rather than launching it each time, as
I did when I used the Unix "time" command. I ran the copy command 5 times
for each file and averaged the results. Again, this was on a Linux machine
that was also running Chrome and an Open Office spreadsheet.
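
For anyone who wants to repeat the arithmetic, here is a minimal sketch of
how the percentages can be derived from per-run timings. It assumes the
percentage is the change in mean elapsed time relative to master (my reading
of "slower"/"speedup" here, not something the patch defines), and the
function name percent_change is only illustrative. The numbers in the
example call are placeholders, not my measurements.

```python
from statistics import mean


def percent_change(baseline_ms, patched_ms):
    """Return the change in mean elapsed time, as a percentage of baseline.

    Positive means the patched build was slower than the baseline;
    negative means it was faster.
    """
    base = mean(baseline_ms)
    return (mean(patched_ms) - base) / base * 100.0


if __name__ == "__main__":
    # Placeholder timings in milliseconds (NOT measured values).
    master_runs = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
    patched_runs = [700.0, 705.0, 695.0, 702.0, 698.0]
    print(f"{percent_change(master_runs, patched_runs):+.2f}%")
```

With only 5 runs on a busy desktop, differences well under 1% are likely
within the noise, so the min/max of each set is worth reporting alongside
the mean.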

I should probably try to construct some .txt or .csv files that would trip
up the SIMD on/off heuristic in the v3 patch.
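
One way such a file might be generated is sketched below. This is only a
guess at what "tripping up" the heuristic could look like: it alternates
blocks of plain rows with blocks of quote-heavy rows, on the assumption that
frequent changes in data shape are the hard case. The single text column,
the 4096-character line length, the block size, the quote density, the
output path, and the function name write_mixed_csv are all my assumptions,
not taken from the patch or from simd-copy-from-bench.sql.

```python
import csv
import random


def write_mixed_csv(out, rows, line_len=4096, block=1000, seed=0):
    """Write single-column CSV rows whose shape flips every `block` rows.

    Even-numbered blocks contain no special characters; odd-numbered
    blocks have double quotes scattered through one eighth of the field.
    """
    rng = random.Random(seed)
    writer = csv.writer(out, lineterminator="\n")
    for i in range(rows):
        if (i // block) % 2 == 0:
            field = "a" * line_len  # plain row: nothing to escape
        else:
            chars = ["a"] * line_len
            # Replace randomly chosen positions with a quote character.
            for pos in rng.sample(range(line_len), line_len // 8):
                chars[pos] = '"'
            field = "".join(chars)
        # csv.writer quotes the field and doubles embedded quotes.
        writer.writerow([field])


if __name__ == "__main__":
    # Hypothetical output path; adjust the row count to taste.
    with open("/tmp/t_mixed.csv", "w", newline="") as f:
        write_mixed_csv(f, rows=10_000)
```

Shrinking `block` makes the shape change more often, which is probably the
knob to turn when looking for the worst case.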

If data "in the wild" tend to be roughly the same "shape" from row to row,
as Andrew's experience has shown, I imagine these million row results bode
well for the v3 patch...
-- 
-- Manni Wood EDB: https://www.enterprisedb.com

--000000000000fc9557064370ceed--