MIME-Version: 1.0
References: 
 <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
 <CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
 <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net>
 <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
 <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <aPZrg6lxb5bgy_px@nathan>
 <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net>
 <CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com>
 <aPfTiX0HwV42R6Od@nathan>
 <CAN55FZ0AYP4ZEczBJ5ur-=9QuEhMysH9Yfrq5srr0ZakK1M0FA@mail.gmail.com>
 <aPkvi5P7kpA8oQKc@nathan> <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net>
In-Reply-To: <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net>
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Tue, 11 Nov 2025 16:23:20 -0600
Message-ID: 
 <CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Andrew Dunstan <andrew@dunslane.net>
Cc: Nathan Bossart <nathandbossart@gmail.com>,
 Nazir Bilal Yavuz <byavuz81@gmail.com>,
	KAZAR Ayoub <ma_kazar@esi.dz>, Shinya Kato <shinya11.kato@gmail.com>,
 pgsql-hackers@postgresql.org
Content-Type: multipart/alternative; boundary="0000000000000cf6d3064359193e"
Archived-At: 
 <https://www.postgresql.org/message-id/CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ%40mail.gmail.com>
Precedence: bulk

--0000000000000cf6d3064359193e
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, Oct 29, 2025 at 5:23 PM Andrew Dunstan <andrew@dunslane.net> wrote:

>
> On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
> > On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
> >> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>> I wonder if we could mitigate the regression further by spacing out the
> >>> checks a bit more.  It could be worth comparing a variety of values to
> >>> identify what works best with the test data.
> >> Do you mean that instead of doubling the SIMD sleep, we should
> >> multiply it by 3 (or another factor)? Or are you referring to
> >> increasing the maximum sleep from 1024? Or possibly both?
> > I'm not sure of the precise details, but the main thrust of my suggestion
> > is to assume that whatever sampling you do to determine whether to use SIMD
> > is good for a larger chunk of data.  That is, if you are sampling 1K lines
> > and then using the result to choose whether to use SIMD for the next 100K
> > lines, we could instead bump the latter number to 1M lines (or something).
> > That way we minimize the regression for relatively uniform data sets while
> > retaining some ability to adapt in case things change halfway through a
> > large table.
> >
>
>
> I'd be ok with numbers like this, although I suspect the numbers of
> cases where we see shape shifts like this in the middle of a data set
> would be vanishingly small.
>
>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>
>
>
Hello!

I wanted to reproduce the results using the files attached by Shinya Kato
and Ayoub Kazar. I installed one postgres compiled from master, and a
second postgres built from master with Nazir Bilal Yavuz's v3 patches
applied.

The master+v3patches postgres naturally performed better at copying into
the database: anywhere from 11% better for the t.csv file produced by
Shinya's test.sql, to 35% better for the t_4096_none.csv file
created by Ayoub Kazar's simd-copy-from-bench.sql.

But here's where it gets weird. The two files created by Ayoub Kazar's
simd-copy-from-bench.sql that are supposed to be slower with the patches,
t_4096_escape.txt and t_4096_quote.csv, actually ran faster on my machine,
by 11% and 5% respectively.

This seems impossible.

A few things I should note:

I timed the commands using the Unix time command, like so:

time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'

For each file, I timed the copy 6 times and took the average.
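
A minimal sketch of that procedure in Python, for anyone who wants to
repeat it. The helper names time_command and summarize are hypothetical,
and the psql invocation in the trailing comment is simply the one shown
above; adjust the user, host, and path to your own setup. It reports the
minimum and standard deviation alongside the mean, since on a busy
machine the mean alone can hide noisy runs.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=6):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a failed run is never timed.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return the mean, minimum, and sample standard deviation in seconds."""
    return {
        "mean": statistics.mean(timings),
        "min": min(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }


# Example use (commented out because it needs a running server):
# cmd = ["psql", "-X", "-U", "mwood", "-h", "localhost", "-d", "postgres",
#        "-c", r"\copy t from /tmp/t_4096_escape.txt"]
# print(summarize(time_command(cmd, runs=6)))
```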

This was done on my work Linux machine while also running Chrome and an
OpenOffice spreadsheet, not on a dedicated machine running only postgres.

All of the copies took between 2 seconds (Ayoub Kazar's t_4096_none.csv
copied into postgres compiled from master plus Nazir's v3 patches) and
4.5 seconds (Shinya's t.csv copied into postgres compiled from master).
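
With runs this short, it may be worth checking each percentage against
the run-to-run spread. The sketch below assumes "N% better" means N%
less mean wall-clock time; the function names are hypothetical and the
timings are made up for illustration, not measurements from this test.
The noise check is deliberately crude (gap between means versus the sum
of the two standard deviations), not a proper significance test.

```python
import statistics


def percent_faster(baseline, patched):
    """Percent reduction in mean time of `patched` relative to `baseline`."""
    base_mean = statistics.mean(baseline)
    return 100.0 * (base_mean - statistics.mean(patched)) / base_mean


def exceeds_noise(baseline, patched):
    """Crude check: is the gap between the means larger than the combined spread?"""
    gap = abs(statistics.mean(baseline) - statistics.mean(patched))
    spread = statistics.stdev(baseline) + statistics.stdev(patched)
    return gap > spread


# Made-up timings in seconds, for illustration only (not real results).
master = [4.0, 4.2, 3.8, 4.1, 3.9, 4.0]
patched = [3.6, 3.5, 3.7, 3.6, 3.4, 3.8]
print(f"{percent_faster(master, patched):.1f}% faster")
print("gap exceeds run-to-run spread:", exceeds_noise(master, patched))
```

If the gap does not exceed the spread, the reported percentage is
probably within the noise of a non-dedicated machine.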

Perhaps I need to fiddle with the provided SQL to produce larger files to
get longer run times? Maybe sub-second differences won't tell as
interesting a story as minutes-long copy commands?

Thanks for reading this.
--=20
Manni Wood
EDB: https://www.enterprisedb.com

--0000000000000cf6d3064359193e--